Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-11-19 Thread Rahila Syed
Hello Michael,

I am planning to continue contributing to this feature in any way, be it by
reviewing the patch or making one. I haven't been able to reply to the
comments or post an updated patch lately, but I plan to do that soon.

Thank you,
Rahila

On Thu, Nov 19, 2015 at 1:09 PM, Michael Paquier 
wrote:

> On Thu, Nov 19, 2015 at 4:30 PM, Amit Langote
>  wrote:
> > On 2015/11/19 16:18, Amit Langote wrote:
> >> I'm marking this as "Waiting on author" in the commitfest app. Also,
> let's
> >> hear from more people.
> >
> > Well, it seems Michael beat me to it.
>
> Yeah, some other folks provided feedback; that's why I did it.
> @Rahila: are you planning more work on this patch? It seems that at
> this point we're waiting for you. If not, then since I have waited a
> couple of times for long VACUUM jobs to finish on some relations
> without having much information about their progress, I would be fine
> with spending time hacking on this myself. That's up to you.
> Regards,
> --
> Michael
>
>


Re: [HACKERS] proposal: LISTEN *

2015-11-19 Thread Marko Tiikkaja

On 11/19/15 4:48 AM, Pavel Stehule wrote:

2015-11-19 4:40 GMT+01:00 Marko Tiikkaja :

In the past I've wanted to listen on all notification channels in the
current database for debugging purposes.  But recently I came across
another use case.  Since having multiple postgres backends listening for
notifications is very inefficient (one Thursday I found out 30% of our CPU
time was spent spinning on s_locks around the notification code), it makes
sense to implement a notification server of sorts which then passes on
notifications from postgres to interested clients.  A server like this
would be a lot simpler to implement if there were a way to announce that
the client wants to see all notifications, regardless of the name of the
channel.

Attached is a very crude proof-of-concept patch implementing this.  Any
thoughts on the idea?



It looks much more like a workaround. The proposed feature isn't bad, but
optimizing the current implementation would be better.

Isn't it possible to fix the internal implementation?


It's probably possible to improve the internal implementation.  But the 
way I see it, it's always going to be less efficient than a dedicated 
process outside the system (or maybe as a background worker?) handing 
out notifications, so I don't see any point in spending my time on that.
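For illustration, here is a minimal sketch of how such a relay might use the
proposed syntax; note that LISTEN * is what the attached patch proposes, not
existing syntax, and the channel names and payloads are invented:

-- Relay connection: subscribe to every channel (proposed syntax).
LISTEN *;

-- Ordinary clients keep notifying named channels, unchanged:
NOTIFY orders_changed, 'order 42 shipped';
NOTIFY cache_invalidation, 'users';

-- The single relay connection receives both notifications and fans them out
-- to its own clients, instead of every interested client holding its own
-- listening backend.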



.m




Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-19 Thread Amit Kapila
On Thu, Nov 19, 2015 at 12:27 AM, Robert Haas  wrote:
>
> On Wed, Nov 18, 2015 at 7:25 AM, Amit Kapila 
wrote:
> > Don't we need the startup cost in case we need to build partial paths for
> > joinpaths like mergepath?
> > Also, I think there are other cases for single relation scans where
> > startup cost can matter, like when there are pseudoconstants in the
> > qualification (refer cost_qual_eval_walker()) or, let us say, if someone
> > has disabled seq scan (disable_cost is considered as startup cost).
>
> I'm not saying that we don't need to compute it.  I'm saying we don't
> need to take it into consideration when deciding which paths have
> merit.   Note that consider_startup is set this way:
>
> rel->consider_startup = (root->tuple_fraction > 0);
>

Even when consider_startup is false, startup_cost is still used in the cost
calculations. Ignoring that may be okay for partial paths, but it still seems
worth thinking about why leaving it out is okay for partial paths even though
it is used in add_path().

+ *  We don't generate parameterized partial paths because they seem unlikely
+ *  ever to be worthwhile.  The only way we could ever use such a path is
+ *  by executing a nested loop with a complete path on the outer side - thus,
+ *  each worker would scan the entire outer relation - and the partial path
+ *  on the inner side - thus, each worker would scan only part of the inner
+ *  relation.  This is silly: a parameterized path is generally going to be
+ *  based on an index scan, and we can't generate a partial path for that.

Won't it be useful to consider parameterized paths for the below kind of
plan, where we can push the join tree to a worker so that each worker can
scan the complete outer relation A and the rest of the work is divided
among the workers (of course there can be other ways to parallelize such
joins, but the way described also seems possible)?

NestLoop
-> Seq Scan on A
Hash Join
Join Condition: B.Y = C.W
-> Seq Scan on B
-> Index Scan using C_Z_IDX on C
Index Condition: C.Z = A.X

-
Is the main reason to have add_partial_path() that it performs fewer
checks, or is it that the current add_path() would give wrong answers
in some case?

If there is no case where add_path() can't work, then there is some
advantage in retaining add_path(), at least in terms of maintaining
the code.

+void
+add_partial_path(RelOptInfo *parent_rel, Path *new_path)
{
..
+ /* Unless pathkeys are incompable, keep just one of the two paths. */
..

typo - 'incompable'


> > A.
> > This means that for inheritance child relations for which rel pages are
> > less than parallel_threshold, it will always consider the cost shared
> > between 1 worker and the leader, as per the below calc in cost_seqscan:
> > if (path->parallel_degree > 0)
> > run_cost = run_cost / (path->parallel_degree + 0.5);
> >
> > I think this might not be the appropriate cost model even for
> > non-inheritance relations which have more pages than parallel_threshold,
> > but it seems to be even worse for inheritance children which have
> > fewer pages than parallel_threshold
>
> Why?

Because I think the way the code is written, it assumes that for each
inheritance-child relation which has fewer pages than the threshold, half
the work will be done by the master backend, which doesn't seem to be the
right distribution.  Consider a case where there are three such children,
each having cost 100 to scan; it will now cost them as
100/1.5 + 100/1.5 + 100/1.5, which means that for each child it is
counting 0.5 of the master backend's work, which seems to be wrong.
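Spelling out the worked numbers for that example (three children, each with
cost 100, parallel_degree = 1), using the cost_seqscan formula quoted above:

    \frac{100}{1 + 0.5} \approx 66.7 \quad\text{per child}, \qquad
    3 \times 66.7 \approx 200 \quad\text{for the whole Append}

That is, the leader's 0.5-worker share is credited once per child
(3 x 0.5 = 1.5 "leaders") rather than once for the whole Append.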

I think for the Append case, we should consider this cost during Append path
creation in create_append_path().  Basically, we can make cost_seqscan
ignore the cost reduction due to parallel_degree for inheritance relations,
and then during Append path creation we can consider it and also count the
master backend's work unit as 0.5 with respect to the overall work.

-
--- a/src/backend/optimizer/README
+++ b/src/backend/optimizer/README
+plan as possible.  Expanding the range of cases in which more work can be
+pushed below the Gather (and costly them accurately) is likely to keep us
+busy for a long time to come.

Seems there is a typo in the above text:
/costly/cost

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] warning: HS_KEY redefined (9.5 beta2)

2015-11-19 Thread Erik Rijkers

On 2015-11-19 01:55, Mike Blackwell wrote:

Google says this was present in beta1. (
http://www.postgresql.org/message-id/5596a162.30...@dunslane.net)

Still seems broken, at least for me.

Built with Perl 5.22.
uname -m = x86_64
uname -r = 2.6.32-504.12.2.el6.x86_64



FWIW: here, CentOS 6.6 with gcc 4.4.7 compiles more or less silently 
(emitting only the usual scan.c warning).


The same system with gcc 5.2.0 yields a number of warnings:


./configure \
  --prefix=/home/aardvark/pg_stuff/pg_installations/pgsql.REL9_5_STABLE \
  --bindir=/home/aardvark/pg_stuff/pg_installations/pgsql.REL9_5_STABLE/bin \
  --libdir=/home/aardvark/pg_stuff/pg_installations/pgsql.REL9_5_STABLE/lib \
  --with-pgport=6545 --quiet --enable-depend --enable-cassert --enable-debug \
  --with-extra-version=_REL9_5_STABLE_20151119_0914_f11c557e92c5 --with-openssl \
  --with-perl --with-libxml --with-libxslt --with-zlib --enable-tap-tests



make core: make --quiet -j 8
In file included from gram.y:14861:0:
scan.c: In function ‘yy_try_NUL_trans’:
scan.c:10307:23: warning: unused variable ‘yyg’ [-Wunused-variable]
 struct yyguts_t * yyg = (struct yyguts_t*)yyscanner; /* This var may be unused depending upon options. */
   ^
In file included from ../../../src/include/postgres.h:48:0,
 from plperl.c:8:
plperl.c: In function ‘select_perl_context’:
../../../src/include/utils/elog.h:41:16: warning: passing argument 1 of ‘errmsg’ makes pointer from integer without a cast [-Wint-conversion]
 #define ERROR  20   /* user error - abort transaction; return to
^
plperl.c:643:10: note: in expansion of macro ‘ERROR’
   errmsg(ERROR,
  ^
In file included from ../../../src/include/postgres.h:48:0,
 from plperl.c:8:
../../../src/include/utils/elog.h:146:12: note: expected ‘const char *’ but argument is of type ‘int’
 extern int errmsg(const char *fmt,...) pg_attribute_printf(1, 2);
^
All of PostgreSQL successfully made. Ready to install.


make contrib:
In file included from hstore_plperl.c:6:0:
../../contrib/hstore/hstore.h:79:0: warning: "HS_KEY" redefined
 #define HS_KEY(arr_,str_,i_) ((str_) + HSE_OFF((arr_)[2*(i_)]))
 ^
In file included from /opt/perl-5.22/lib/5.22.0/x86_64-linux/CORE/perl.h:3730:0,
 from ../../src/pl/plperl/plperl.h:48,
 from hstore_plperl.c:4:
/opt/perl-5.22/lib/5.22.0/x86_64-linux/CORE/util.h:226:0: note: this is the location of the previous definition
 #  define HS_KEY(setxsubfn, popmark, apiver, xsver) \
 ^

make check:  All 157 tests passed.



Erik Rijkers





[HACKERS] Selective logical replication

2015-11-19 Thread konstantin knizhnik
Hi,

I want to use logical replication for implementing multimaster (so all nodes
are both sending and receiving changes).
But there is one "stupid" problem: how to prevent infinite recursion and avoid
re-replicating replicated data.
I.e. a node receives a change set from some other node, applies it within some
transaction, it is written to the log and ... is replicated to the other nodes.
In my experiments with receiver_raw/decoder_raw the infinite recursion actually
doesn't happen, because of unique constraint violations.
But that is neither a nice nor an efficient way of stopping the infinite recursion.

I wonder if there is some better way to prevent a particular transaction
from being replicated?

Thanks in advance,
Konstantin





Re: [HACKERS] Selective logical replication

2015-11-19 Thread Amit Langote
On 2015/11/19 17:48, konstantin knizhnik wrote:
> Hi,
> 
> I want to use logical replication for implementing multimaster (so all nodes
> are both sending and receiving changes).
> But there is one "stupid" problem: how to prevent infinite recursion and avoid
> re-replicating replicated data.
> I.e. a node receives a change set from some other node, applies it within some
> transaction, it is written to the log and ... is replicated to the other nodes.
> In my experiments with receiver_raw/decoder_raw the infinite recursion actually
> doesn't happen, because of unique constraint violations.
> But that is neither a nice nor an efficient way of stopping the infinite recursion.
> 
> I wonder if there is some better way to prevent a particular transaction
> from being replicated?

I wonder if the stuff in src/backend/replication/logical/origin.c could help
with the kind of problems you mention. The file header comment has the
following text in it:

*
* This infrastructure is intended to be used in cooperation with logical
* decoding. When replaying from a remote system the configured origin is
* provided to output plugins, allowing prevention of replication loops and
* other filtering.
*

Thanks,
Amit






[HACKERS] Typo in file header comment of replorigindesc.c

2015-11-19 Thread Amit Langote
Attached fixes $SUBJECT:

s/replication_origin.c/origin.c/g

Thanks,
Amit
diff --git a/src/backend/access/rmgrdesc/replorigindesc.c b/src/backend/access/rmgrdesc/replorigindesc.c
index 60cf0f6..479e3b4 100644
--- a/src/backend/access/rmgrdesc/replorigindesc.c
+++ b/src/backend/access/rmgrdesc/replorigindesc.c
@@ -1,7 +1,7 @@
 /*-
  *
  * replorigindesc.c
- *	  rmgr descriptor routines for replication/logical/replication_origin.c
+ *	  rmgr descriptor routines for replication/logical/origin.c
  *
  * Portions Copyright (c) 2015, PostgreSQL Global Development Group
  *



Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-19 Thread Amit Kapila
On Thu, Nov 19, 2015 at 1:29 PM, Amit Kapila 
wrote:
>
> On Thu, Nov 19, 2015 at 12:27 AM, Robert Haas 
wrote:
> -
> Is the main reason to have add_partial_path() that it performs fewer
> checks, or is it that the current add_path() would give wrong answers
> in some case?
>
> If there is no case where add_path() can't work, then there is some
> advantage in retaining add_path(), at least in terms of maintaining
> the code.

To be specific, I was referring to the logic of add_path().


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Selective logical replication

2015-11-19 Thread Michael Paquier
On Thu, Nov 19, 2015 at 5:48 PM, konstantin knizhnik
 wrote:
> I want to use logical replication for implementing multimaster (so all nodes
> are both sending and receiving changes).
> But there is one "stupid" problem: how to prevent infinite recursion and avoid
> re-replicating replicated data.
> I.e. a node receives a change set from some other node, applies it within some
> transaction, it is written to the log and ... is replicated to the other nodes.
> In my experiments with receiver_raw/decoder_raw the infinite recursion actually
> doesn't happen, because of unique constraint violations.
> But that is neither a nice nor an efficient way of stopping the infinite recursion.
>
> I wonder if there is some better way to prevent a particular transaction
> from being replicated?

What you are looking for to prevent this infinite recursion is the
concept of replication origins:
http://www.postgresql.org/docs/devel/static/replication-origins.html

Regarding receiver_raw/decoder_raw, I never got around to using them to
prevent the problem you are seeing :) But patches are welcome.
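For concreteness, a minimal sketch of how the apply side could use
replication origins (the node name is invented and error handling omitted):

-- On the receiving node, tag everything the apply process writes with the
-- origin it came from, so decoding can recognize and filter it later.
SELECT pg_replication_origin_create('node_b');         -- once per remote node
SELECT pg_replication_origin_session_setup('node_b');  -- in the apply session

BEGIN;
-- ... apply the received change set here ...
COMMIT;   -- WAL for this transaction now carries the 'node_b' origin

SELECT pg_replication_origin_session_reset();

-- An output plugin can then skip changes whose origin is not local (for
-- example via its filter_by_origin_cb callback), which breaks the loop.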
Regards,
-- 
Michael




Re: [HACKERS] Re: [COMMITTERS] pgsql: Add --synchronous option to pg_receivexlog, for more reliable WA

2015-11-19 Thread Peter Eisentraut
On 9/15/15 11:04 PM, Fujii Masao wrote:
> If the --slot option is specified, pg_receivexlog reports a flush position to
> the server even though --synchronous is not specified. So users might think
> that the --synchronous option is not necessary for a synchronous pg_receivexlog
> setup. But that's not true. If the --synchronous option is not specified, the
> received WAL data is flushed to disk only when the WAL segment is switched.
> So the transactions on the master may need to wait for a long time, i.e.,
> synchronous pg_receivexlog doesn't work smoothly. To avoid such a situation,
> the --synchronous option also needs to be specified, which makes
> pg_receivexlog flush WAL data immediately after receiving it.

Thank you for this information.  I hope to have clarified this in the
documentation now.





Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-19 Thread Robert Haas
On Thu, Nov 19, 2015 at 2:59 AM, Amit Kapila  wrote:
> On Thu, Nov 19, 2015 at 12:27 AM, Robert Haas  wrote:
>>
>> On Wed, Nov 18, 2015 at 7:25 AM, Amit Kapila 
>> wrote:
>> > Don't we need the startup cost in case we need to build partial paths for
>> > joinpaths like mergepath?
>> > Also, I think there are other cases for single relation scans where
>> > startup cost can matter, like when there are pseudoconstants in the
>> > qualification (refer cost_qual_eval_walker()) or, let us say, if someone
>> > has disabled seq scan (disable_cost is considered as startup cost).
>>
>> I'm not saying that we don't need to compute it.  I'm saying we don't
>> need to take it into consideration when deciding which paths have
>> merit.   Note that consider_startup is set this way:
>>
>> rel->consider_startup = (root->tuple_fraction > 0);
>>
>
> Even when consider_startup is false, startup_cost is still used in the cost
> calculations. Ignoring that may be okay for partial paths, but it still seems
> worth thinking about why leaving it out is okay for partial paths even though
> it is used in add_path().

That is explained in the comments, and I just explained it again in my
previous email.  I'm not sure how much clearer I can make it.  For a
regular path, it might sometimes be useful to pick a path with a
higher total cost if it has a lower startup cost.  The reason this
could be useful is because we might not run the resulting plan to
completion.  However, parallel queries are always run to completion,
so a lower startup cost isn't useful.  We just want the path with the
lowest total cost.  I don't know what else to say here unless you can
ask a more specific question.

> Won't it be useful to consider parameterized paths for the below kind of
> plan, where we can push the join tree to a worker so that each worker can
> scan the complete outer relation A and the rest of the work is divided
> among the workers (of course there can be other ways to parallelize such
> joins, but the way described also seems possible)?
>
> NestLoop
> -> Seq Scan on A
> Hash Join
> Join Condition: B.Y = C.W
> -> Seq Scan on B
> -> Index Scan using C_Z_IDX on C
> Index Condition: C.Z = A.X

I had thought that this sort of plan wouldn't actually occur in real
life, but it seems that it does.  What you've written here is a little
muddled - the hash join has no hash underneath, for example, and
there'd have to be some sort of join order restriction in order to
consider a plan of this type.  However, somewhat to my surprise, I was
able to get a plan much like this by doing this:

rhaas=# create table a (x int);
CREATE TABLE
rhaas=# insert into a values (1);
INSERT 0 1
rhaas=# create table b (y int, filler text);
CREATE TABLE
rhaas=# insert into b select g,
random()::text||random()::text||random()::text||random()::text from
generate_series(1,100) g;
INSERT 0 100
rhaas=# create table c (z int, w int, filler text);
CREATE TABLE
rhaas=# insert into c select g, g,
random()::text||random()::text||random()::text||random()::text from
generate_series(1,100) g;
INSERT 0 100
rhaas=# create index c_z_idx on c (z);
CREATE INDEX
rhaas=# vacuum analyze;
VACUUM
rhaas=# explain analyze select * from A LEFT JOIN (B INNER JOIN C ON
B.Y = C.W) ON C.Z = A.x;
  QUERY PLAN
--
 Nested Loop Left Join  (cost=8.46..26810.48 rows=1 width=152) (actual
time=0.076..166.946 rows=1 loops=1)
   ->  Seq Scan on a  (cost=0.00..1.01 rows=1 width=4) (actual
time=0.015..0.016 rows=1 loops=1)
   ->  Hash Join  (cost=8.46..26809.47 rows=1 width=148) (actual
time=0.057..166.925 rows=1 loops=1)
 Hash Cond: (b.y = c.w)
 ->  Seq Scan on b  (cost=0.00..23051.00 rows=100
width=72) (actual time=0.012..89.013 rows=100 loops=1)
 ->  Hash  (cost=8.44..8.44 rows=1 width=76) (actual
time=0.035..0.035 rows=1 loops=1)
   Buckets: 1024  Batches: 1  Memory Usage: 9kB
   ->  Index Scan using c_z_idx on c  (cost=0.42..8.44
rows=1 width=76) (actual time=0.031..0.032 rows=1 loops=1)
 Index Cond: (z = a.x)
 Planning time: 0.394 ms
 Execution time: 167.015 ms
(11 rows)

This is extremely narrow.  If you have more than one row in A, the
planner doesn't pick a nested loop.  And if you're actually going to
be running queries like this frequently, then you should create an
index on B (Y), which causes you to get an execution time of ~ 0.2 ms
instead of 167 ms, because we generate a parameterized nestloop over
two index scans.  But if you have no index on B (Y) and a does contain
precisely one row, then you can get this sort of plan, and hey, the
runtime is even long enough for parallelism to potentially be useful.

But after thinking about it for a while, I realized that even if you
think 

Re: [HACKERS] [PATCH] Refactoring of LWLock tranches

2015-11-19 Thread Robert Haas
On Thu, Nov 19, 2015 at 11:30 AM, Andres Freund  wrote:
> On November 19, 2015 8:09:38 AM PST, Robert Haas  
> wrote:
>>On Thu, Nov 19, 2015 at 9:04 AM, Ildus Kurbangaliev
>> wrote:
>>> Moving the base tranches to shared memory has been discussed many
>>> times.
>>> The point is using them later in pg_stat_activity and other
>>> monitoring views.
>>
>>I'm not in agreement with this idea.  Actually, I'd prefer that the
>>tranches live in backend-private memory, not shared memory, so that we
>>could for example add backend-local counters to them if desired.
>
> I don't buy that argument. It'd be nearly trivial to have a 
> backend_tranchestats array, indexed by the tranche id, for such counters.

Hmm, true.

> It's really not particularly convenient to allocate tranches right now. You 
> have to store at least the identifier in shared memory and then redo the 
> registration in each process. Otherwise some processes can't identify them. 
> Which is rather inconvenient if you want to register some at runtime.

Sure, that's why we're proposing to use an enum or a list of #defines
for that.  I don't see a need to do any more than that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Using quicksort for every external sort run

2015-11-19 Thread Peter Geoghegan
On Wed, Nov 18, 2015 at 5:22 PM, Greg Stark  wrote:
> On Wed, Nov 18, 2015 at 11:29 PM, Peter Geoghegan  wrote:
>> Other systems expose this explicitly, and, as I said, say in an
>> unqualified way that a multi-pass merge should be avoided. Maybe the
>> warning isn't the right way of communicating that message to the DBA
>> in detail, but I am confident that it ought to be communicated to the
>> DBA fairly clearly.
>
> I'm pretty convinced warnings from DML are a categorically bad idea.
> In any OLTP load they're effectively fatal errors since they'll fill
> up log files or client output or cause other havoc. Or they'll cause
> no problem because nothing is reading them. Neither behaviour is
> useful.

To be clear, this is a LOG level message, not a WARNING.

I think that if the DBA ever sees the multipass_warning message, he or
she does not have an OLTP workload. If you experience what might be
considered log spam due to multipass_warning, then the log spam is the
least of your problems. Besides, log_temp_files is a very similar
setting (albeit one that is not enabled by default), so I tend to
doubt that your view that that style of log message is categorically
bad is widely shared. Having said that, I'm not especially attached to
the idea of communicating the concern to the DBA using the mechanism
of a checkpoint_warning-style LOG message (multipass_warning).

Yes, I really do mean it when I say that the DBA is not supposed to
see this message, no matter how much or how little memory or data is
involved. There is no nuance intended here; it isn't sensible to allow
a multi-pass sort, just as it isn't sensible to allow checkpoints
every 5 seconds. Both of those things can be thought of as thrashing.

> Perhaps the right thing to do is report a statistic to pg_stats so
> DBAs can see how often sorts are in memory, how often they're on disk,
> and how often the on disk sort requires n passes.

That might be better than what I came up with, but I hesitate to track
more things using the statistics collector in the absence of a clear
consensus to do so. I'd be more worried about the overhead of what you
suggest than the overhead of a LOG message, seen only in the case of
something that's really not supposed to happen.

-- 
Peter Geoghegan




Re: [HACKERS] COPY (INSERT/UPDATE/DELETE .. RETURNING ..)

2015-11-19 Thread Michael Paquier
On Thu, Nov 19, 2015 at 9:22 PM, Marko Tiikkaja  wrote:
> Of course, something might break if we added a new statement type which
> supported RETURNING, but I'm really not worried about that.  I'm not dead
> set against adding some Assert(IsA()) calls here, but I don't see the point.

gram.y has a long comment before select_no_parens explaining why we
shouldn't do it this way; did you notice it? I haven't yet crafted a
backward-incompatible query with your patch, but something smells
fishy here.
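For readers following along, this is the kind of statement the patch is about
(table and column names invented; today the grammar only accepts a SELECT or
VALUES query inside COPY (...)):

COPY (INSERT INTO audit_log (msg) VALUES ('hello') RETURNING id, msg)
    TO STDOUT;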
-- 
Michael




Re: [HACKERS] proposal: LISTEN *

2015-11-19 Thread Tom Lane
Alvaro Herrera  writes:
> Marko Tiikkaja wrote:
>> On the test server I'm running on, it doesn't look quite as bad as the
>> profiles we had in production, but s_lock is still the worst offender in the
>> profiles, called from:
>> 
>> - 80.33% LWLockAcquire
>> + 48.34% asyncQueueReadAllNotifications
>> + 23.09% SIGetDataEntries
>> + 16.92% SimpleLruReadPage_ReadOnly
>> + 10.21% TransactionIdIsInProgress
>> + 1.27% asyncQueueAdvanceTail
>> 
>> which roughly looks like what I recall from our actual production profiles.

> So the problem is in the bad scalability of LWLock, not in async.c itself?
> In master, the spinlock there has been replaced with atomics; does that branch
> work better?

FWIW, I can't get results anything like this.  With 50 notifiers and 50
listeners, I only see LWLockAcquire eating about 6% of the system-wide
runtime:

-6.23% 0.64% 11850   postmaster  postgres   
   [.] LWLockAcquire
   - LWLockAcquire
  + 33.16% asyncQueueReadAllNotifications
  + 20.76% SimpleLruReadPage_ReadOnly
  + 18.02% TransactionIdIsInProgress
  + 8.67% VirtualXactLockTableInsert
  + 3.98% ProcArrayEndTransaction
  + 3.03% VirtualXactLockTableCleanup
  + 1.77% PreCommit_Notify
  + 1.68% ProcessCompletedNotifies
  + 1.62% asyncQueueAdvanceTail
  + 1.28% ProcessNotifyInterrupt
  + 1.04% StartTransaction
  + 1.00% LockAcquireExtended
  + 0.96% ProcSleep
  + 0.81% LockReleaseAll
  + 0.69% TransactionIdSetPageStatus
  + 0.54% GetNewTransactionId

There isn't any really sore-thumb place that we might fix.  Interestingly,
a big chunk of the runtime is getting eaten by the kernel, breaking down
more or less like this:

-   36.52% 0.05%   836   postmaster  [kernel.kallsyms]  
   [k] system_call_fastpath
   - system_call_fastpath
  + 37.47% __GI___kill
  + 27.26% __poll
  + 9.55% __write_nocancel
  + 8.87% __GI___semop
  + 7.31% __read_nocancel
  + 5.14% __libc_recv
  + 3.77% __libc_send
  + 0.53% __GI___setitimer

The kill() calls are evidently from async.c's SignalBackends() while most
of the other kernel activity is connected to client/server I/O.

Since the clients aren't doing any actual database work, just notify and
receive notifies, this doesn't seem like a representative workload.
So I can't get that excited about how asyncQueueReadAllNotifications and
subsidiary TransactionIdIsInProgress tests cause a large fraction of the
LWLock acquisitions --- that's only true because nothing else very useful
is happening.

BTW, this is HEAD, but I tried 9.4 quickly and didn't really see anything
much different.  Not sure why the discrepancy with Marko's results.

regards, tom lane




Re: [HACKERS] Typo in file header comment of replorigindesc.c

2015-11-19 Thread Robert Haas
On Thu, Nov 19, 2015 at 5:04 AM, Amit Langote
 wrote:
> Attached fixes $SUBJECT:
>
> s/replication_origin.c/origin.c/g

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Typo in file header comment of replorigindesc.c

2015-11-19 Thread Andres Freund
On 2015-11-20 10:25:20 +0900, Amit Langote wrote:
> On 2015/11/20 10:24, Robert Haas wrote:
> > On Thu, Nov 19, 2015 at 5:04 AM, Amit Langote
> >  wrote:
> >> Attached fixes $SUBJECT:
> >>
> >> s/replication_origin.c/origin.c/g
> > 
> > Committed.

Thanks Robert and Amit!


I do wonder however if we shouldn't just get rid of all these comments
at some point. They primarily seem to serve as boilerplate that people
get wrong over the evolution of patches.

Andres




Re: [HACKERS] Using quicksort for every external sort run

2015-11-19 Thread Peter Geoghegan
On Thu, Nov 19, 2015 at 5:32 PM, Greg Stark  wrote:
> Hm. Have you tested a nearly-sorted input set around 1.5x the size of
> work_mem? That should produce a single run using the heap to generate
> runs but generate two runs if, AIUI, you're just filling work_mem,
> running quicksort, dumping that run entirely and starting fresh.

Yes. Actually, even with a random ordering, on average replacement
selection will produce runs twice as long as the patch series does.
With nearly ordered input, there is no limit to how long runs can be --
you could definitely have cases where *no* merge step is required; we
just return tuples from one long run. And yet, it isn't worth it in the
cases that I tested.

Please don't take my word for it -- try yourself.
-- 
Peter Geoghegan




Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-11-19 Thread Amit Langote
On 2015/11/20 0:57, Jim Nasby wrote:
> On 11/19/15 1:18 AM, Amit Langote wrote:
>> 1) General purpose interface for (maintenance?) commands to report a set
> 
> I'm surprised no one has picked up on using this for DML. Certainly anyone
> works with ETL processes would love to be able to get some clue on the
> status of a long running query...

Instrumenting query execution for progress info would be a complex beast,
though. Also, what kind of reporting interface it would require is not
clear, at least to me. Jan Urbanski's PGCon presentation[1], which I
discovered in this thread, is a good source on the matter, thanks! But IMHO,
for now, it would be worthwhile to focus our resources on the more modest
goal of implementing a reporting interface for utility commands. Sure, it
would be nice to investigate how much the requirements of the two overlap.
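As a purely illustrative sketch (the view and column names below are
invented, not taken from any patch), this is the kind of interface a DBA
might eventually query for a utility command such as VACUUM:

SELECT pid, relid, phase, heap_blks_scanned, heap_blks_total
  FROM pg_stat_vacuum_progress;   -- hypothetical view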

> 
>> About pass 2, ie, lazy_index_vacuum() and
>> lazy_vacuum_heap(), I don't see how we can do better than reporting its
>> progress only after finishing all of it without any finer-grained
>> instrumentation. They are essentially black-box as far as the proposed
>> instrumentation approach is concerned. Being able to report progress per
>> index seems good but as a whole, a user would have to wait arbitrarily
>> long before the numbers move forward. We might as well just report a bool
>> saying we're about to enter a potentially time-consuming index vacuum
>> round with possibly multiple indexes followed by lazy_vacuum_heap()
>> processing. Additionally, we can report the incremented count of the
>> vacuuming round (pass 2) once we are through it.
> 
> Another option is to provide the means for the index scan routines to
> report their progress. Maybe every index AM won't use it, but it'd
> certainly be a lot better than staring at a long_running boolean.

The boolean would be a workaround, for sure. I'm also slightly tempted by
the idea of instrumenting vacuum scans of individual index AMs' bulkdelete
methods. One precedent is how vacuum_delay_point() calls are sprinkled around
in the code. Another problem to solve would be figuring out how to pass
progress parameters around - via some struct, or could they be globals just
like the VacuumCost* variables are...

> 
>> Note that we can leave them out of
>> percent_done of overall vacuum progress. Until we have a good solution for
>> number (3) above, it seems too difficult to incorporate index pages into
>> overall progress.
> 
> IMHO we need to either put a big caution sign on any % estimate that it
> could be wildly off, or just forgo it completely for now. I'll bet that if
> we don't provide it some enterprising users will figure out the best way
> to do this (similar to how the bloat estimate query has evolved over time).
> 
> Even if we never get a % done indicator, just being able to see what
> 'position' a command is at will be very valuable.

Agreed. If we provide enough information in whatever view we choose to
expose, that would be a good start.

> 
>> As someone pointed out upthread, the final heap truncate phase can take
>> arbitrarily long and is outside the scope of lazy_scan_heap() to
>> instrument. Perhaps a bool, say, waiting_heap_trunc could be reported for
>> the same. Note that, it would have to be reported from lazy_vacuum_rel().
> 
> ISTM this is similar to the problem of reporting index status, namely that
> a progress reporting method needs to accept reports from multiple places
> in the code.

Yes.

Thanks,
Amit

[1] http://www.pgcon.org/2013/schedule/events/576.en.html





Re: [HACKERS] Using quicksort for every external sort run

2015-11-19 Thread Peter Geoghegan
On Thu, Nov 19, 2015 at 2:53 PM, Peter Geoghegan  wrote:
> The latter 16MB work_mem example query/table is still faster with a
> 12MB work_mem than master, even with multiple passes. Quite a bit
> faster, in fact: about 37 seconds on master, to about 24.7 seconds
> with the patches (same for higher settings short of 16MB).

I made the same comparison with work_mem sizes of 2MB and 6MB for
master/patch, and the patch *still* came out ahead, often by over 10%.
This was more than fair, though, because sometimes the final
on-the-fly merge for the master branch was only starting at a point at
which the patch series had already completed its sort. (Of course, I
don't believe that any user would ever be well served by such a low
work_mem setting for these queries -- I'm looking for a bad case,
though.)

I guess this is a theoretical downside of my approach, that is more
than made up for elsewhere (even leaving aside the final, unrelated
patch in the series, addressing the merge bottleneck directly). So, to
summarize such downsides (downsides of a hybrid sort-merge strategy as
compared to replacement selection):

* As mentioned just now, the fact that there are more runs -- merging
can be slower (although tuples can be returned earlier, which could
also help with CREATE INDEX). This is more of a problem when random
I/O is expensive, and less of a problem when the OS cache buffers
things nicely.

* One run can be created with replacement selection, where a
hybrid sort-merge strategy needs to create and then merge many runs.
When I started work on this patch, I was pretty sure that case would
be noticeably regressed. I was wrong.

* Abbreviated key comparisons are used less because runs are smaller.
This is why sorts of types like numeric are not especially sympathetic
to the patch. Still, we manage to come out well ahead overall.

You can perhaps show the patch to be almost as slow as the master
branch with a very unsympathetic case involving the union of all these
3. I couldn't regress a case with integers with just the first two,
though.

-- 
Peter Geoghegan




Re: [HACKERS] Typo in file header comment of replorigindesc.c

2015-11-19 Thread Amit Langote
On 2015/11/20 10:24, Robert Haas wrote:
> On Thu, Nov 19, 2015 at 5:04 AM, Amit Langote
>  wrote:
>> Attached fixes $SUBJECT:
>>
>> s/replication_origin.c/origin.c/g
> 
> Committed.

Thanks!

Amit






Re: [HACKERS] Using quicksort for every external sort run

2015-11-19 Thread Greg Stark
On Fri, Nov 20, 2015 at 12:54 AM, Peter Geoghegan  wrote:
> * One run can be created with replacement selection, where a
> hybrid sort-merge strategy needs to create and then merge many runs.
> When I started work on this patch, I was pretty sure that case would
> be noticeably regressed. I was wrong.


Hm. Have you tested a nearly-sorted input set around 1.5x the size of
work_mem? That should produce a single run using the heap to generate
runs but generate two runs if, AIUI, you're just filling work_mem,
running quicksort, dumping that run entirely and starting fresh.

I don't mean to say it's representative but if you're looking for a
worst case...

-- 
greg




Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-11-19 Thread Etsuro Fujita

On 2015/11/20 6:57, Robert Haas wrote:

On Wed, Nov 18, 2015 at 10:54 PM, Etsuro Fujita
 wrote:

Noted, but let's do it that way and move on.  It would be a shame if
we didn't end up with a working FDW join pushdown system in 9.6
because of a disagreement on this point.



Another idea would be to consider join pushdown as unsupported for now when
select-for-update is involved in 9.5, as described in [1], and revisit this
issue when adding join pushdown to postgres_fdw in 9.6.



Well, I think it's probably too late to squeeze this into 9.5 at this
point, but I'm eager to get it fixed for 9.6.


OK, I'll update the postgres_fdw-join-pushdown patch so as to work with 
that callback routine, if needed.


Best regards,
Etsuro Fujita





Re: [HACKERS] Using quicksort for every external sort run

2015-11-19 Thread Peter Geoghegan
On Wed, Nov 18, 2015 at 6:19 PM, Robert Haas  wrote:
> On Wed, Nov 18, 2015 at 6:29 PM, Peter Geoghegan  wrote:
>> In principle, I have no problem with doing that. Through testing, I
>> cannot see any actual upside, though. Perhaps I just missed something.
>> Even 8MB is enough to avoid the multipass merge in the event of a
>> surprisingly high volume of data (my work laptop is elsewhere, so I
>> don't have my notes on this in front of me, but I figured out the
>> crossover point for a couple of cases).
>
> I'd be interested in seeing this analysis in some detail.

Sure. Jeff mentioned 8MB as a work_mem setting, so let's examine a
case where that's the work_mem setting, and see experimentally where
the crossover point for a multi-pass sort ends up.

If this table is created:

postgres=# create unlogged table bar as select (random() * 1e9)::int4
idx, 'payload xyz'::text payload from generate_series(1, 1010) i;
SELECT 1010

Then, on my system, a work_mem setting of 8MB *just about* avoids
seeing the multipass_warning message with this query:

postgres=# select count(distinct idx) from bar ;

   count

 10,047,433
(1 row)

A work_mem setting of 235MB is just enough to make the query's sort
fully internal.

Let's see how things change with a higher work_mem setting of 16MB. I
mentioned quadratic growth: Having doubled work_mem, let's *quadruple*
the number of tuples, to see where this leaves a 16MB setting WRT a
multi-pass merge:

postgres=# drop table bar ;
DROP TABLE
postgres=# create unlogged table bar as select (random() * 1e9)::int4
idx, 'payload xyz'::text payload from generate_series(1, 1010 * 4)
i;
SELECT 4040

Further experiments show that this is the exact point at which the
16MB work_mem setting similarly narrowly avoids a multi-pass warning.
This should be the dominant consideration, because now a fully
internal sort requires 4X the work_mem of my original 16MB work_mem
example table/query.

The quadratic growth in a simple hybrid sort-merge strategy's ability
to avoid a multi-pass merge phase (growth relative to linear increases
in work_mem) can be demonstrated with simple experiments.
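A rough way to see that quadratic relationship (a sketch, writing M for
work_mem and b for the per-tape merge buffer):

    \text{single-pass capacity} \;\approx\;
    \underbrace{M}_{\text{run size}} \times \underbrace{\tfrac{M}{b}}_{\text{mergeable runs}}
    \;=\; \frac{M^2}{b}

so doubling work_mem roughly quadruples the amount of data that can be sorted
with a single merge pass, which matches the doubled-work_mem/quadrupled-tuples
experiment above.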

-- 
Peter Geoghegan




Re: [HACKERS] Using quicksort for every external sort run

2015-11-19 Thread Greg Stark
On Thu, Nov 19, 2015 at 8:35 PM, Greg Stark  wrote:
> Hm. So a bit of back-of-envelope calculation. If we want to
> buffer at least 1MB for each run -- I think we currently do more
> actually -- and say that a 1GB work_mem ought to be enough to run
> reasonably (that's per sort after all, and there might be multiple
> sorts, to say nothing of other users on the system), that means we can
> merge about 1,000 runs in the final merge. Each run will be about 2GB
> currently but 1GB if we quicksort the runs. So the largest table we
> can sort in a single pass is 1-2 TB.


For the sake of pedantry I fact-checked myself. We calculate the
number of tapes based on wanting to buffer 32 blocks plus overhead, so
about 256kB per tape. So the actual maximum you can handle with 1GB of
sort memory without multiple merges is on the order of 4-8TB.
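Spelling the corrected arithmetic out:

    \frac{1\ \text{GB}}{256\ \text{kB}} \approx 4096\ \text{runs}, \qquad
    4096 \times (1\text{--}2)\ \text{GB per run} \approx 4\text{--}8\ \text{TB}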

-- 
greg




Re: [HACKERS] [PATCH] Refactoring of LWLock tranches

2015-11-19 Thread Andres Freund
On 2015-11-19 14:58:05 -0500, Robert Haas wrote:
> On Thu, Nov 19, 2015 at 11:30 AM, Andres Freund  wrote:
> > It's really not particularly convenient to allocate tranches right
> > now. You have to store at least the identifier in shared memory and
> > then redo the registration in each process. Otherwise some processes
> > can't identify them. Which is rather inconvenient if you want to
> > register some at runtime
>
> Sure, that's why we're proposing to use an enum or a list of #defines
> for that.  I don't see a need to do any more than that.

That works fine for builtin stuff, but not at all for extensions doing
it.

If you do register locks at runtime, instead of from shared_preload_libraries -
something you surely agree makes some things easier by not requiring a
restart - you really don't have any way to force the registration to
happen in each backend.




Re: [HACKERS] Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

2015-11-19 Thread David Steele
On 11/19/15 11:05 AM, Robert Haas wrote:
> On Wed, Nov 18, 2015 at 10:21 AM, Alvaro Herrera
>  wrote:
>> In my days of Perl, it was starting to become frowned upon to call
>> subroutines without parenthesizing arguments.  Is that no longer the
>> case?  Because I notice there are many places in this patch and in pre-
>> existing code that call psql with an argument list without parens.  And it's
>> a bit odd because I couldn't find any other subroutine that we're using
>> in that way.
> 
> I've been coding in Perl for more than 20 years and have never heard
> of such a rule.

I follow the convention of using parentheses for all function calls in
Perl, though this stems more from my greater familiarity with languages
that require them than any adherence to vague Perl conventions.

I do think it makes the code [more] readable.

-- 
-David
da...@pgmasters.net





Re: [HACKERS] Using quicksort for every external sort run

2015-11-19 Thread Simon Riggs
On 19 November 2015 at 01:22, Greg Stark  wrote:


> Perhaps the right thing to do is report a statistic to pg_stats so
> DBAs can see how often sorts are in memory, how often they're on disk,
> and how often the on disk sort requires n passes. That would put them
> in the same category as "sequential scans" for DBAs that expect the
> application to only run index-based OLTP queries for example. The
> problem with this is that sorts are not tied to a particular relation
> and without something to group on the stat will be pretty hard to act
> on.
>

+1

We don't have a message appear when hash joins go weird, and we
definitely don't want anything like that for sorts either.

-- 
Simon Riggs                http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Bug in numeric multiplication

2015-11-19 Thread Dean Rasheed
On 18 November 2015 at 22:19, Tom Lane  wrote:
> A bit of progress on this: I've found a test case, namely
>
> select sqrt(99...
>

Wow.

> Still, this proves that we are onto a real problem.
>

Agreed. I had a go at turning that example into something using
log(base, num) so that the result would be visible in a pure SQL test,
but I didn't have any luck.

>> Another issue here is that with outercarry added into the qdigit
>> computation, it's no longer clear what the bounds of qdigit itself are,
>
> I concluded that that particular issue is a red herring: qdigit should
> always be a fairly accurate estimate of the next output digit, so it
> cannot fall very far outside the range 0..NBASE-1.  Testing confirms that
> the range seen in the regression tests is exactly -1 to 1, which is
> what you'd expect since there can be some roundoff error.
>

Right, I came to the same conclusion.

> Also, after further thought I've been able to construct an argument why
> outercarry stays bounded.  See what you think of the comments in the
> attached patch, which incorporates your ideas about postponing the div[qi]
> calculation.

I think the logic is sound, but I worry that that comment is going to
be very difficult to follow.

I had a go at trying to find a simpler approach and came up with the
attached, which computes the value intended for div[qi+1] near the top
of the loop (using doubles) so that it can detect if it will overflow,
and if so it enters the normalisation block. It still needs some work
to prove that the normalised value for fnextdigit won't overflow, but
it feels like a promising, easier-to-follow approach to me. What do
you think?

Regards,
Dean
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
new file mode 100644
index dcdc5cf..02146f7
*** a/src/backend/utils/adt/numeric.c
--- b/src/backend/utils/adt/numeric.c
*** div_var_fast(NumericVar *var1, NumericVa
*** 6186,6192 
  	double		fdividend,
  fdivisor,
  fdivisorinverse,
! fquotient;
  	int			qi;
  	int			i;
  
--- 6186,6193 
  	double		fdividend,
  fdivisor,
  fdivisorinverse,
! fquotient,
! fnextdigit;
  	int			qi;
  	int			i;
  
*** div_var_fast(NumericVar *var1, NumericVa
*** 6274,6279 
--- 6275,6284 
  	 * To avoid overflow in maxdiv itself, it represents the max absolute
  	 * value divided by NBASE-1, ie, at the top of the loop it is known that
  	 * no div[] entry has an absolute value exceeding maxdiv * (NBASE-1).
+ 	 *
+ 	 * Note that maxdiv only includes digit positions that are still part of
+ 	 * the dividend, excluding the first dividend digit, ie, strictly speaking
+ 	 * the above holds only for div[i] where i > qi, at the top of the loop.
  	 */
  	maxdiv = 1;
  
*** div_var_fast(NumericVar *var1, NumericVa
*** 6295,6371 
  		qdigit = (fquotient >= 0.0) ? ((int) fquotient) :
  			(((int) fquotient) - 1);	/* truncate towards -infinity */
  
! 		if (qdigit != 0)
  		{
! 			/* Do we need to normalize now? */
! 			maxdiv += Abs(qdigit);
! 			if (maxdiv > (INT_MAX - INT_MAX / NBASE - 1) / (NBASE - 1))
  			{
! /* Yes, do it */
! carry = 0;
! for (i = div_ndigits; i > qi; i--)
  {
! 	newdig = div[i] + carry;
! 	if (newdig < 0)
! 	{
! 		carry = -((-newdig - 1) / NBASE) - 1;
! 		newdig -= carry * NBASE;
! 	}
! 	else if (newdig >= NBASE)
! 	{
! 		carry = newdig / NBASE;
! 		newdig -= carry * NBASE;
! 	}
! 	else
! 		carry = 0;
! 	div[i] = newdig;
  }
! newdig = div[qi] + carry;
! div[qi] = newdig;
! 
! /*
!  * All the div[] digits except possibly div[qi] are now in the
!  * range 0..NBASE-1.
!  */
! maxdiv = Abs(newdig) / (NBASE - 1);
! maxdiv = Max(maxdiv, 1);
! 
! /*
!  * Recompute the quotient digit since new info may have
!  * propagated into the top four dividend digits
!  */
! fdividend = (double) div[qi];
! for (i = 1; i < 4; i++)
  {
! 	fdividend *= NBASE;
! 	if (qi + i <= div_ndigits)
! 		fdividend += (double) div[qi + i];
  }
! /* Compute the (approximate) quotient digit */
! fquotient = fdividend * fdivisorinverse;
! qdigit = (fquotient >= 0.0) ? ((int) fquotient) :
! 	(((int) fquotient) - 1);	/* truncate towards -infinity */
! maxdiv += Abs(qdigit);
  			}
  
! 			/* Subtract off the appropriate multiple of the divisor */
! 			if (qdigit != 0)
! 			{
! int			istop = Min(var2ndigits, div_ndigits - qi + 1);
  
! for (i = 0; i < istop; i++)
! 	div[qi + i] -= qdigit * var2digits[i];
  			}
  		}
  
  		/*
! 		 * The dividend digit we are about to replace might still be nonzero.
! 		 * Fold it into the next digit position.  We don't need to worry about
! 		 * overflow here since this should nearly cancel with the subtraction
! 		 * of the divisor.
  		 */
! 		div[qi + 1] += div[qi] * NBASE;
  
  		div[qi] = qdigit;
  

Re: [HACKERS] Using quicksort for every external sort run

2015-11-19 Thread Greg Stark
On Thu, Nov 19, 2015 at 6:56 PM, Peter Geoghegan  wrote:
> Yes, I really do mean it when I say that the DBA is not supposed to
> see this message, no matter how much or how little memory or data is
> involved. There is no nuance intended here; it isn't sensible to allow
> a multi-pass sort, just as it isn't sensible to allow checkpoints
> every 5 seconds. Both of those things can be thought of as thrashing.

Hm. So a bit of back-of-envelope calculation. If we want to
buffer at least 1MB for each run -- I think we currently do more
actually -- and say that a 1GB work_mem ought to be enough to run
reasonably (that's per sort after all, and there might be multiple
sorts, to say nothing of other users on the system), that means we can
merge about 1,000 runs in the final merge. Each run will be about 2GB
currently but 1GB if we quicksort the runs. So the largest table we
can sort in a single pass is 1-2 TB.

If we go above those limits we have the choice of buffering less per
run or doing a whole second pass through the data. I suspect we would
get more horsepower out of buffering less though I'm not sure where
the break-even point is. Certainly if we did random I/O for every I/O
that's much more expensive than a factor of 2 over sequential I/O. We
could probably do the math based on random_page_cost and
seq_page_cost to calculate the minimum amount of buffering
before it's worth doing an extra pass.

So I think you're kind of right and kind of wrong. The vast majority
of use cases are either sub 1TB or are in work environments designed
specifically for data warehouse queries where a user can obtain much
more memory for their queries. However I think it's within the
intended use cases that Postgres should be able to handle a few
terabytes of data on a moderately sized machine in a shared
environment too.

Our current defaults are particularly bad for this, though. If you
initdb a new Postgres database today, create a table of even a few
gigabytes, and try to build an index on it, it takes forever. The last
time I did a test I canceled it after it had run for hours, raised
maintenance_work_mem, and built the index in a few minutes. The problem
is that if we just raise those limits then people will use more
resources when they don't need to. If it were safe to have those
limits be much higher then we could make the defaults reflect what
people want when they do bigger jobs rather than just what they want
for normal queries or indexes.

> I think that if the DBA ever sees the multipass_warning message, he or she 
> does not have an OLTP workload.

Hm, that's pretty convincing. I guess this isn't the usual sort of
warning due to the time it would take to trigger.

-- 
greg




Re: [HACKERS] Using quicksort for every external sort run

2015-11-19 Thread Peter Geoghegan
On Thu, Nov 19, 2015 at 12:35 PM, Greg Stark  wrote:
> So I think you're kind of right and kind of wrong. The vast majority
> of use cases are either sub 1TB or are in work environments designed
> specifically for data warehouse queries where a user can obtain much
> more memory for their queries. However I think it's within the
> intended use cases that Postgres should be able to handle a few
> terabytes of data on a moderately sized machine in a shared
> environment too.

Maybe I've made this more complicated than it needs to be. The fact is
that my recent 16MB example is still faster than the master branch
when a multi-pass merge is performed (e.g. when work_mem is 15MB,
or even 12MB). More on that later.

> Our current defaults are particularly bad for this, though. If you
> initdb a new Postgres database today, create a table of even a few
> gigabytes, and try to build an index on it, it takes forever. The last
> time I did a test I canceled it after it had run for hours, raised
> maintenance_work_mem, and built the index in a few minutes. The problem
> is that if we just raise those limits then people will use more
> resources when they don't need to.

I think that the bigger problems are:

* There is a harsh discontinuity in the cost function -- performance
suddenly falls off a cliff when a sort must be performed externally.

* Replacement selection is obsolete. It's very slow on machines from
the last 20 years.

> If it were safe to have those
> limits be much higher then we could make the defaults reflect what
> people want when they do bigger jobs rather than just what they want
> for normal queries or indexes.

Or better yet, make it so that it doesn't really matter that much,
even while you're still using the same amount of memory as before.

If you're saying that the whole work_mem model isn't a very good one,
then I happen to agree. It would be very nice to have some fancy
admission control feature, but I'd still appreciate a cost model that
dynamically sets work_mem. Such a model would avoid an excessively high
setting when, say, only about half the memory needed for a 10GB sort
is available. You should probably have 5 runs sized 2GB, rather than 2 runs
sized 5GB, even if you can afford the memory for the latter. It would
still make sense to have very high work_mem settings when you can
dynamically set it so high that the sort does complete internally,
though.

>> I think that if the DBA ever sees the multipass_warning message, he or she 
>> does not have an OLTP workload.
>
> Hm, that's pretty convincing. I guess this isn't the usual sort of
> warning due to the time it would take to trigger.

I would like more opinions on the multipass_warning message. I can
write a patch that creates a new system view, detailing how sorts were
completed, if there is demand.
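Such a view might look something like this (the names are purely
hypothetical, just to make the idea concrete):

SELECT datname, sorts_internal, sorts_external_single_pass, sorts_multipass
  FROM pg_stat_sorts;   -- hypothetical view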

-- 
Peter Geoghegan




Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-11-19 Thread Robert Haas
On Wed, Nov 18, 2015 at 10:54 PM, Etsuro Fujita
 wrote:
>> Noted, but let's do it that way and move on.  It would be a shame if
>> we didn't end up with a working FDW join pushdown system in 9.6
>> because of a disagreement on this point.
>
> Another idea would be to consider join pushdown as unsupported for now when
> select-for-update is involved in 9.5, as described in [1], and revisit this
> issue when adding join pushdown to postgres_fdw in 9.6.

Well, I think it's probably too late to squeeze this into 9.5 at this
point, but I'm eager to get it fixed for 9.6.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-11-19 Thread Robert Haas
On Thu, Nov 19, 2015 at 7:19 AM, Thom Brown  wrote:
> On 27 March 2015 at 04:54, Kyotaro HORIGUCHI
>  wrote:
>> Hello,
>>
>> At Thu, 26 Mar 2015 18:50:24 +0100, Andres Freund  
>> wrote in <20150326175024.gj...@alap3.anarazel.de>
>>> I think the problem here is that the *primary* makes no such
>>> assumptions. Init forks are logged via stuff like
>>>   smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
>> ..
>>> i.e. the data is written out directly to disk, circumventing
>>> shared_buffers. It's pretty bad that we don't do the same on the
>>> standby. For master I think we should just add a bit to the XLOG_FPI
>>> record saying the data should be forced out to disk. I'm less sure
>>> what's to be done in the back branches. Flushing every HEAP_NEWPAGE
>>> record isn't really an option.
>>
>> The problem exists only for INIT_FORKNUM. So I suppose it is
>> enough to check forknum to decide whether to sync immediately.
>>
>> Specifically for this instance, syncing buffers of INIT_FORKNUM
>> at the end of XLOG_FPI block in xlog_redo fixed the problem.
>>
>> Another (ugly!) solution would be syncing only the buffers that are for
>> INIT_FORKNUM and BM_DIRTY in ResetUnloggedRelations(op =
>> UNLOGGED_RELATION_INIT). This is a catch-all-at-once solution
>> though it is a kind of reversion of fast promotion. But buffers
>> to be synced here should be pretty few.
>
> This bug still exists.

Hmm.  This probably should have been on the open items list.  I didn't
pay too much attention this to this before because it seemed like
Andres and Michael were all over it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Refactoring of LWLock tranches

2015-11-19 Thread Robert Haas
On Thu, Nov 19, 2015 at 4:26 PM, Andres Freund  wrote:
> On 2015-11-19 14:58:05 -0500, Robert Haas wrote:
>> On Thu, Nov 19, 2015 at 11:30 AM, Andres Freund  wrote:
>> > It's really not particularly convenient to allocate tranches right
>> > now. You have to store at least the identifier in shared memory and
>> > then redo the registration in each process. Otherwise some processes
>> > can't identify them. Which is rather inconvenient if you want to
>> > register some at runtime.
>>
>> Sure, that's why we're proposing to use an enum or a list of #defines
>> for that.  I don't see a need to do any more than that.
>
> That works fine for builtin stuff, but not at all for extensions doing
> it.
>
> If you do register locks at runtime, instead of shared_preload_library -
> something you surely agree makes some things easier by not requiring a
> restart - you really don't have any way to force the registration to
> happen in each backend.

I'm not exactly clear what you are proposing.  Pick some relatively
small cap on the number of tranches and put an array of them in shared
memory?  Blech.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-11-19 Thread Robert Haas
On Thu, Nov 19, 2015 at 6:39 AM, Kouhei Kaigai  wrote:
> So, are you suggesting to make a patch that allows ForeignScan to have
> multiple sub-plans right now? Or, one sub-plan?

Two:

http://www.postgresql.org/message-id/ca+tgmoyzeje+ot1kx4wdob7r7dps0cwxazfqz-14ykfkgkr...@mail.gmail.com

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2015-11-19 Thread Peter Geoghegan
On Thu, Nov 19, 2015 at 2:35 PM, Robert Haas  wrote:
> OK, so reversing this analysis, with the default work_mem of 4MB, we'd
> need a multi-pass merge for more than 235MB/4 = 58MB of data.  That is
> very, very far from being a can't-happen scenario, and I would not at
> all think it would be acceptable to ignore such a case.

> I think you need to revisit your assumptions here.

Which assumption? Are we talking about multipass_warning, or my patch
series in general? Obviously those are two very different things. As
I've said, we could address the visibility aspect of this differently.
I'm fine with that.

I'll now talk about my patch series in general -- the actual
consequences of not avoiding a single pass merge phase when the master
branch would have done so.

The latter 16MB work_mem example query/table is still faster with a
12MB work_mem than master, even with multiple passes. Quite a bit
faster, in fact: about 37 seconds on master, to about 24.7 seconds
with the patches (same for higher settings short of 16MB).

Now, that's probably slightly unfair on the master branch, because the
patches still have the benefit of the memory pooling during the merge
phase, which has nothing to do with what we're talking about, and
because my laptop still has plenty of RAM.

I should point out that there is no evidence that any case has been
regressed, let alone written off entirely or ignored. I looked. I
probably have not been completely exhaustive, and I'd be willing to
believe there is something that I've missed, but it's still quite
possible that there is no downside to any of this.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] ROLES and ALTER DEFAULT PRIVILEGES

2015-11-19 Thread David Fetter
Folks,

The docs for ALTER DEFAULT PRIVILEGES state:

You can change default privileges only for objects that will be
created by yourself or by roles that you are a member of.

but I have not been able to reproduce the "or by roles that you are a
member of" part.  The attached script should create a table tab_one()
which role baz can read.  No such grant occurs.

As I understand the docs, anything created by bar should have the same
default privileges as foo, and as any other roles of which bar is a
member.

I think that this is a bug, and that the fix should be back-patched.

What say?

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
CREATE ROLE foo WITH LOGIN;

CREATE ROLE bar WITH LOGIN IN ROLE foo;

CREATE ROLE baz WITH LOGIN;

\c - foo

ALTER DEFAULT PRIVILEGES GRANT SELECT ON TABLES TO bar;

\c - bar

CREATE TABLE tab_one();

\ddp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2015-11-19 Thread Robert Haas
On Thu, Nov 19, 2015 at 3:43 PM, Peter Geoghegan  wrote:
>> I'd be interested in seeing this analysis in some detail.
>
> Sure. Jeff mentioned 8MB as a work_mem setting, so let's examine a
> case where that's the work_mem setting, and see experimentally where
> the crossover point for a multi-pass sort ends up.
>
> If this table is created:
>
> postgres=# create unlogged table bar as select (random() * 1e9)::int4
> idx, 'payload xyz'::text payload from generate_series(1, 10100000) i;
> SELECT 10100000
>
> Then, on my system, a work_mem setting of 8MB *just about* avoids
> seeing the multipass_warning message with this query:
>
> postgres=# select count(distinct idx) from bar ;
>
>count
> 
>  10,047,433
> (1 row)
>
> A work_mem setting of 235MB is just enough to make the query's sort
> fully internal.
>
> Let's see how things change with a higher work_mem setting of 16MB. I
> mentioned quadratic growth: Having doubled work_mem, let's *quadruple*
> the number of tuples, to see where this leaves a 16MB setting WRT a
> multi-pass merge:
>
> postgres=# drop table bar ;
> DROP TABLE
> postgres=# create unlogged table bar as select (random() * 1e9)::int4
> idx, 'payload xyz'::text payload from generate_series(1, 10100000 * 4)
> i;
> SELECT 40400000
>
> Further experiments show that this is the exact point at which the
> 16MB work_mem setting similarly narrowly avoids a multi-pass warning.
> This should be the dominant consideration, because now a fully
> internal sort requires 4X the work_mem of my original 16MB work_mem
> example table/query.
>
> The quadratic growth in a simple hybrid sort-merge strategy's ability
> to avoid a multi-pass merge phase (growth relative to linear increases
> in work_mem) can be demonstrated with simple experiments.

OK, so reversing this analysis, with the default work_mem of 4MB, we'd
need a multi-pass merge for more than 235MB/4 = 58MB of data.  That is
very, very far from being a can't-happen scenario, and I would not at
all think it would be acceptable to ignore such a case.  Even ignoring
the possibility that someone with work_mem = 8MB will try to sort
235MB of data strikes me as out of the question.  Those seem like
entirely reasonable things for users to do.  Greg's example of someone
with work_mem = 1GB trying to sort 4TB does not seem like a crazy
thing to me.  Yeah, in all of those cases you might think that users
should set work_mem higher, but that doesn't mean that they actually
do.  Most systems have to set work_mem very conservatively to make
sure they don't start swapping under heavy load.
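
(To make the scaling concrete: taking the quoted experiment's ~235MB of data
just fitting a single-pass merge at 8MB of work_mem as the reference point,
and assuming the ceiling grows with the square of work_mem, a throwaway
calculation reproduces the 58MB figure above and lands in the same ballpark
as Greg's 1GB/4TB example:)

/* Rough arithmetic only: scale the observed 235MB-at-8MB single-pass ceiling
 * quadratically with work_mem.  The 235MB reference comes from the experiment
 * quoted above; everything else is extrapolation. */
#include <stdio.h>

int
main(void)
{
    const double ref_mem_mb = 8.0;
    const double ref_ceiling_mb = 235.0;
    double      work_mem_mb[] = {4, 8, 16, 64, 1024};

    for (int i = 0; i < 5; i++)
    {
        double      scale = work_mem_mb[i] / ref_mem_mb;
        double      ceiling_mb = ref_ceiling_mb * scale * scale;

        printf("work_mem = %6.0fMB -> single-pass merge up to ~%.0fMB of data\n",
               work_mem_mb[i], ceiling_mb);
    }
    return 0;
}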

I think you need to revisit your assumptions here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in numeric multiplication

2015-11-19 Thread Tom Lane
Dean Rasheed  writes:
> On 18 November 2015 at 22:19, Tom Lane  wrote:
>> Still, this proves that we are onto a real problem.

> Agreed. I had a go at turning that example into something using
> log(base, num) so that the result would be visible in a pure SQL test,
> but I didn't have any luck.

I experimented with that idea some, and found a test case that would
trigger the Assert during log_var's final div_var_fast call:

select log(
exp(.5050012831598249352382434889825483159817)
,
exp(.0099));

However, the emitted answer was the same with or without my proposed
patch.  So I dug deeper by putting in some debug printf's, and found that
the overflow occurs at this point:

div[81] will overflow: div[qi] = 218943 qdigit = 7673 maxdiv = 210375
div[]:
  0001              
                
           -001    5049 
9871 6840 1750 6476 1756 5110 1745 1684 0183 0860 9567 3786 7660 2671 2843 5974 
6515 4013 7886 0978 7230 5372 8162 7077 2013 6185 3746 3095 8528 1595 3071 7541 
7920 7950 218943 -2103498897 -2103499629 -2103499629 -2103499629 -2103504578

Note that div[qi+1], and indeed all the remaining dividend digits, are
large negative values.  This is what you'd expect if the carry propagation
step hasn't run for awhile, which is a precondition for div[qi] being
large enough to cause an issue.  When we compute 218943 * 10000, we will
indeed get an overflow, and the result will wrap around to some large
negative value (2^32 less than it should be).  Then we will add that to
div[qi+1], and we'll get *another* overflow, wrapping what nominally
should have been a negative sum around to a positive value (2^32 more than
it should be).  So the two overflows cancel and we get exactly the correct
new value of div[qi+1].
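
(A standalone illustration of that cancellation, using unsigned arithmetic to
emulate the 32-bit wraparound without invoking undefined behaviour; the
-2103498897 value is borrowed from the dump above, and the rest is arbitrary:)

/* Shows how two offsetting int32 wraparounds cancel: the product wraps by
 * -2^32, the subsequent addition wraps by +2^32, and the final digit comes
 * out exactly right.  Assumes a two's-complement platform. */
#include <inttypes.h>
#include <stdio.h>

int
main(void)
{
    int64_t     product64 = (int64_t) 218943 * 10000;  /* 2,189,430,000: exceeds INT32_MAX */
    int32_t     product32 = (int32_t) (uint32_t) product64;    /* wraps to product64 - 2^32 */
    int32_t     next = -2103498897;     /* a large negative div[qi+1]-style value */

    int64_t     exact = (int64_t) next + product64;    /* computed without overflow */
    int32_t     wrapped = (int32_t) ((uint32_t) next + (uint32_t) product32);  /* wraps again, by +2^32 */

    printf("exact = %" PRId64 ", wrapped 32-bit result = %" PRId32 "\n",
           exact, wrapped);
    return 0;
}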

I do not know whether it's possible to devise a test case where you don't
get offsetting overflows.  It may be that there's no operational bug here.
Still, the code is surely not behaving per face value.

> I had a go at trying to find a simpler approach and came up with the
> attached, which computes the value intended for div[qi+1] near the top
> of the loop (using doubles) so that it can detect if it will overflow,
> and if so it enters the normalisation block. It still needs some work
> to prove that the normalised value for fnextdigit won't overflow, but
> it feels like a promising, easier-to-follow approach to me. What do
> you think?

I'll look at this later ...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY (INSERT/UPDATE/DELETE .. RETURNING ..)

2015-11-19 Thread Michael Paquier
On Thu, Nov 19, 2015 at 11:04 AM, Marko Tiikkaja wrote:
> This was discussed in 2010 when CTEs got the same treatment, and I agree
> with what was decided back then.  If someone needs to make PreparableStmt
> different from what COPY and CTEs support, we can split them up.  But
> they're completely identical after this patch, so splitting them up right
> now is somewhat pointless.
> Further, if someone's going to add new stuff to PreparableStmt, she should
> probably think about whether it would make sense to add it to COPY and CTEs
> from day one, too, and in that case not splitting them up is actually a win.

Personally, I would err on the safe side and actually update it.
If someone were to add a new node type in PreparableStmt I am pretty
sure that we are going to forget to update the COPY part, leading us
to unwelcome bugs. And that would not be cool.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pgbench unusable after crash during pgbench

2015-11-19 Thread Thom Brown
Hi,

I'm using git master, and if I crash the database whilst it's running
pgbench, then restart the database and try to run pgbench again, I
can't:

thom@swift:~/Development/postgresql$ pgbench -c 1 -j 1 -T 20 -S pgbench
...crash database...
connection to database "pgbench" failed:
could not connect to server: Connection refused
Is the server running locally and accepting
connections on Unix domain socket "/tmp/.s.PGSQL.5488"?
thom@swift:~/Development/postgresql$ pg_ctl start
pg_ctl: another server might be running; trying to start server anyway
server starting
thom@swift:~/Development/postgresql$ pgbench -c 1 -j 1 -T 20 -S pgbench
starting vacuum...end.
setrandom: \setrandom maximum is less than minimum
client 0 aborted in state 1; execution of meta-command failed
transaction type: SELECT only
scaling factor: 0
query mode: simple
number of clients: 1
number of threads: 1
duration: 20 s
number of transactions actually processed: 0

I can otherwise use the database without issue.

Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-11-19 Thread Kouhei Kaigai
> On Tue, Nov 17, 2015 at 10:22 PM, Kouhei Kaigai  wrote:
> > Apart from EPQ rechecks, the above aggressive push-down idea allows to send
> > contents of multiple relations to the remote side. In this case, ForeignScan
> > needs to have multiple sub-plans.
> >
> > For example, please assume here is three relations; tbl_A and tbl_B are
> > local and small, tbl_F is remote and large.
> > In case when both of (tbl_A JOIN tbl_F) and (tbl_B JOIN tbl_F) produces
> > large number of rows thus consumes deserved amount of network traffic but
> > (tbl_A JOIN tbl_B JOIN tbl_F) produce small number of rows, the optimal
> > strategy is to send local contents to the remote side once then run
> > a remote query here to produce relatively smaller rows.
> > In the implementation level, ForeignScan shall represent (tbl_A JOIN tbl_B
> > JOIN tbl_F), then it returns a bunch of joined tuples. Its remote query
> > contains VALUES(...) clauses to pack contents of the tbl_A and tbl_B, thus,
> > it needs to be capable to execute underlying multiple scan plans and fetch
> > tuples prior to remote query execution.
> 
> Hmm, maybe.  I'm not entirely sure multiple subplans is the best way
> to implement that, but let's argue about that another day.
>
So, are you suggesting to make a patch that allows ForeignScan to have
multiple sub-plans right now? Or, one sub-plan?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-11-19 Thread Thom Brown
On 27 March 2015 at 04:54, Kyotaro HORIGUCHI
 wrote:
> Hello,
>
> At Thu, 26 Mar 2015 18:50:24 +0100, Andres Freund  
> wrote in <20150326175024.gj...@alap3.anarazel.de>
>> I think the problem here is that the *primary* makes no such
>> assumptions. Init forks are logged via stuff like
>>   smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
> ..
>> i.e. the data is written out directly to disk, circumventing
>> shared_buffers. It's pretty bad that we don't do the same on the
>> standby. For master I think we should just add a bit to the XLOG_FPI
>> record saying the data should be forced out to disk. I'm less sure
>> what's to be done in the back branches. Flushing every HEAP_NEWPAGE
>> record isn't really an option.
>
> The problem exists only for INIT_FORKNUM. So I suppose it is
> enough to check forknum to decide whether to sync immediately.
>
> Specifically for this instance, syncing buffers of INIT_FORKNUM
> at the end of XLOG_FPI block in xlog_redo fixed the problem.
>
> Another (ugly!) solution would be syncing only the buffers that are for
> INIT_FORKNUM and BM_DIRTY in ResetUnloggedRelations(op =
> UNLOGGED_RELATION_INIT). This is a catch-all-at-once solution
> though it is a kind of reversion of fast promotion. But buffers
> to be synced here should be pretty few.

This bug still exists.

Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY (INSERT/UPDATE/DELETE .. RETURNING ..)

2015-11-19 Thread Marko Tiikkaja

On 11/19/15 1:17 PM, Michael Paquier wrote:

On Thu, Nov 19, 2015 at 11:04 AM, Marko Tiikkaja wrote:

Further, if someone's going to add new stuff to PreparableStmt, she should
probably think about whether it would make sense to add it to COPY and CTEs
from day one, too, and in that case not splitting them up is actually a win.


Personally, I would take it on the safe side and actually update it.
If someone were to add a new node type in PreparableStmt I am pretty
sure that we are going to forget to update the COPY part, leading us
to unwelcome bugs. And that would not be cool.


They'd have to get past this:

+       if (query->commandType != CMD_SELECT &&
+           query->returningList == NIL)
+       {
+           ereport(ERROR,
+                   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                    errmsg("COPY query must have a RETURNING clause")));
+       }

Of course, something might break if we added a new statement type which 
supported RETURNING, but I'm really not worried about that.  I'm not 
dead set against adding some Assert(IsA()) calls here, but I don't see 
the point.



.m


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning: HS_KEY redefined (9.5 beta2)

2015-11-19 Thread Tom Lane
Alvaro Herrera  writes:
> Erik Rijkers wrote:
>> make contrib:
>> In file included from hstore_plperl.c:6:0:
>> ../../contrib/hstore/hstore.h:79:0: warning: "HS_KEY" redefined
>> #define HS_KEY(arr_,str_,i_) ((str_) + HSE_OFF((arr_)[2*(i_)]))

> So we need to get this one fixed.

As for the HS_KEY conflict, I'm not too thrilled with the previous
suggestion of "#undef HS_KEY".  That seems pretty fragile, ie it
depends on inclusion order.  What do people think of doing
"s/HS_KEY/HSTORE_KEY/g"?  (I guess this would also hit the
HS_KEYLEN macro for consistency.)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: LISTEN *

2015-11-19 Thread Tom Lane
Marko Tiikkaja  writes:
> I've in the past wanted to listen on all notification channels in the 
> current database for debugging purposes.  But recently I came across 
> another use case.  Since having multiple postgres backends listening for 
> notifications is very inefficient (one Thursday I found out 30% of our 
> CPU time was spent spinning on s_locks around the notification code), it 
> makes sense to implement a notification server of sorts which then 
> passes on notifications from postgres to interested clients.

... and then you gotta get the notifications to the clients, so it seems
like this just leaves the performance question hanging.  Why don't we find
and fix the actual performance issue (assuming that 07e4d03fb wasn't it)?

The reason I'm not terribly enthused about this proposal is that some
implementations of LISTEN (for example, our pre-9.0 one) would be unable
to support LISTEN *.  It's not too hard to imagine that at some point we
might wish to redo the existing implementation to reduce the overhead of
all listeners seeing all messages, and then having promised we could do
LISTEN * would be a serious restriction on our flexibility.  So while
I'm not necessarily trying to veto the idea, I think it has significant
opportunity cost, and I'd like to see a more solid rationale than this
one before we commit to it.

In any case, it would be good to understand exactly what's the performance
issue that's biting you.  Can you provide a test case that reproduces
that behavior?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Minor ON CONFLICT comment tweak

2015-11-19 Thread Peter Geoghegan
Attached patch fixes minor issue with an ON CONFLICT DO UPDATE comment.

-- 
Peter Geoghegan
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index dabaea9..119061e 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1012,8 +1012,9 @@ lreplace:;
  *
  * Try to lock tuple for update as part of speculative insertion.  If
  * a qual originating from ON CONFLICT DO UPDATE is satisfied, update
- * (but still lock row, even though it may not satisfy estate's
- * snapshot).
+ * (otherwise, still lock row).  Updated (or merely locked) row may not
+ * satisfy estate's snapshot; only raise an error at higher isolation
+ * levels.
  *
  * Returns true if if we're done (with or without an update), or false if
  * the caller must retry the INSERT from scratch.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-11-19 Thread Amit Kapila
On Thu, Nov 19, 2015 at 9:29 PM, Robert Haas  wrote:
>
> On Wed, Nov 18, 2015 at 10:41 PM, Amit Kapila 
wrote:
> > I think whats going on here is that when any of the session doesn't
> > get any workers, we shutdown the Gather node which internally destroys
> > the dynamic shared memory segment as well.  However the same is
> > needed as per current design for doing scan by master backend as
> > well.  So I think the fix would be to just do shutdown of workers which
> > actually won't do anything in this scenario.
>
> It seems silly to call ExecGatherShutdownWorkers() here when that's
> going to be a no-op.  I think we should just remove that line and the
> if statement before it altogether and replace it with a comment
> explaining why we can't nuke the DSM at this stage.
>

Isn't it better to destroy the memory for the readers array, as that gets
allocated even if there are no workers available for execution?

Attached patch fixes the issue by just destroying readers array.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


fix_early_dsm_destroy_v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] documentation for wal_retrieve_retry_interval

2015-11-19 Thread Michael Paquier
On Fri, Nov 20, 2015 at 1:33 AM, Peter Eisentraut  wrote:
> There is no documentation what use case the new (in 9.5) parameter
> wal_retrieve_retry_interval is for.  The commit message
> (5d2b45e3f78a85639f30431181c06d4c3221c5a1) alludes to something, but
> even that is not clear, and obviously in the wrong place.  Could we come
> up with something more to put into the documentation?

Yeah, we should highlight the fact that recovery can be made more
responsive when looking for new WAL. In archive recovery, this means
that new WAL segments can be detected more quickly, making recovery
more responsive. The opposite is actually what led to the patch: the
requirement was to limit the number of times the archive host was
polled by a server that had low activity, the archive host being on AWS.

An idea would be something like the patch attached. Thoughts?
-- 
Michael


20151120_wal_retry_interval_doc.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [DESIGN] ParallelAppend

2015-11-19 Thread Amit Kapila
On Fri, Nov 20, 2015 at 1:25 AM, Robert Haas  wrote:
>
> On Thu, Nov 19, 2015 at 2:59 AM, Amit Kapila 
wrote:
> > Won't it be useful to consider parameterized paths for below kind of
> > plans where we can push the jointree to worker and each worker can
> > scan the complete outer relation A and then the rest work is divided
> > among workers (ofcourse there can be other ways to parallelize such
joins,
> > but still the way described also seems to be possible)?
> >
> > NestLoop
> > -> Seq Scan on A
> > Hash Join
> > Join Condition: B.Y = C.W
> > -> Seq Scan on B
> > -> Index Scan using C_Z_IDX on C
> > Index Condition: C.Z = A.X
>
> I had thought that this sort of plan wouldn't actually occur in real
> life, but it seems that it does.  What you've written here is a little
> muddled - the hash join has no hash underneath, for example, and
> there'd have to be some sort of join order restriction in order to
> consider a plan of this type.  However, somewhat to my surprise, I was
> able to get a plan much like this by doing this:
>
..
>
> So, all in all, I think this isn't a very promising type of plan -
> both because we haven't got the infrastructure to make it safe to
> execute today, and because even if we did have that infrastructure it
> wouldn't be the right choice except in narrow circumstances.
>

I think it would be helpful to parallelize not only the above type of plan
but some other forms of joins as well (refer to the "Parameterized Paths"
section in optimizer/README), where the parameterized-path concept
will be required.  I am not sure we can say that such cases will be
narrow, so let's leave them; but surely we don't have enough infrastructure
at the moment to parallelize them.


>  We can
> of course revise that decision in the future if things look different
> then.
>

No issues.  The main reason why I brought up this discussion is to
see the possibility of keeping logic of add_partial_path() and add_path()
same, so that it is easy to maintain.  There is no correctness issue here,
so I defer it to you.

> >
> > Because I think the way code is written, it assumes that for each of the
> > inheritence-child relation which has pages lesser than threshold, half
> > the work will be done by master-backend which doesn't seem to be the
> > right distribution.  Consider a case where there are three such children
> > each having cost 100 to scan, now it will cost them as
> > 100/1.5 + 100/1.5 + 100/1.5 which means that per worker, it is
> > considering 0.5 of master backends work which seems to be wrong.
> >
> > I think for Append case, we should consider this cost during Append path
> > creation in create_append_path().  Basically we can make cost_seqscan
> > to ignore the cost reduction due to parallel_degree for inheritance
> > relations
> > and then during Append path creation we can consider it and also
consider
> > work unit of master backend as 0.5 with respect to overall work.
>
> No, I don't think that's right.  It's true that the way we're
> calculating parallel_degree for each relation is unprincipled right
> now, and we need to improve that.  But if it were correct, then what
> we're doing here would also be correct.  If the number of workers
> chosen for each child plan reflected the maximum number that could be
> used effectively by that child plan, then any extras wouldn't speed
> things up even if they were present,
>

Okay, but I think that's not what I am talking about.  I am talking about
the below code in cost_seqscan:

- if (nworkers > 0)
-     run_cost = run_cost / (nworkers + 0.5);
+ if (path->parallel_degree > 0)
+     run_cost = run_cost / (path->parallel_degree + 0.5);

It will count 50% of the master backend's effort towards the scan of each
child relation; does that look correct to you?  Shouldn't that 50% of the
master backend's effort be spread across the scans of all the child
relations instead?



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] pgbench unusable after crash during pgbench

2015-11-19 Thread Robert Haas
On Thu, Nov 19, 2015 at 6:03 AM, Thom Brown  wrote:
> I'm using git master, and if I crash the database whilst it's running
> pgbench, then restart the database and try to run pgbench again, I
> can't:
>
> thom@swift:~/Development/postgresql$ pgbench -c 1 -j 1 -T 20 -S pgbench
> ...crash database...
> connection to database "pgbench" failed:
> could not connect to server: Connection refused
> Is the server running locally and accepting
> connections on Unix domain socket "/tmp/.s.PGSQL.5488"?
> thom@swift:~/Development/postgresql$ pg_ctl start
> pg_ctl: another server might be running; trying to start server anyway
> server starting
> thom@swift:~/Development/postgresql$ pgbench -c 1 -j 1 -T 20 -S pgbench
> starting vacuum...end.
> setrandom: \setrandom maximum is less than minimum
> client 0 aborted in state 1; execution of meta-command failed
> transaction type: SELECT only
> scaling factor: 0
> query mode: simple
> number of clients: 1
> number of threads: 1
> duration: 20 s
> number of transactions actually processed: 0
>
> I can otherwise use the database without issue.

The only explanation I can think of here is that pgbench on startup
queries one of the tables to figure out the scale factor, and it seems
to be coming up with scaling factor 0, suggesting that the table was
perhaps truncated.  If the tables are unlogged, that's expected.
Otherwise, it sounds like a serious bug in recovery.
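
(For reference, a stripped-down libpq sketch of roughly what pgbench does at
startup, as far as I recall it: it derives the scale from pgbench_branches,
and a count of zero is what later makes \setrandom's maximum smaller than its
minimum.)

/* Simplified sketch, not pgbench's actual code: how an empty (e.g. truncated
 * unlogged) pgbench_branches table turns into scaling factor 0. */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=pgbench");
    PGresult   *res;
    int         scale;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }
    res = PQexec(conn, "select count(*) from pgbench_branches");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        return 1;
    }
    scale = atoi(PQgetvalue(res, 0, 0));
    if (scale == 0)
        fprintf(stderr, "scaling factor 0: table empty or truncated "
                "(unlogged tables after crash recovery)?\n");
    PQclear(res);
    PQfinish(conn);
    return 0;
}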

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GIN pending list clean up exposure to SQL

2015-11-19 Thread Alvaro Herrera
Alvaro Herrera wrote:
> Jeff Janes wrote:
> > I've written a function which allows users to clean up the pending list.
> > It takes the index name and returns the number of pending list pages
> > deleted.
> 
> I just noticed that your patch uses AccessShareLock on the index.  Is
> that okay?  I would have assumed that you'd need ShareUpdateExclusive
> (same as vacuum uses), but I don't really know.  Was that a carefully
> thought-out choice?

After reading gitPendingCleanup it becomes clear that there's no need
for a stronger lock than what you've chosen.  Jaime Casanova just
pointed this out to me.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] TAP test example

2015-11-19 Thread Tom Lane
Jim Nasby  writes:
> On 11/19/15 9:57 AM, Tom Lane wrote:
>> Agreed.  We aren't going to accept any core tests that depend on DBI/DBD.
>> Now, that might be a fine example for tests written to test something that
>> uses Postgres ... but not as an example of how to write a core test.

> Isn't one of the goals of the TAP framework to be able to write SQL 
> level tests where we don't have to worry about random output changes, 
> like what line number on a script caused an error?

So?

Actually, I would say that one of the most serious black marks against
every one of the TAP tests we've got is the difficulty of finding out why
a test failed, when that happens.  "ok" versus "not ok" simply doesn't cut
it.  So the idea of not reporting what the actual output was doesn't seem
attractive to me at all.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] TAP test example

2015-11-19 Thread Nikolay Shaplov
In a message of 19 November 2015 10:57:27, Tom Lane wrote:
> Nikolay Shaplov  writes:
> > But this is not the issue. We can change it any way we like. The question
> > do we need such example at all, or no?
> 
> I'm kind of -1 on the idea of a module that doesn't actually do something
> *useful*. 
Actually it tests that the sameuser option from pg_hba.conf works as expected:
it allows a user to connect only to the database with the same name as the user name.

I can't say it is something really useful. But my intention was to write an
example, so I chose the simplest functionality to test, so that the user of the
example can pay all their attention to the Perl-related stuff.

I just thought this example might be useful to others. If so, I can prepare it
for commit. Otherwise I will just use it where I intended.

-- 
Nikolay Shaplov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: LISTEN *

2015-11-19 Thread Tom Lane
Marko Tiikkaja  writes:
> On 11/19/15 4:21 PM, Tom Lane wrote:
>> ... and then you gotta get the notifications to the clients, so it seems
>> like this just leaves the performance question hanging.

> I'm not sure which performance question you think is left hanging.  If 
> every client is connected to postgres, you're waking up tens if not 
> hundreds of processes tens if not hundreds of times per second.  Each of 
> them start a transaction, check which notifications are visible in the 
> queue from clog and friends, go through the tails of every other process 
> to see whether they should advance the tail pointer of the queue, commit 
> the transaction and go back to sleep only to be immediately woken up again.

> If they're not connected to postgres directly, this kind of complex 
> processing only happens once, and then the notification server just 
> unconditionally serves all notifications to the clients based on a 
> simple map lookup.  It should be trivial to see how the overhead is avoided.

Meh.  You've gotten rid of one single-point-of-contention by creating
another one.  Essentially, your argument why this is an improvement is
that any random application developer using Postgres is smarter than
the Postgres development team and can therefore produce something
better-performing off the cuff.  Which may indeed be true, but shall we
say it's unproven?

I don't buy the argument about removing transactional overhead, either.
The point of LISTEN/NOTIFY is to inform clients that something has
happened in the database that they ought to react to.  So necessarily,
there's going to be follow-on database activity.  If you're using
LISTEN/NOTIFY to wake up clients who otherwise wouldn't need a DB
connection at all, then you chose the wrong signaling mechanism.

>> In any case, it would be good to understand exactly what's the performance
>> issue that's biting you.  Can you provide a test case that reproduces
>> that behavior?

> I've attached a Go program which simulates quite accurately the 
> LISTEN/NOTIFY-part of our setup.  What it does is:

Thanks, I'll take a look later.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] documentation for wal_retrieve_retry_interval

2015-11-19 Thread Peter Eisentraut
There is no documentation what use case the new (in 9.5) parameter
wal_retrieve_retry_interval is for.  The commit message
(5d2b45e3f78a85639f30431181c06d4c3221c5a1) alludes to something, but
even that is not clear, and obviously in the wrong place.  Could we come
up with something more to put into the documentation?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: LISTEN *

2015-11-19 Thread Alvaro Herrera
Marko Tiikkaja wrote:

> On the test server I'm running on, it doesn't look quite as bad as the
> profiles we had in production, but s_lock is still the worst offender in the
> profiles, called from:
> 
>   - 80.33% LWLockAcquire
> + 48.34% asyncQueueReadAllNotifications
> + 23.09% SIGetDataEntries
> + 16.92% SimpleLruReadPage_ReadOnly
> + 10.21% TransactionIdIsInProgress
> + 1.27% asyncQueueAdvanceTail
> 
> which roughly looks like what I recall from our actual production profiles.

So the problem is in the bad scalability of LWLock, not in async.c itself?
In master, the spinlock there has been replaced with atomics; does that branch
work better?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Refactoring of LWLock tranches

2015-11-19 Thread Andres Freund
On November 19, 2015 8:09:38 AM PST, Robert Haas  wrote:
>On Thu, Nov 19, 2015 at 9:04 AM, Ildus Kurbangaliev
> wrote:
>> The moving base tranches to shared memory has been discussed many
>times.
>> The point is using them later in pg_stat_activity and other
>monitoring
>> views.
>
>I'm not in agreement with this idea.  Actually, I'd prefer that the
>tranches live in backend-private memory, not shared memory, so that we
>could for example add backend-local counters to them if desired.  

I don't buy that argument. It'd be nearly trivial to have a 
backend_tranchestats array, indexed by the tranche id, for such counters.

It's really not particularly convenient to allocate tranches right now. You 
have to store at least the identifier in shared memory and then redo the 
registration in each process. Otherwise some processes can't identify them. 
Which is rather inconvenient if you want to register some at runtime.
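
(For readers following along, the dance looks roughly like the sketch below
when done from shmem_startup_hook; the extension-side names are invented, and
the registration only reaches every backend because every backend runs the
hook at startup -- which is exactly what is missing for tranches registered
later at runtime.)

/* Sketch of the pattern being discussed.  MySharedState and the hook wiring
 * are invented for illustration; the LWLock/Shmem calls are the 9.5-era API
 * as best I recall it. */
#include "postgres.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"

typedef struct MySharedState
{
    int         tranche_id;     /* kept in shared memory so every backend sees it */
    LWLock      lock;
} MySharedState;

static MySharedState *my_state;
static LWLockTranche my_tranche;

/* called from shmem_startup_hook in every backend */
static void
my_shmem_startup(void)
{
    bool        found;

    LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
    my_state = ShmemInitStruct("my_extension", sizeof(MySharedState), &found);
    if (!found)
    {
        /* first process here: allocate the tranche id, initialize the lock */
        my_state->tranche_id = LWLockNewTrancheId();
        LWLockInitialize(&my_state->lock, my_state->tranche_id);
    }
    LWLockRelease(AddinShmemInitLock);

    /* every process: redo the registration so it can identify the tranche */
    my_tranche.name = "my_extension";
    my_tranche.array_base = &my_state->lock;
    my_tranche.array_stride = sizeof(LWLock);
    LWLockRegisterTranche(my_state->tranche_id, &my_tranche);
}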


--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GIN pending list clean up exposure to SQL

2015-11-19 Thread Jaime Casanova
On 12 August 2015 at 20:19, Jeff Janes  wrote:
>
> But where does this belong?  Core?  Its own separate extension?
>

I will say core. GIN indexes are in core and I don't see why this
function shouldn't be as well.
FWIW, BRIN indexes have a similar function, brin_summarize_new_values(), in core

-- 
Jaime Casanova  www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2015-11-19 Thread Kouhei Kaigai
> On Mon, Nov 16, 2015 at 9:03 PM, Kouhei Kaigai  wrote:
> > I guess we will put a pointer to static ExtensibleNodeMethods structure
> > on ForeignScan, CustomScan, CustomScanState and etc...
> 
> I think that makes it confusing.  What I'd prefer to do is have only
> the name stored in the node, and then people who need methods can look
> up those methods based on the name.
>
OK. It is equivalent.

> > Let me write down the steps for deserialization. Please correct me if it is
> > not what you are thinking.
> > First, stringToNode picks up a token that shall have node label, like
> > "SEQSCAN", "CUSTOMSCAN" and others. Once we can identify the following
> > tokens belong to CustomScan node, it can make advance the pg_strtok pointer
> > until "methods" field. The method field is consists of a pair of 
> > library_name
> > and symbol_name, then it tries to load the library and resolve the symbol.
> 
> No.  It reads the "extnodename" field (or whatever we decide to call
> it) and calls GetExtensibleNodeMethods() on that name.  That name
> returns the appropriate structure.  C libraries that get loaded into
> the server can register their extensible node types at _PG_init()
> time.
>
OK, I got.

> > Even if library was not loaded on the worker side, this step allows to 
> > ensure
> > the library gets loaded, thus, PG_init() can RegisterExtensibleNodeMethods.
> 
> A parallel worker will load the same loadable modules which the master
> has got loaded.  If you use some other kind of background worker, of
> course, you're on your own.
>
Sorry, I overlooked that. And SerializeLibraryState() and RestoreLibraryState()
indeed allow background workers, even outside a parallel context, to restore
the library state easily.

> > I have two concerns on the above proposition.
> > 1) 'node_size' field implies user defined structure to have fixed length.
> >Extension may want to use variable length structure. The example below
> >shows GpuJoinState has multiple innerState structure according to the
> >number of inner relations joined at once.
> >https://github.com/pg-strom/devel/blob/master/src/gpujoin.c#L240
> 
> OK, we can replace the node_size field with an allocator callback:
> Node (*nodeAlloc)(void) or something like that.
> 
> > 2) It may be valuable if 'nodeRead(void)' can take an argument of the
> >knows/common portion of ForeignScan and etc... It allows extension
> >to adjust number of private fields according to the known part.
> >For example, I can expect flexible length private fields according
> >to the length of custom_subplans list.
> 
> I don't understand this bit.
>
The above two points are for the case where an extension wants to use
variable-length fields for its private data.
The nodeAlloc() callback is not a perfect answer for that use case because
the length of the variable-length fields is usually determined by the
value of other fields (like num_inner_rels, num_gpu_devices, ...), so
we cannot determine the correct length before reading them.

Let's assume an custom-scan extension that wants to have:

  typedef struct {
  CustomScancscan;
  int   priv_value_1;
  long  priv_value_2;
  extra_info_t  extra_subplan_info[FLEXIBLE_ARRAY_MEMBER];
  /* its length equal to number of sub-plans */
  } ExampleScan;

The "extnodename" within CustomScan allows to pull callback functions
to handle read node from the serialized format.
However, nodeAlloc() callback cannot determine the correct length
(= number of sub-plans in this example) prior to read 'cscan' part.

So, I'd like to suggest _readCustomScan (and other extendable nodes
also) read tokens on local CustomScan variable once, then call
  Node *(nodeRead)(Node *local_node)
to allocate entire ExampleScan node and read other private fields.

The local_node is already filled up by the core backend prior to
the callback invocation, so extension can know how many sub-plans
are expected thus how many private tokens shall appear.
It also means extension can know exact length of the ExampleScan
node, so it can allocate the node as expected then fill up
remaining private tokens.
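
(To make the shape concrete, a tiny standalone mock-up of that two-step read
is below; all type and function names in it are invented stand-ins, not the
actual readfuncs.c API.)

/* Standalone mock-up of the proposed two-step read: the fixed part is parsed
 * first, and the callback uses it to size and fill the full node. */
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct MockCustomScan
{
    int         num_subplans;   /* known after reading the fixed part */
} MockCustomScan;

typedef struct MockExampleScan
{
    MockCustomScan cscan;       /* fixed, core-readable portion */
    int         priv_value;
    int         extra[];        /* one entry per sub-plan */
} MockExampleScan;

/* callback: core has filled *local, the extension sizes and completes the node */
static MockExampleScan *
example_node_read(const MockCustomScan *local, const int *priv_tokens)
{
    MockExampleScan *node;

    node = malloc(offsetof(MockExampleScan, extra) +
                  local->num_subplans * sizeof(int));
    node->cscan = *local;       /* copy the already-read fixed part */
    node->priv_value = priv_tokens[0];
    for (int i = 0; i < local->num_subplans; i++)
        node->extra[i] = priv_tokens[1 + i];
    return node;
}

int
main(void)
{
    MockCustomScan local = {3};
    int         tokens[] = {42, 7, 8, 9};   /* priv_value, then per-subplan data */
    MockExampleScan *node = example_node_read(&local, tokens);

    printf("subplans=%d priv=%d extra[2]=%d\n",
           node->cscan.num_subplans, node->priv_value, node->extra[2]);
    free(node);
    return 0;
}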

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-11-19 Thread Kouhei Kaigai
> On Thu, Nov 19, 2015 at 6:39 AM, Kouhei Kaigai  wrote:
> > So, are you suggesting to make a patch that allows ForeignScan to have
> > multiple sub-plans right now? Or, one sub-plan?
> 
> Two:
> 
> http://www.postgresql.org/message-id/CA+TgmoYZeje+ot1kX4wdoB7R7DPS0CWXAzfqZ-
> 14ykfkgkr...@mail.gmail.com
>
Hmm. Two is a bit mysterious to me, because two sub-plans (likely)
means this ForeignScan node checks the join clauses and reconstructs
a joined tuple by itself, but does not check the scan clauses pushed
down (that is the job of the inner/outer scan plans, isn't it?).
In that case, how do we treat N-way remote join cases (N>2) if we
assume such a capability in the FDW driver?

One sub-plan means the FDW driver runs an entire join sub-tree with a
local alternative sub-plan; that is my expectation for the majority case.
However, I cannot explain well why it should be two sub-plans, but not
an arbitrary number.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with index on unlogged table

2015-11-19 Thread Michael Paquier
On Fri, Nov 20, 2015 at 7:03 AM, Robert Haas  wrote:
> On Thu, Nov 19, 2015 at 7:19 AM, Thom Brown  wrote:
>> This bug still exists.
>
> Hmm.  This probably should have been on the open items list.

I have added an open item for 9.5. That's not a cool bug to ship in
the next GA release; even though it's not limited to 9.5, it is easily
reproducible in older stable branches as well.

> I didn't
> pay too much attention this to this before because it seemed like
> Andres and Michael were all over it.

This completely fell off my radar. Sorry about that.

For back-branches, I tend to agree with what Horiguchi-san mentioned
upthread: we had better issue an unconditional fsync on the relation
when INIT_FORKNUM is used for it while replaying an XLOG_FPI record.
That keeps the impact localized. An example patch is attached that
applies on top of REL9_4_STABLE. It's rather ugly, but it shows the
idea and fixes the issue.

For master and perhaps 9.5, I would think that a dedicated WAL record
type enforcing the fsync is the way to go. This special treatment
would apply only to btree and spgist when they use INIT_FORKNUM, so we
would not impact other relation types and code paths with this
behavior.

Thoughts?
-- 
Michael


20151120_xlog_fpi_replay.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Skip ALTER x SET SCHEMA if the schema didn't change

2015-11-19 Thread Robert Haas
On Wed, Nov 18, 2015 at 12:32 AM, Haribabu Kommi
 wrote:
> On Wed, Nov 18, 2015 at 6:02 AM, Robert Haas  wrote:
>> On Mon, Nov 16, 2015 at 4:27 AM, Marti Raudsepp  wrote:
>>> Thank you so much for the review and patch update. I should have done that
>>> myself, but I've been really busy for the last few weeks. :(
>>
>> Maybe I'm having an attack of the stupids today, but it looks to me
>> like the changes to pg_constraint.c look awfully strange to me.  In
>> the old code, if object_address_present() returns true, we continue,
>> skipping the rest of the loop.  In the new code, we instead set
>> alreadyChanged to true.  That causes both of the following if
>> statements, as revised, to fall out, so that we skip the rest of the
>> loop.  Huh?  Wouldn't a one line change to add oldNspId != newNspId to
>> the criteria for a simple_heap_update be just as good?
>
> Yes, that's correct, the above change can be written as you suggested.
> Updated patch attached with correction.
>
>> Backing up a bit, maybe we should be a bit more vigorous in treating a
>> same-namespace move as a no-op.  That is, don't worry about calling
>> the post-alter hook in that case - just have AlterConstraintNamespaces
>> start by checking whether oldNspId == newNspid right at the top; if
>> so, return.  The patch seems to have the idea that it is important to
>> call the post-alter hook even in that case, but I'm not sure whether
>> that's true.  I'm not sure it's false, but I'm also not sure it's
>> true.
>
> I am also not sure whether calling the post-alter hook in case of constraint 
> is
> necessarily required? but it was doing for other objects, so I suggested
> that way.

OK, committed with some additional cosmetic improvements.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Foreign Data Wrapper

2015-11-19 Thread Big Mike
Inspecting the multicorn source was helpful.

For the purposes of anyone searching the mailing lists, I needed to access
the quals natively. The code that helped me figure out how to do this was
in the native odbc fdw here:
https://github.com/ZhengYang/odbc_fdw/blob/master/odbc_fdw.c

particularly the odbcBeginForeignScan() and odbcIterateForeignScan(), and
odbcGetQual() functions.
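
(For the archives, the qual-walking code generally looks something like the
sketch below -- written from memory and simplified, so treat the details as
approximate; it only handles the "Var op Const" shape from the original
question, e.g. testval = 1.)

/* Sketch only: walk a list of expression clauses (e.g. the scan_clauses handed
 * to GetForeignPlan after extract_actual_clauses(), or plan->qual at executor
 * startup) and pull out the "column op constant" cases. */
#include "postgres.h"
#include "nodes/pg_list.h"
#include "nodes/primnodes.h"
#include "utils/lsyscache.h"

static void
extract_simple_quals(List *clauses)
{
    ListCell   *lc;

    foreach(lc, clauses)
    {
        Node       *clause = (Node *) lfirst(lc);

        if (IsA(clause, OpExpr))
        {
            OpExpr     *op = (OpExpr *) clause;

            if (list_length(op->args) == 2 &&
                IsA(linitial(op->args), Var) &&
                IsA(lsecond(op->args), Const))
            {
                Var        *var = (Var *) linitial(op->args);
                Const      *cst = (Const *) lsecond(op->args);

                elog(DEBUG1, "attno %d, operator %s, constant of type %u",
                     var->varattno, get_opname(op->opno), cst->consttype);
            }
        }
    }
}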


Thanks

On Fri, Nov 13, 2015 at 1:24 PM, Corey Huinker 
wrote:

> On Fri, Nov 13, 2015 at 1:46 PM, Big Mike  wrote:
>
>> Writing a Foreign Data Wrapper and interested in isolating the WHERE
>> clause to speed up the access of an indexed file on my filesystem. I'm
>> attempting to understand the inner workings of how the data is retrieved so
>> I'm writing code to just handle one case at the moment: WHERE clause on a
>> single column in the foreign 'table'.
>>
>> SELECT * FROM t WHERE testval = 1
>>
>> I have this code so far, an implementation of the IterateForeignScan
>> interface.
>>
>> static TupleTableSlot *
>> bIterateForeignScan(ForeignScanState *node) {
>> ...
>> RestrictInfo *rinfo = (RestrictInfo *)node->ss.ps.qual;
>> ...
>> }
>>
>> yet am not familiar with what I need to do to pick apart RestrictInfo in
>> order to gather 'testvar', '=', and '1' separately so I can interpret and
>> pass those through to my file parser.
>>
>> Am I going about this the correct way or is there another path I should
>> follow?
>>
>
>
> I would look at http://multicorn.org/ which gives you a working python
> framework. You subclass their ForeignDataWrapper class, override the
> __init__() and execute() functions, and that's about it. The execute()
> function has a list called quals that would set you up for the filtering
> you want to do.
>
> I would get the foreign data wrapper fully debugged this way before
> attempting to refactor in C. And when you do finally re-code, you can see
> what multicorn itself did to implement your filters.
>
>
>
>
>
>
>
>


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-11-19 Thread Jim Nasby

On 11/19/15 1:18 AM, Amit Langote wrote:

1) General purpose interface for (maintenance?) commands to report a set


I'm surprised no one has picked up on using this for DML. Certainly
anyone who works with ETL processes would love to be able to get some clue
about the status of a long-running query...



About pass 2, ie, lazy_index_vacuum() and
lazy_vacuum_heap(), I don't see how we can do better than reporting its
progress only after finishing all of it without any finer-grained
instrumentation. They are essentially black-box as far as the proposed
instrumentation approach is concerned. Being able to report progress per
index seems good but as a whole, a user would have to wait arbitrarily
long before numbers move forward. We might as well just report a bool
saying we're about to enter a potentially time-consuming index vacuum
round with possibly multiple indexes followed by lazy_vacuum_heap()
processing. Additionally, we can report the incremented count of the
vacuuming round (pass 2) once we are through it.


Another option is to provide the means for the index scan routines to 
report their progress. Maybe every index AM won't use it, but it'd 
certainly be a lot better than staring at a long_running boolean.



Note that we can leave them out of
percent_done of overall vacuum progress. Until we have a good solution for
number (3) above, it seems to difficult to incorporate index pages into
overall progress.


IMHO we need to either put a big caution sign on any % estimate that it 
could be wildly off, or just forgo it completely for now. I'll bet that 
if we don't provide it some enterprising users will figure out the best 
way to do this (similar to how the bloat estimate query has evolved over 
time).


Even if we never get a % done indicator, just being able to see what 
'position' a command is at will be very valuable.



As someone pointed out upthread, the final heap truncate phase can take
arbitrarily long and is outside the scope of lazy_scan_heap() to
instrument. Perhaps a bool, say, waiting_heap_trunc could be reported for
the same. Note that, it would have to be reported from lazy_vacuum_rel().


ISTM this is similar to the problem of reporting index status, namely 
that a progress reporting method needs to accept reports from multiple 
places in the code.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

2015-11-19 Thread Robert Haas
On Wed, Nov 18, 2015 at 10:21 AM, Alvaro Herrera
 wrote:
> In my days of Perl, it was starting to become frowned upon to call
> subroutines without parenthesizing arguments.  Is that no longer the
> case?  Because I notice there are many places in this patch and pre-
> existing that call psql with an argument list without parens.  And it's
> a bit odd because I couldn't find any other subroutine that we're using
> in that way.

I've been coding in Perl for more than 20 years and have never heard
of such a rule.

Maybe I am not part of the "in" crowd.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: LISTEN *

2015-11-19 Thread Marko Tiikkaja

On 11/19/15 4:21 PM, Tom Lane wrote:

Marko Tiikkaja  writes:

I've in the past wanted to listen on all notification channels in the
current database for debugging purposes.  But recently I came across
another use case.  Since having multiple postgres backends listening for
notifications is very inefficient (one Thursday I found out 30% of our
CPU time was spent spinning on s_locks around the notification code), it
makes sense to implement a notification server of sorts which then
passes on notifications from postgres to interested clients.


... and then you gotta get the notifications to the clients, so it seems
like this just leaves the performance question hanging.


I'm not sure which performance question you think is left hanging.  If 
every client is connected to postgres, you're waking up tens if not 
hundreds of processes tens if not hundreds of times per second.  Each of 
them starts a transaction, checks which notifications are visible in the 
queue from clog and friends, goes through the tails of every other process 
to see whether it should advance the tail pointer of the queue, commits 
the transaction and goes back to sleep only to be immediately woken up again.


If they're not connected to postgres directly, this kind of complex 
processing only happens once, and then the notification server just 
unconditionally serves all notifications to the clients based on a 
simple map lookup.  It should be trivial to see how the overhead is avoided.
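
To make that concrete, the core of such a fan-out server is not much more
than the loop below (a minimal C/libpq sketch, using select() instead of
epoll for brevity; dispatch_to_clients() is a stand-in for whatever map
lookup and client forwarding the real server does, and it LISTENs on one
hard-coded channel because LISTEN * doesn't exist yet):

    #include <stdio.h>
    #include <sys/select.h>
    #include <libpq-fe.h>

    /* stand-in for the real channel->clients map lookup and forwarding */
    static void
    dispatch_to_clients(const char *channel, const char *payload)
    {
        printf("channel=%s payload=%s\n", channel, payload);
    }

    int
    main(void)
    {
        PGconn *conn = PQconnectdb("host=/var/run/postgresql dbname=postgres");

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "%s", PQerrorMessage(conn));
            return 1;
        }

        /* with LISTEN * this single statement would cover every channel */
        PQclear(PQexec(conn, "LISTEN channel0"));

        for (;;)
        {
            fd_set      rfds;
            int         sock = PQsocket(conn);
            PGnotify   *n;

            FD_ZERO(&rfds);
            FD_SET(sock, &rfds);
            if (select(sock + 1, &rfds, NULL, NULL, NULL) < 0)
                break;

            PQconsumeInput(conn);
            while ((n = PQnotifies(conn)) != NULL)
            {
                dispatch_to_clients(n->relname, n->extra);
                PQfreemem(n);
            }
        }

        PQfinish(conn);
        return 0;
    }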



Why don't we find
and fix the actual performance issue (assuming that 07e4d03fb wasn't it)?


07e4d03fb wasn't it, no.


The reason I'm not terribly enthused about this proposal is that some
implementations of LISTEN (for example, our pre-9.0 one) would be unable
to support LISTEN *.  It's not too hard to imagine that at some point we
might wish to redo the existing implementation to reduce the overhead of
all listeners seeing all messages, and then having promised we could do
LISTEN * would be a serious restriction on our flexibility.  So while
I'm not necessarily trying to veto the idea, I think it has significant
opportunity cost, and I'd like to see a more solid rationale than this
one before we commit to it.


A reasonable point.


In any case, it would be good to understand exactly what's the performance
issue that's biting you.  Can you provide a test case that reproduces
that behavior?


I've attached a Go program which simulates quite accurately the 
LISTEN/NOTIFY-part of our setup.  What it does is:


  1) Open 50 connections, and issue a LISTEN in all of them
  2) Open another 50 connections, and deliver one notification every 
750 milliseconds from each of them


(My apologies for the fact that it's written in Go.  It's the only thing 
I can produce without spending significant amount of time working on this.)


On the test server I'm running on, it doesn't look quite as bad as the 
profiles we had in production, but s_lock is still the worst offender in 
the profiles, called from:


  - 80.33% LWLockAcquire
+ 48.34% asyncQueueReadAllNotifications
+ 23.09% SIGetDataEntries
+ 16.92% SimpleLruReadPage_ReadOnly
+ 10.21% TransactionIdIsInProgress
+ 1.27% asyncQueueAdvanceTail

which roughly looks like what I recall from our actual production profiles.


.m
package main

import (
"github.com/lib/pq"
"database/sql"
"fmt"
"log"
"time"
)

const connInfo = "host=/var/run/postgresql sslmode=disable"

func listener(i int) {
l := pq.NewListener(connInfo, time.Second, time.Second, nil)
err := l.Listen(fmt.Sprintf("channel%d", i))
if err != nil {
log.Fatal(err)
}
for {
<-l.Notify
}
}

func notifier(dbh *sql.DB, i int) {
for {
_, err := dbh.Exec(fmt.Sprintf("NOTIFY channel%d", i))
if err != nil {
log.Fatal(err)
}
time.Sleep(750 * time.Millisecond)
}
}

func main() {
openDb := func() *sql.DB {
db, _ := sql.Open("postgres", connInfo)
err := db.Ping()
if err != nil {
log.Fatal(err)
}
db.SetMaxIdleConns(2)
db.SetMaxOpenConns(2)
return db
}

for i := 0; i < 50; i++ {
go listener(i)
go notifier(openDb(), i)
time.Sleep(20 * time.Millisecond)
}
select{}
}

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Refactoring of LWLock tranches

2015-11-19 Thread Robert Haas
On Thu, Nov 19, 2015 at 9:04 AM, Ildus Kurbangaliev
 wrote:
> The moving base tranches to shared memory has been discussed many times.
> The point is using them later in pg_stat_activity and other monitoring
> views.

I'm not in agreement with this idea.  Actually, I'd prefer that the
tranches live in backend-private memory, not shared memory, so that we
could for example add backend-local counters to them if desired.  The
SLRU patch is the first time we've put them in shared memory, but it
would be easy to keep only the things that the tranche needs to point
to in shared memory and put the tranche itself back in each backend,
which I tend to think is what we should do.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] TAP test example

2015-11-19 Thread Jim Nasby

On 11/19/15 8:42 AM, Nikolay Shaplov wrote:

+sub psql_ok
+{
+   my $db = shift;
+   my $sql = shift;
+   my $comment = shift;


Isn't the preferred method of parameter assignment to use @_?

Also, I'd think one of the examples should use DBI, since presumably one 
of the big benefits of TAP is not dealing with raw psql output...

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] TAP test example

2015-11-19 Thread Nikolay Shaplov
В письме от 19 ноября 2015 09:39:41 пользователь Jim Nasby написал:
> On 11/19/15 8:42 AM, Nikolay Shaplov wrote:
> > +sub psql_ok
> > +{
> > +   my $db = shift;
> > +   my $sql = shift;
> > +   my $comment = shift;
> 
> Isn't the preferred method of parameter assignment to use @_?
Hm... it is the way I've written perl programs. There is more than one way to do it 
in perl, you know ;-) I think that this way is more understandable for others. 
But this is not the issue. We can change it any way we like. The question is, do 
we need such an example at all, or not?

> Also, I'd think one of the examples should use DBI, since presumably one
> of the big benefits to tap is not dealing with raw psql output...

I wrote this example based on the ssl TAP test. There was no DBI there, and I 
think that was done on purpose. If we add DBI to the tests, then we have to add it 
to the build dependencies, which is not a good idea, and so not a good example.

-- 
Nikolay Shaplov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GIN pending list clean up exposure to SQL

2015-11-19 Thread Alvaro Herrera
Jeff Janes wrote:
> I've written a function which allows users to clean up the pending list.
> It takes the index name and returns the number of pending list pages
> deleted.

I just noticed that your patch uses AccessShareLock on the index.  Is
that okay?  I would have assumed that you'd need ShareUpdateExclusive
(same as vacuum uses), but I don't really know.  Was that a carefully
thought-out choice?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] TAP test example

2015-11-19 Thread Tom Lane
Nikolay Shaplov  writes:
> But this is not the issue. We can change it any way we like. The question do 
> we need such example at all, or no?

I'm kind of -1 on the idea of a module that doesn't actually do something
*useful*.  Let's write some actual tests instead, and make them readable
enough that people can steal and repurpose the skeleton easily.

> I wrote this example based on ssl TAP test. There was no DBI there. And I 
> think it was done for purpose. If we add DBI to tests, then we should add it 
> to build dependencies. And it is not a good idea, and so not a good example.

Agreed.  We aren't going to accept any core tests that depend on DBI/DBD.
Now, that might be a fine example for tests written to test something that
uses Postgres ... but not as an example of how to write a core test.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-11-19 Thread Robert Haas
On Wed, Nov 18, 2015 at 10:41 PM, Amit Kapila  wrote:
> I think whats going on here is that when any of the session doesn't
> get any workers, we shutdown the Gather node which internally destroys
> the dynamic shared memory segment as well.  However the same is
> needed as per current design for doing scan by master backend as
> well.  So I think the fix would be to just do shutdown of workers which
> actually won't do anything in this scenario.

It seems silly to call ExecGatherShutdownWorkers() here when that's
going to be a no-op.  I think we should just remove that line and the
if statement before it altogether and replace it with a comment
explaining why we can't nuke the DSM at this stage.
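
Something like this, perhaps (just a sketch of the comment-only
replacement; I haven't gone back to the file, so treat the surrounding
details as assumptions):

    /*
     * Even if no workers could be launched, don't destroy the DSM
     * segment here.  The leader still runs the parallel plan itself,
     * and the parallel context (including its shared memory) has to
     * stay alive until the scan is finished; ExecShutdownGather() will
     * clean it all up at that point.
     */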

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] TAP test example

2015-11-19 Thread Jim Nasby

On 11/19/15 9:57 AM, Tom Lane wrote:

>I wrote this example based on ssl TAP test. There was no DBI there. And I
>think it was done for purpose. If we add DBI to tests, then we should add it
>to build dependencies. And it is not a good idea, and so not a good example.

Agreed.  We aren't going to accept any core tests that depend on DBI/DBD.
Now, that might be a fine example for tests written to test something that
uses Postgres ... but not as an example of how to write a core test.


Isn't one of the goals of the TAP framework to be able to write SQL 
level tests where we don't have to worry about random output changes, 
like what line number on a script caused an error?

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SPI and transactions

2015-11-19 Thread Robert Haas
On Wed, Nov 18, 2015 at 5:18 AM, Konstantin Knizhnik
 wrote:
> Hello,
>
> SPI was originally developed for execution SQL statements from C user
> defined functions in context of existed transaction.
> This is why it is not possible to execute any transaction manipulation
> statement (BEGIN, COMMIT, PREPARE,...) using
> SPI_execute:SPI_ERROR_TRANSACTION is returned.
>
> But now SPI is used not only inside UDFs. It is also used in background
> workers. For example in receiver_raw, written by Michael Paquier (I lot of
> thanks Michael, understand logical replication without them will be much
> more difficult).
> Right now transactions have to be started by background worker using
> StartTransactionCommand().
> So receiver_raw is not able to preserve master's transaction semantic
> (certainly it can be implemented).
>
> I wonder whether SPI can be extended now to support transaction manipulation
> functions when been called outside transaction context? Or there are some
> principle problem with it?

I think SPI pretty fundamentally assumes we're inside a transaction,
and that we'll still be at the same transaction nesting depth when we
get done with SPI.  For example, SPI_connect() allocates memory in
TopTransactionContext.  So I doubt that it will work out well to try
to solve the problem you're aiming to fix in this particular way.
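
For reference, the pattern a background worker has to follow today looks
roughly like this (from memory of worker_spi, so a sketch rather than
gospel):

    StartTransactionCommand();
    SPI_connect();
    PushActiveSnapshot(GetTransactionSnapshot());

    /* run whatever SQL the worker wants to execute */
    ret = SPI_execute(sql, false, 0);
    if (ret < 0)
        elog(ERROR, "SPI_execute failed: error code %d", ret);

    SPI_finish();
    PopActiveSnapshot();
    CommitTransactionCommand();

A BEGIN or COMMIT arriving through SPI_execute() has no sensible place in
that nesting, which is presumably part of why SPI_ERROR_TRANSACTION is
returned.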

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GIN pending list clean up exposure to SQL

2015-11-19 Thread Jaime Casanova
On 19 November 2015 at 14:18, Alvaro Herrera  wrote:
> Alvaro Herrera wrote:
>> Jeff Janes wrote:
>> > I've written a function which allows users to clean up the pending list.
>> > It takes the index name and returns the number of pending list pages
>> > deleted.
>>
>> I just noticed that your patch uses AccessShareLock on the index.  Is
>> that okay?  I would have assumed that you'd need ShareUpdateExclusive
>> (same as vacuum uses), but I don't really know.  Was that a carefully
>> thought-out choice?
>
> After reading gitPendingCleanup it becomes clear that there's no need
> for a stronger lock than what you've chosen.  Jaime Casanova just
> pointed this out to me.
>

But it should do some checks, no?
- only superusers?
- is what I received as a parameter actually a GIN index?
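
Roughly something like this near the top of the function, I'd think (a
hypothetical sketch, not code from the patch; it assumes the index OID has
already been resolved from the name):

    Relation    indexRel;

    if (!superuser())
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                 errmsg("must be superuser to clean up a GIN pending list")));

    indexRel = index_open(indexoid, AccessShareLock);
    if (indexRel->rd_rel->relam != GIN_AM_OID)
        ereport(ERROR,
                (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                 errmsg("\"%s\" is not a GIN index",
                        RelationGetRelationName(indexRel))));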

-- 
Jaime Casanova  www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench unusable after crash during pgbench

2015-11-19 Thread Tom Lane
Thom Brown  writes:
> On 19 November 2015 at 16:11, Robert Haas  wrote:
>> The only explanation I can think of here is that pgbench on startup
>> queries one of the tables to figure out the scale factor, and it seems
>> to be coming up with scaling factor 0, suggesting that the table was
>> perhaps truncated.  If the tables are unlogged, that's expected.
>> Otherwise, it sounds like a serious bug in recovery.

> Actually yes, that's something I appear to have omitted.  I was using
> unlogged tables, which makes sense now.

> Even so, the errors generated are not at all helpful, and I would
> expect pgbench to handle this case explicitly.

Meh ... it's not very clear how to improve that.  The ":scale" variable is
set from "select count(*) from pgbench_branches"; it's not immediately
obvious that zero should not be an allowed result.  Then the :scale value
is used as the upper limit in a \setrandom script command, and the
complaint about that seems fairly on point.

I do agree that pgbench could do more in the way of showing you the script
command (and line number, maybe) that failed.  Patches welcome.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

2015-11-19 Thread Mike Blackwell
On Thu, Nov 19, 2015 at 10:05 AM, Robert Haas  wrote:

> On Wed, Nov 18, 2015 at 10:21 AM, Alvaro Herrera
>  wrote:
> > In my days of Perl, it was starting to become frowned upon to call
> > subroutines without parenthesizing arguments.  Is that no longer the
> > case?
>

​As I understand it, there are several reasons not to make function calls
in Perl without parentheses.  Whether they are good reasons is a question
for the user.  Modern Perl  chapter
5 covers most of them.

__
Mike Blackwell | Technical Analyst, Distribution Services/Rollout
Management | RR Donnelley
1750 Wallace Ave | St Charles, IL 60174-3401
Office: 630.313.7818
mike.blackw...@rrd.com
http://www.rrdonnelley.com





Re: [HACKERS] proposal: LISTEN *

2015-11-19 Thread Marko Tiikkaja

On 11/19/15 5:32 PM, Tom Lane wrote:

Marko Tiikkaja  writes:

On 11/19/15 4:21 PM, Tom Lane wrote:

... and then you gotta get the notifications to the clients, so it seems
like this just leaves the performance question hanging.



I'm not sure which performance question you think is left hanging.  If
every client is connected to postgres, you're waking up tens if not
hundreds of processes tens if not hundreds of times per second.  Each of
them starts a transaction, checks which notifications are visible in the
queue from clog and friends, goes through the tails of every other process
to see whether it should advance the tail pointer of the queue, commits
the transaction and goes back to sleep only to be immediately woken up again.



If they're not connected to postgres directly, this kind of complex
processing only happens once, and then the notification server just
unconditionally serves all notifications to the clients based on a
simple map lookup.  It should be trivial to see how the overhead is avoided.


Meh.  You've gotten rid of one single-point-of-contention by creating
another one.  Essentially, your argument why this is an improvement is
that any random application developer using Postgres is smarter than
the Postgres development team and can therefore produce something
better-performing off the cuff.


It's not about who's smarter and who's not.  All a notification server 
has to do is wait for notifications incoming from one socket and make 
sure they're delivered to the correct sockets in an epoll loop.  There's 
only one process being woken up, no need to synchronize with other 
processes, no need to see where the pointers of other processes are, no 
overhead of transaction visibility, transaction BEGIN, or transaction 
COMMIT.  The total amount of work done is trivially smaller.



Which may indeed be true, but shall we
say it's unproven?


Well it's not proof, but every time we moved an application from 
LISTENing in postgres to the notification server our CPU usage halved. 
The last time we did that (from ~100 connections to exactly 1), s_lock 
went down from 30% to 0.16% in our CPU profiles, and our CPU usage is 
only a fraction of what it used to be.  And this is with the 
notification server running on the same server with postgres, so it's 
not cheating, either.


There's no way we could keep running postgres if all 400+ clients 
interested in notifications had a LISTEN open in the database.



.m


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench unusable after crash during pgbench

2015-11-19 Thread Thom Brown
On 19 November 2015 at 16:11, Robert Haas  wrote:
> On Thu, Nov 19, 2015 at 6:03 AM, Thom Brown  wrote:
>> I'm using git master, and if I crash the database whilst it's running
>> pgbench, then restart the database and try to run pgbench again, I
>> can't:
>>
>> thom@swift:~/Development/postgresql$ pgbench -c 1 -j 1 -T 20 -S pgbench
>> ...crash database...
>> connection to database "pgbench" failed:
>> could not connect to server: Connection refused
>> Is the server running locally and accepting
>> connections on Unix domain socket "/tmp/.s.PGSQL.5488"?
>> thom@swift:~/Development/postgresql$ pg_ctl start
>> pg_ctl: another server might be running; trying to start server anyway
>> server starting
>> thom@swift:~/Development/postgresql$ pgbench -c 1 -j 1 -T 20 -S pgbench
>> starting vacuum...end.
>> setrandom: \setrandom maximum is less than minimum
>> client 0 aborted in state 1; execution of meta-command failed
>> transaction type: SELECT only
>> scaling factor: 0
>> query mode: simple
>> number of clients: 1
>> number of threads: 1
>> duration: 20 s
>> number of transactions actually processed: 0
>>
>> I can otherwise use the database without issue.
>
> The only explanation I can think of here is that pgbench on startup
> queries one of the tables to figure out the scale factor, and it seems
> to be coming up with scaling factor 0, suggesting that the table was
> perhaps truncated.  If the tables are unlogged, that's expected.
> Otherwise, it sounds like a serious bug in recovery.

Actually yes, that's something I appear to have omitted.  I was using
unlogged tables, which makes sense now.

Even so, the errors generated are not at all helpful, and I would
expect pgbench to handle this case explicitly.

Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GIN pending list clean up exposure to SQL

2015-11-19 Thread Jaime Casanova
On 19 November 2015 at 14:47, Jaime Casanova
 wrote:
> On 19 November 2015 at 14:18, Alvaro Herrera  wrote:
>> Alvaro Herrera wrote:
>>> Jeff Janes wrote:
>>> > I've written a function which allows users to clean up the pending list.
>>> > It takes the index name and returns the number of pending list pages
>>> > deleted.
>>>
>>> I just noticed that your patch uses AccessShareLock on the index.  Is
>>> that okay?  I would have assumed that you'd need ShareUpdateExclusive
>>> (same as vacuum uses), but I don't really know.  Was that a carefully
>>> thought-out choice?
>>
>> After reading gitPendingCleanup it becomes clear that there's no need
>> for a stronger lock than what you've chosen.  Jaime Casanova just
>> pointed this out to me.
>>
>
> But it should do some checks, no?
> - only superusers?
> - what i received as parameter is a GIN index?
>

I just noticed this:

+   ginInsertCleanup(, true, );

ginInsertCleanup() now has four parameters, so you should update the call

-- 
Jaime Casanova  www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Additional role attributes && superuser review

2015-11-19 Thread David Steele
On 11/19/15 2:13 AM, Michael Paquier wrote:
> On Thu, Nov 19, 2015 at 7:10 AM, Stephen Frost wrote:
>> * Michael Paquier (michael.paqu...@gmail.com) wrote:
>>> It seems weird to not have a dedicated role for pg_switch_xlog.
>>
>> I didn't add a pg_switch_xlog default role in this patch series, but
>> would be happy to do so if that's the consensus.  It's quite easy to do.
> 
> Agreed. I am not actually getting why that's part of the backup
> actually. That would be more related to archiving, both being
> unrelated concepts. But at this point I guess that's mainly a
> philosophical split.

I wouldn't say that backup and archiving are unrelated since backups
aren't valid without the proper set of WAL segments.

When configuring archiving/backup I use pg_switch_xlog() to verify that
WAL segments are getting to the archive.

I also use pg_switch_xlog() after pg_create_restore_point() to force the
WAL segment containing the restore point record to the archive.
Otherwise it might not be possible (or at least not easy) to recover to
the restore point in case of a problem.  The required WAL segment is
likely to get pushed to the archive before it is needed but I would
rather not bet on that.

-- 
-David
da...@pgmasters.net





Re: [HACKERS] warning: HS_KEY redefined (9.5 beta2)

2015-11-19 Thread Alvaro Herrera
Erik Rijkers wrote:

> In file included from ../../../src/include/postgres.h:48:0,
>  from plperl.c:8:
> plperl.c: In function ‘select_perl_context’:
> ../../../src/include/utils/elog.h:41:16: warning: passing argument 1 of
> ‘errmsg’ makes pointer from integer without a cast [-Wint-conversion]

Uh, this code has evidently never been compiled before, because it's
completely bogus:

#else
errmsg(ERROR,
   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot allocate multiple Perl 
interpreters on this platform")));
#endif

It's recent breakage though, introduced in 09cecdf285ea.  Need to
s/errmsg/ereport/ ..

> make contrib:
> In file included from hstore_plperl.c:6:0:
> ../../contrib/hstore/hstore.h:79:0: warning: "HS_KEY" redefined
>  #define HS_KEY(arr_,str_,i_) ((str_) + HSE_OFF((arr_)[2*(i_)]))
>  ^
> In file included from
> /opt/perl-5.22/lib/5.22.0/x86_64-linux/CORE/perl.h:3730:0,
>  from ../../src/pl/plperl/plperl.h:48,
>  from hstore_plperl.c:4:
> /opt/perl-5.22/lib/5.22.0/x86_64-linux/CORE/util.h:226:0: note: this is the
> location of the previous definition
>  #  define HS_KEY(setxsubfn, popmark, apiver, xsver) \
>  ^

So we need to get this one fixed.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Refactoring of LWLock tranches

2015-11-19 Thread Ildus Kurbangaliev
Thank you for the review. I've made changes according to your comments.
I'm not attached to the current names in the patch.

I've removed all unnecessary indirections and added cache-line alignment
to the LWLock arrays.

On Tue, 17 Nov 2015 19:36:13 +0100
"and...@anarazel.de"  wrote:

> On 2015-11-17 14:14:50 +0300, Ildus Kurbangaliev wrote:
> > 1) We can avoid constants, and use a standard steps for tranches
> > creation.  
> 
> The constants are actually a bit useful, to easily determine
> builtin/non-builtin stuff.

Maybe I'm missing something here, but I don't see much difference from the
other tranches created at Postgres startup. In my opinion they are also
pretty much built-in.

> 
> > 2) We have only one global variable (BufferCtl)  
> 
> Don't see the advantage. This adds another, albeit predictable,
> indirection in frequent callsites. There's no architectural advantage
> in avoiding these.
> 
> > 3) Tranches moved to shared memory, so we won't need to do
> > an additional work here.  
> 
> I can see some benefit, but it also doesn't seem like a huge benefit.

Moving the base tranches to shared memory has been discussed many times.
The point is to use them later in pg_stat_activity and other monitoring
views.

> 
> 
> > 4) Also we can kept buffer locks from MainLWLockArray there (that
> > was done in attached patch).  
> 
> That's the 'blockLocks' thing? That's a pretty bad name. These are
> buffer mapping locks. And this seems like a separate patch, separately
> debated.

Changed to BufMappingPartitionLWLocks. If this patch is all about
separating LWLocks of the buffer manager, why not use one patch for this
task?

> 
> > +   if (!foundCtl)
> > {
> > -   int i;
> > +   /* Align descriptors to a cacheline boundary. */
> > +   ctl->base.bufferDescriptors = (void *)
> > CACHELINEALIGN(ShmemAlloc(
> > +   NBuffers * sizeof(BufferDescPadded) +
> > PG_CACHE_LINE_SIZE)); +
> > +   ctl->base.bufferBlocks = (char *)
> > ShmemAlloc(NBuffers * (Size) BLCKSZ);  
> 
> I'm going to entirely veto this.
> 
> > +   strncpy(ctl->IOLWLockTrancheName, "buffer_io",
> > +   BUFMGR_MAX_NAME_LENGTH);
> > +   strncpy(ctl->contentLWLockTrancheName,
> > "buffer_content",
> > +   BUFMGR_MAX_NAME_LENGTH);
> > +   strncpy(ctl->blockLWLockTrancheName,
> > "buffer_blocks",
> > +   BUFMGR_MAX_NAME_LENGTH);
> > +
> > +   ctl->IOLocks = (LWLockMinimallyPadded *)
> > ShmemAlloc(
> > +   sizeof(LWLockMinimallyPadded) *
> > NBuffers);  
> 
> This should be cacheline aligned.

Fixed

> 
> > -   buf->io_in_progress_lock = LWLockAssign();
> > -   buf->content_lock = LWLockAssign();
> > +
> > LWLockInitialize(BufferDescriptorGetContentLock(buf),
> > +   ctl->contentLWLockTrancheId);
> > +   LWLockInitialize(>IOLocks[i].lock,
> > +   ctl->IOLWLockTrancheId);  
> 
> Seems weird to use the macro accessing content locks, but not IO
> locks.

Fixed

> 
> >  /* Note: these two macros only work on shared buffers, not local
> > ones! */ -#define BufHdrGetBlock(bufHdr)((Block)
> > (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ)) +#define
> > BufHdrGetBlock(bufHdr)  ((Block) (BufferCtl->bufferBlocks +
> > ((Size) (bufHdr)->buf_id) * BLCKSZ))  
> 
> That's the additional indirection I'm talking about.
> 
> > @@ -353,9 +353,6 @@ NumLWLocks(void)
> > /* Predefined LWLocks */
> > numLocks = NUM_FIXED_LWLOCKS;
> >  
> > -   /* bufmgr.c needs two for each shared buffer */
> > -   numLocks += 2 * NBuffers;
> > -
> > /* proc.c needs one for each backend or auxiliary process
> > */ numLocks += MaxBackends + NUM_AUXILIARY_PROCS;  
> 
> Didn't you also move the buffer mapping locks... ?
> 
> > diff --git a/src/include/storage/buf_internals.h
> > b/src/include/storage/buf_internals.h index 521ee1c..e1795dc 100644
> > --- a/src/include/storage/buf_internals.h
> > +++ b/src/include/storage/buf_internals.h
> > @@ -95,6 +95,7 @@ typedef struct buftag
> > (a).forkNum == (b).forkNum \
> >  )
> >  
> > +
> >  /*  
> 
> unrelated change.
> 
> >   * The shared buffer mapping table is partitioned to reduce
> > contention.
> >   * To determine which partition lock a given tag requires, compute
> > the tag's @@ -104,10 +105,9 @@ typedef struct buftag
> >  #define BufTableHashPartition(hashcode) \
> > ((hashcode) % NUM_BUFFER_PARTITIONS)
> >  #define BufMappingPartitionLock(hashcode) \
> > -   ([BUFFER_MAPPING_LWLOCK_OFFSET + \
> > -   BufTableHashPartition(hashcode)].lock)
> > +   (&((BufferCtlData
> > *)BufferCtl)->blockLocks[BufTableHashPartition(hashcode)].lock)
> > #define BufMappingPartitionLockByIndex(i) \
> > -   ([BUFFER_MAPPING_LWLOCK_OFFSET + (i)].lock)
> > +   (&((BufferCtlData *)BufferCtl)->blockLocks[i].lock)  
> 
> Again, unnecessary indirections.

Fixed

> 

[HACKERS] [PROPOSAL] TAP test example

2015-11-19 Thread Nikolay Shaplov

Don't we need a simple example of TAP tests in src/test? Something that tests 
a trivial feature, but shows basic testing tricks?

While explaining to my friends how TAP tests work, I wrote an example TAP 
test. Maybe we should add it to the core with a sensible README explanation?

-- 
Nikolay Shaplov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Companydiff --git a/src/test/Makefile b/src/test/Makefile
index b713c2c..a45129c 100644
--- a/src/test/Makefile
+++ b/src/test/Makefile
@@ -17,7 +17,7 @@ SUBDIRS = regress isolation modules
 # We don't build or execute examples/, locale/, or thread/ by default,
 # but we do want "make clean" etc to recurse into them.  Likewise for ssl/,
 # because the SSL test suite is not secure to run on a multi-user system.
-ALWAYS_SUBDIRS = examples locale thread ssl
+ALWAYS_SUBDIRS = examples locale thread ssl tap-examples
 
 # We want to recurse to all subdirs for all standard targets, except that
 # installcheck and install should not recurse into the subdirectory "modules".
diff --git a/src/test/tap-examples/Makefile b/src/test/tap-examples/Makefile
new file mode 100644
index 000..e0d6e10
--- /dev/null
+++ b/src/test/tap-examples/Makefile
@@ -0,0 +1,7 @@
+
+subdir = src/test/tap-examples
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+check:
+	$(prove_check)
diff --git a/src/test/tap-examples/TapExamples.pm b/src/test/tap-examples/TapExamples.pm
new file mode 100644
index 000..80fe13b
--- /dev/null
+++ b/src/test/tap-examples/TapExamples.pm
@@ -0,0 +1,34 @@
+package TapExamples;
+
+use strict;
+use TestLib;
+use Test::More;
+
+use Exporter 'import';
+our @EXPORT = qw(
+  psql_ok psql_fails
+);
+
+sub psql_ok
+{
+	my $db = shift;
+	my $sql = shift;
+	my $comment = shift;
+	my $res;
+	eval {$res = psql($db,$sql)};
+	$res = 0 if $@;
+	ok($res, $comment);
+}
+
+sub psql_fails
+{
+	my $db = shift;
+	my $sql = shift;
+	my $comment = shift;
+	my $res;
+	eval {$res = psql($db,$sql)};
+	$res = 0 if $@;
+	ok(!$res, $comment);
+}
+
+1;
diff --git a/src/test/tap-examples/t/001_sameuser_test.pl b/src/test/tap-examples/t/001_sameuser_test.pl
new file mode 100644
index 000..b7236ae
--- /dev/null
+++ b/src/test/tap-examples/t/001_sameuser_test.pl
@@ -0,0 +1,36 @@
+# Minimal example test exercising pg_hba.conf 'sameuser' authentication
+use strict;
+use warnings;
+use TestLib;
+use Test::More "no_plan";
+use TapExamples;
+
+my $tempdir = TestLib::tempdir;
+#my $tempdir = 'my_tmp';
+
+diag "setting up data directory in \"$tempdir\"...";
+
+my $current_user = (getpwuid($>))[0];
+my $db1 = $current_user;
+my $db2 = "${current_user}_another_db";
+
+diag "Running postgres as user '$current_user' with default configuration";
+start_test_server($tempdir);
+
+psql_ok('postgres', "CREATE DATABASE $db1", "Creating DB '$db1'");
+psql_ok('postgres', "CREATE DATABASE $db2", "Creating DB '$db2'");
+
+psql_ok($db1, "select now()", "Connecting to '$db1'");
+psql_ok($db2, "select now()", "Connecting to '$db2'");
+
+diag "Updateing pg_hba.conf, setting 'local sameuser all trust'";
+open HBA, ">$tempdir/pgdata/pg_hba.conf";
+print HBA "# TYPE  DATABASEUSERADDRESS METHOD\n";
+print HBA "local   sameuserall trust \n";
+close HBA;
+
+diag "Restarting postgres";
+restart_test_server();
+
+psql_ok($db1, "select now()", "Connecting to '$db1'");
+psql_fails($db2, "select now()", "Should fail when connecting to '$db2'");

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-11-19 Thread Masahiko Sawada
On Thu, Nov 19, 2015 at 5:54 AM, Jeff Janes  wrote:
> On Wed, Nov 18, 2015 at 11:18 AM, Jeff Janes  wrote:
>>
>> I get an error when running pg_upgrade from 9.4 to 9.6-this
>>
>> error while copying relation "mediawiki.archive"
>> ("/tmp/data/base/16414/21043_vm" to
>> "/tmp/data_fm/base/16400/21043_vm"): No such file or directory
>
> OK, so the problem seems to be that rewriteVisibilitymap can get
> called with errno already set to a nonzero value.
>
> It never clears it, and then fails at the end despite that no error
> has actually occurred.
>
> Just setting it to 0 at the top of the function seems to be correct
> thing to do.  Or does it need to save the old value and restore it?

Thank you for testing!
I think that the former is better, so I've attached the latest patch.
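
In other words, the fix amounts to just this at the top of
rewriteVisibilitymap() (sketch only; the rest of the function is as in the
patch):

    /* clear any stale errno left over from earlier work */
    errno = 0;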

> But now when I want to do the upgrade faster, I run into this:
>
> "This utility cannot upgrade from PostgreSQL version from 9.5 or
> before to 9.6 or later with link mode."
>
> Is this really an acceptable a tradeoff?  Surely we can arrange to
> link everything else and rewrite just the _vm, which is a tiny portion
> of the data directory.  I don't think that -k promises to link
> everything it possibly can.

I agree.
I've changed the patch so that pg_upgrade creates a new _vm file and
rewrites it even when upgrading to 9.6 with link mode.

Regards,

--
Masahiko Sawada


000_add_frozen_bit_into_visibilitymap_v25.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers