Re: [HACKERS] pgbench small bug fix

2016-02-07 Thread Fabien COELHO




Hello Robert,


> While testing for something else I encountered two small bugs under very low
> rate (--rate=0.1). The attached patches fixes these.
> 
>  - when a duration (-T) is specified, ensure that pgbench ends at that

>time (i.e. do not wait for a transaction beyond the end of the run).

Why does this use INSTR_TIME_GET_DOUBLE() and not INSTR_TIME_GET_MICROSEC()?


I choose that because duration is in seconds, but MICROSEC would work fine as
well. See attached version.


Also, why do we really need this change?  Won't the timer expiration
stop the thread at the right time anyway?


Not always. When several threads are used, the time expiration in non-zero
threads is currently triggered after the last schedule transaction, even if this
transaction is long after the expiration, which means it runs overtime:

 sh> time pgbench -T 5 -R 0.1 -P 1 -c 2 -j 2
 starting vacuum...end.
 progress: 1.0 s, 0.0 tps, lat -nan ms stddev -nan, lag -nan ms
 progress: 2.0 s, 1.0 tps, lat 12.129 ms stddev 0.000, lag 1.335 ms
 progress: 3.0 s, 0.0 tps, lat -nan ms stddev -nan, lag -nan ms
 progress: 4.0 s, 0.0 tps, lat -nan ms stddev -nan, lag -nan ms
 progress: 5.0 s, 0.0 tps, lat -nan ms stddev -nan, lag -nan ms
 < 6 seconds wait for a schedule transaction in thread 1
   no report because thread 0 expired...>
 transaction type: TPC-B (sort of)
 scaling factor: 1
 query mode: simple
 number of clients: 2
 number of threads: 2
 duration: 5 s
 number of transactions actually processed: 2
 latency average: 14.648 ms
 latency stddev: 2.518 ms
 rate limit schedule lag: avg 5.598 (max 9.861) ms
 tps = 0.180517 (including connections establishing)
 tps = 0.180566 (excluding connections establishing)

 real0m11.158s
   ^^ 11 seconds, not 5.
 user0m0.041s
 sys 0m0.013s


I mean, sure, in theory it's wasteful for the thread to sit around doing
nothing waiting for the timer to expire, but it's not evident to me that hurts
anything, really.


I compute stats on the output, including the progress report, to check for
responsiveness (eg it is not stuck somewhere because of a checkpoint which
triggered some IO storm or whatever), for that purpose it is better for the
trace to be there as expected.


>  - when there is a progress (-P) report, ensure that all progress
>reports are shown even if no more transactions are schedule.

That's pretty ugly - it would be easy for the test at the top of the
loop to be left out of sync with the similar test inside the loop by
some future patch.


Hmmm. Cannot help it.

A cleaner fix would be to have the main thread do only the progress and periodic
stats aggregation, while other threads would do actual transactions, however
that would break pgbench working without threads, so I think this is a no go.


And again, I wonder why this is really a bug.


Well, if you are fine to set "-T 5 -P 1" and the run not to last 5 seconds nor
print any report every second, then it is not a bug:

 sh> time ./pgbench -T 5 -P 1 -R 0.1 -c 2 -j 2
 starting vacuum...end.
 < UNLUCKY, NO PROGRESS REPORT AT ALL...>
 transaction type: 
 scaling factor: 1
 query mode: simple
 number of clients: 2
 number of threads: 2
 duration: 5 s
 number of transactions actually processed: 1
 latency average = 16.198 ms
 latency stddev = 0.000 ms
 rate limit schedule lag: avg 4.308 (max 4.308) ms
 tps = 0.236361 (including connections establishing)
 tps = 0.236771 (excluding connections establishing)

 real0m4.256s

--
Fabien.diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 7eb6a2d..08d358f 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -1285,6 +1285,12 @@ top:
 		thread->throttle_trigger += wait;
 		st->txn_scheduled = thread->throttle_trigger;
 
+		/* stop client if next transaction is beyond pgbench end of execution */
+		if (duration &&
+			st->txn_scheduled >
+			INSTR_TIME_GET_MICROSEC(thread->start_time) + (int64) 100 * duration)
+			return false;
+
 		/*
 		 * If this --latency-limit is used, and this slot is already late so
 		 * that the transaction will miss the latency limit even if it
@@ -3640,7 +3646,10 @@ threadRun(void *arg)
 		}
 	}
 
-	while (remains > 0)
+	while (remains > 0 ||
+		   /* thread zero keeps on printing progress report if any */
+		   (progress && thread->tid == 0 && duration &&
+			next_report <= thread_start + (int64) duration * 100))
 	{
 		fd_set		input_mask;
 		int			maxsock;	/* max socket number to be waited */

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgcommitfest reply having corrupted subject line

2016-02-07 Thread Magnus Hagander
On Fri, Feb 5, 2016 at 3:44 AM, Noah Misch  wrote:

> On Thu, Feb 04, 2016 at 09:19:19AM +0100, Magnus Hagander wrote:
> > On Thu, Feb 4, 2016 at 7:26 AM, Noah Misch  wrote:
> > > The following message, which bears "User-Agent: pgcommitfest",
> > >
> > >
> > >
> http://www.postgresql.org/message-id/flat/20160202164101.1291.30526.p...@coridan.postgresql.org
> > >
> > > added spaces after commas in its subject line, compared to the subject
> > > line of
> > > its In-Reply-To message.
> > >
> > >  new subject line: Re: Add generate_series(date, date) and
> > > generate_series(date, date, integer)
> > > previous subject line: Re: Add generate_series(date,date) and
> > > generate_series(date,date,integer)
> > >
> > > I see no way to manually alter the subject line from the
> > > commitfest.postgresql.org review UI.  Is commitfest.postgresql.org
> > > initiating
> > > this change internally?
> > >
> >
> > That is weird.
> >
> > The CF app certainly doesn't do that intentionally - it copies it from
> the
> > archives. And if I look in the db on the cf app, it has the subject
> without
> > the space as the field that it copied :O
> >
> > The code is line 355-358 at:
> >
> http://git.postgresql.org/gitweb/?p=pgcommitfest2.git;a=blob;f=pgcommitfest/commitfest/views.py;h=b4f19b2db470e5269943418b2402ca9ddfff0dff;hb=fec3b2431730c131a206a170a99a7610cdbacc6b#l355
> >
> > the "subject" field in the db that we copy does not have the spaces... I
> > honestly have no idea where they are coming from :O I'm guessing it must
> be
> > something internally in the python libraries that create the MIME.
>
> So it is.  The problem happens when email.mime.text.MIMEText wraps the
> subject
> line; see attached test program.
>

Ouch. So that's basically a bug in the python standard libraries :( It
looks a lot like this one, doesn't it? https://bugs.python.org/issue25257



> > Have you
> > seen this with any other messages, that you can recall, or just this one?
>
> Just this one.  However, no other hackers subject line since 2015-01-01
> contains a comma followed by something other than a space.
>

I wonder in that case if it's worth trying to do something about that in
our code, or just leave it be.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] proposal: make NOTIFY list de-duplication optional

2016-02-07 Thread Filip Rembiałkowski
On Sun, Feb 7, 2016 at 11:54 AM, Vik Fearing  wrote:
> On 02/07/2016 03:42 AM, Filip Rembiałkowski wrote:

> You left the duplicate behavior with subtransactions, but didn't mention
> it in the documentation.  If I do  NOTIFY DISTINCT chan, 'msg';  then I
> expect only distinct notifications but I'll get duplicates if I'm in a
> subtransaction.  Either the documentation or the code needs to be fixed.

agreed

>
> I seem to remember some discussion about not using DEFAULT parameters in
> system functions so you should leave the old function alone and create a
> new function with your use_all parameter.  I don't recall the exact
> reason why so hopefully someone else will enlighten me.

I'm quite new to this; how do I pinpoint proper OID for a new catalog
object (function, in this case)?
Is there a better way than browsing fmgr files and guessing next available oid?

>
> There is also no mention in the documentation about what happens if I do:
>
> NOTIFY ALL chan, 'msg';
> NOTIFY ALL chan, 'msg';
> NOTIFY DISTINCT chan, 'msg';
> NOTIFY ALL chan, 'msg';
>
> Without testing, I'd say I'd get two messages, but it should be
> explicitly mentioned somewhere.

If it's four separate transactions, LISTEN'er should get four.
If it's in one transaction,  LISTEN'er should get three.










> --
> Vik Fearing  +33 6 46 75 15 36
> http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: make NOTIFY list de-duplication optional

2016-02-07 Thread Vik Fearing
On 02/07/2016 04:00 PM, Filip Rembiałkowski wrote:
> On Sun, Feb 7, 2016 at 11:54 AM, Vik Fearing  wrote:
>> I seem to remember some discussion about not using DEFAULT parameters in
>> system functions so you should leave the old function alone and create a
>> new function with your use_all parameter.  I don't recall the exact
>> reason why so hopefully someone else will enlighten me.
> 
> I'm quite new to this; how do I pinpoint proper OID for a new catalog
> object (function, in this case)?
> Is there a better way than browsing fmgr files and guessing next available 
> oid?

There is a shell script called `unused_oids` in src/include/catalog/.

>> There is also no mention in the documentation about what happens if I do:
>>
>> NOTIFY ALL chan, 'msg';
>> NOTIFY ALL chan, 'msg';
>> NOTIFY DISTINCT chan, 'msg';
>> NOTIFY ALL chan, 'msg';
>>
>> Without testing, I'd say I'd get two messages, but it should be
>> explicitly mentioned somewhere.
> 
> If it's four separate transactions, LISTEN'er should get four.

The question was for one transaction, I should have been clearer about that.

> If it's in one transaction,  LISTEN'er should get three.

This is surprising to me, I would think it would get only two.  What is
your rationale for three?

Compare with the behavior of:

select 1
union all
select 1
union distinct
select 1
union all
select 1;
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2016-02-07 Thread Peter Geoghegan
On Fri, Feb 5, 2016 at 9:31 AM, Robert Haas  wrote:
> Peter, please weigh in and let me know if I've gotten anything
> incorrect here or if you think of other concerns afterwards.

Right. Let me give you the executive summary first: I continue to
believe, following thinking about the matter in detail, that this is a
sensible compromise, that weighs everyone's concerns. It is pretty
close to a win-win. I just need you to confirm what I say here in
turn, so we're sure that we understand each other perfectly.

> The basic idea is that we will add a new GUC with a name like
> replacement_sort_mem that will have a default value in the range of
> 20-30MB; or possibly we will hardcode this value, but for purposes of
> this email I'm going to assume it's a GUC.  If the value of work_mem
> or maintenance_work_mem, whichever applies, is smaller than the value
> of replacement_sort_mem, then the latter has no effect.

By "no effect", you must mean that we always use a heap for the entire
first run (albeit for the tail, with a hybrid quicksort/heap
approach), but still use quicksort for every subsequent run, when it's
clearly established that we aren't going to get one huge run. Is that
correct?

It was my understanding, based on your emphasis on producing only a
single run, as well as your recent remarks on this thread about the
first run being special, that you are really only interested in the
presorted case, where one run is produced. That is, you are basically
not interested in preserving the general ability of replacement
selection to double run size in the event of a uniform distribution.
(That particular doubling property of replacement selection is now
technically lost by virtue of using this new hybrid model *anyway*,
although it will still make runs longer in general).

You don't want to change the behavior of the current patch for the
second or subsequent run; that should remain a quicksort, pure and
simple. Do I have that right?

BTW, parallel sort should probably never use a heap anyway (ISTM that
that will almost certainly be based on external sorts in the end). A
heap is not really compatible with the parallel heap scan model.

> One thing I just thought of (after the call) is that it might be
> better for this GUC to be in units of tuples rather than in units of
> memory; it's not clear to me why the optimal heap size should be
> dependent on the tuple size, so we could have a threshold like 300,000
> tuples or whatever.

I think you're right that a number of tuples is the logical way to
express the heap size (as a GUC unit). I think that the ideal setting
for the GUC is large enough to recognize significant correlations in
input data, which may be clustered, but no larger (at least while
things don't all fit in L1 cache, or maybe L2 cache). We should "go
for broke" with replacement selection -- we don't aim for anything
less than ending up with 1 run by using the heap (merging 2 or 3 runs
rather than 4 or 6 is far less useful, maybe harmful, when one of them
is much larger). Therefore, I don't expect that we'll be practically
disadvantaged by having fewer "hands to juggle" tuples here (we'll
simply almost always have enough in practice -- more on that later).
FWIW I don't think that any benchmark we've seen so far justifies
doing less than "going for broke" with RS, even if you happen to have
a very conservative perspective.

One advantage of a GUC is that you can set it to zero, and always get
a simple hybrid sort-merge strategy if that's desirable. I think that
it might not matter much with multi-gigabyte work_mem settings anyway,
though; you'll just see a small blip. Big (maintenance_)work_mem was
by far my greatest concern in relation to using a heap in general, so
I'm left pretty happy by this plan, I think. Lots of people can afford
a multi-GB maintenance_work_mem these days, and CREATE INDEX is gonna
be the most important case overall, by far.

> 2. If (maintenance_)work_mem fills up completely, we will quicksort
> all of the data we have in memory.  We will then regard the tail end
> of that sorted data, in an amount governed by replacement_sort_mem, as
> a heap, and use it to perform replacement selection until no tuples
> remain for the current run.  Meanwhile, the rest of the sorted data
> remains in memory untouched.  Logically, we're constructing a run of
> tuples which is split between memory and disk: the head of the run
> (what fits in all of (maintenance_)work_mem except for
> replacement_sort_mem) is in memory, and the tail of the run is on
> disk.

I went back and forth on this during our call, but I now think that I
was right that there will need to be changes in order to make the tail
of the run a heap (*not* the quicksorted head), because routines like
tuplesort_heap_siftup() assume that state->memtuples[0] is the head of
the heap. This is currently assumed by the master branch for both the
currentRun/nextRun replacement selection heap, as well as the 

[HACKERS] Does plpython support threading?

2016-02-07 Thread Jinhua Luo
Hi,

In my plpython function, I create new thread to do some routine works.
But I found the thread doesn't work, it would run a while, but
suddenly pause and stop working. For example, I write a test dead loop
to print something to file, but it would stop printing after some
time.

I am wondering whether plpython supports threading or not.

Regards,
Jinhua Luo


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-07 Thread Tomas Vondra



On 02/06/2016 10:22 PM, Tom Lane wrote:

Tomas Vondra  writes:

What about using the dense allocation even for the skew buckets,
but not one context for all skew buckets but one per bucket? Then
when we delete a bucket, we simply destroy the context (and free
the chunks, just like we do with the current dense allocator).


Yeah, I was wondering about that too, but it only works if you have
quite a few tuples per skew bucket, else you waste a lot of space.
And you were right upthread that what we're collecting is keys
expected to be common in the outer table, not the inner table. So
it's entirely likely that the typical case is just one inner tuple
per skew bucket. (Should check that out though ...)


I'd argue this is true for vast majority of joins, because the joins 
tend to be on foreign keys with the "dimension" as the inner table, thus 
having exactly one row per skew bucket. The upper bound for number of 
skew buckets is the statistics target (i.e. max number of MCV items). So 
either 100 (default) or possibly up to 1 (max).


For tuples wider than 8kB, we have no problem at all because those 
allocations will be treated as separate chunks and will be freed() 
immediately, making the memory reusable for the dense allocator. If the 
tuples are narrower than 8kB, we get a rather limited amount of memory 
in the skew hash (800kB / 80MB in the extreme cases with the max number 
of MCV items).


So perhaps in this case we don't really need to worry about the 
accounting and memory usage too much.


That of course does not mean we should not try to do better in cases 
when the number of tuples per skew bucket really is high. No doubt we 
can construct such joins. If we could estimate the number of tuples per 
skew bucket, that'd allow us to handle this case differently.


FWIW there was a patch from David some time ago, identifying "unique" 
joins where the join key is unique in the inner relation. That might be 
useful for this, I guess.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: make NOTIFY list de-duplication optional

2016-02-07 Thread Vik Fearing
On 02/07/2016 03:42 AM, Filip Rembiałkowski wrote:
> +1
> 
> ... and a patch (only adding ALL keyword, no hash table implemented yet).

Please stop top-posting, it's very disruptive.  My comments are below,
where they belong.

> On Sat, Feb 6, 2016 at 2:35 PM, Brendan Jurd  wrote:
>> On Sat, 6 Feb 2016 at 12:50 Tom Lane  wrote:
>>>
>>> Robert Haas  writes:
 I agree with what Merlin said about this:

 http://www.postgresql.org/message-id/CAHyXU0yoHe8Qc=yc10ahu1nfia1tbhsg+35ds-oeueuapo7...@mail.gmail.com
>>>
>>> Yeah, I agree that a GUC for this is quite unappetizing.
>>
>>
>> How would you feel about a variant for calling NOTIFY?
>>
>> The SQL syntax could be something like "NOTIFY [ALL] channel, payload" where
>> the ALL means "just send the notification already, nobody cares whether
>> there's an identical one in the queue".
>>
>> Likewise we could introduce a three-argument form of pg_notify(text, text,
>> bool) where the final argument is whether you are interested in removing
>> duplicates.
>>
>> Optimising the remove-duplicates path is still probably a worthwhile
>> endeavour, but if the user really doesn't care at all about duplication, it
>> seems silly to force them to pay any performance price for a behaviour they
>> didn't want, no?

On 02/07/2016 03:42 AM, Filip Rembiałkowski wrote:
> +1
>
> ... and a patch (only adding ALL keyword, no hash table implemented yet).


I only read through the patch, I didn't compile it or test it, but I
have a few comments:

You left the duplicate behavior with subtransactions, but didn't mention
it in the documentation.  If I do  NOTIFY DISTINCT chan, 'msg';  then I
expect only distinct notifications but I'll get duplicates if I'm in a
subtransaction.  Either the documentation or the code needs to be fixed.

I seem to remember some discussion about not using DEFAULT parameters in
system functions so you should leave the old function alone and create a
new function with your use_all parameter.  I don't recall the exact
reason why so hopefully someone else will enlighten me.

There is also no mention in the documentation about what happens if I do:

NOTIFY ALL chan, 'msg';
NOTIFY ALL chan, 'msg';
NOTIFY DISTINCT chan, 'msg';
NOTIFY ALL chan, 'msg';

Without testing, I'd say I'd get two messages, but it should be
explicitly mentioned somewhere.
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recently added typedef "string" is a horrid idea

2016-02-07 Thread Robert Haas
On Sun, Feb 7, 2016 at 8:03 AM, Peter Geoghegan  wrote:
> On Sat, Feb 6, 2016 at 2:11 PM, Tom Lane  wrote:
>> Remember that the effects of typedef names are
>> *global*, so far as pgindent is concerned; not only varlena.c will
>> be affected.
>
> I'll remember that in the future.
>
>> Please rename this typedef with some less-generic name.  Probably
>> some of the other identifiers added in the same commit should be
>> adjusted to match.
>
> I suggest "VarString". All the text SortSupport routines were renamed
> to match a pattern of "varstr.*" as part of the commit you mention.
>
> The implication that was intended by the rename is that the relevant
> routines are responsible for about the same cases as the cases handled
> by varstr_cmp(). I tend to mostly think of the text type when looking
> at varstr_cmp(), but it's also used by jsonb, for example, as well as
> char(n). It has a broader purpose; it is used by collatable types
> generally. So, a rename to "VarString" probably makes sense,
> independent of your pgindent concern.
>
> If this sounds like a good idea, I'll produce a patch soon.

VarString is OK with me - I'm not personally wedded to any specific
proposal here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Set search_path + server-prepared statements = cached plan must not change result type

2016-02-07 Thread Robert Haas
On Wed, Jan 27, 2016 at 10:18 PM, David G. Johnston
 wrote:
> Prepare creates a plan and a plan has a known output structure.  What you
> want is an ability to give a name to a parsed but unplanned query.  This is
> not something that prepare should do as it is not a natural extension of its
> present responsibility.

The distinction you're talking about here actually does exist at the
Protocol level.  You can send a Parse message to create a prepared
statement (which is parsed but unplanned), a Bind message to create a
portal (which is planned), and then you can send an Execute message to
execute a previously-created portal.

However, I'm not really sure this helps.  It seems like what Vladimir
wants is basically automatic plan caching.  He wants the server to
re-parse-analyze and re-plan the statement any time that would produce
a different outcome, but ideally also consider holding onto the old
plan in case the search_path or whatever is switched back.  I gather
that the reason he wants to use prepared statements at all is just to
minimize parse-plan overhead.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recently added typedef "string" is a horrid idea

2016-02-07 Thread Peter Geoghegan
On Sat, Feb 6, 2016 at 2:11 PM, Tom Lane  wrote:
> Remember that the effects of typedef names are
> *global*, so far as pgindent is concerned; not only varlena.c will
> be affected.

I'll remember that in the future.

> Please rename this typedef with some less-generic name.  Probably
> some of the other identifiers added in the same commit should be
> adjusted to match.

I suggest "VarString". All the text SortSupport routines were renamed
to match a pattern of "varstr.*" as part of the commit you mention.

The implication that was intended by the rename is that the relevant
routines are responsible for about the same cases as the cases handled
by varstr_cmp(). I tend to mostly think of the text type when looking
at varstr_cmp(), but it's also used by jsonb, for example, as well as
char(n). It has a broader purpose; it is used by collatable types
generally. So, a rename to "VarString" probably makes sense,
independent of your pgindent concern.

If this sounds like a good idea, I'll produce a patch soon.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Idle In Transaction Session Timeout, revived

2016-02-07 Thread Craig Ringer
On 4 February 2016 at 09:04, Robert Haas  wrote:


> > I have the same uneasy feeling about it as JD.  However, you could
> > certainly argue that if the client application has lost its marbles
> > to the extent of allowing a transaction to time out, there's no good
> > reason to suppose that it will wake up any time soon, ...
>
> That's exactly what I think.  If you imagine a user who starts a
> transaction and then leaves for lunch, aborting the transaction seems
> nicer than killing the connection.  But what I think really happens is
> some badly-written Java application loses track of a connection
> someplace and just never finds it again.  Again, I'm not averse to
> having both behavior someday, but my gut feeling is that killing the
> connection will be the more useful one.
>


Applications - and users - must be prepared for the fact that uncommitted
data and session state may be lost at any time. The fact that PostgreSQL
tries not to lose it is quite nice, but gives people a false sense of
security too. Someone trips over a cable, a carrier has a bit of a BGP
hiccup, a NAT conntrack timeout occurs, there's an ECC parity check error
causing a proc kill ... your state can go away.

If you really don't want your session terminated, don't set an idle in
transaction session idle timeout (or override it).

(In some ways I think we're too good at this; I really should write an
extension that randomly aborts some low percentage of xacts with fake
deadlocks or serialization failures and randomly kills occasional
connections so that apps actually use their retry paths...)

Sure, it'd be *nice* to just terminate the xact and have a separate param
for timing out idle sessions whether or not they're in an xact. Cleaner -
terminate the xact if there's an xact-related timeout, terminate the
session if there's a session-related timeout. But nobody's written that
patch and this proposal solves a real world problem well enough.
Terminating the xact without terminating the session is a little tricky as
noted earlier so it's not a simple change to switch to that.

I'd be happy to have this. I won't mind having it if we eventually add an
idle_xact_timeout and idle_session_timeout in 9.something too.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Parallel Aggregate

2016-02-07 Thread Robert Haas
On Sun, Jan 24, 2016 at 7:56 PM, Haribabu Kommi
 wrote:
> On Sat, Jan 23, 2016 at 12:59 PM, Haribabu Kommi
>  wrote:
>> Here I attached updated patch with additional combine function for
>> two stage aggregates also.
>
> A wrong combine function was added in pg_aggregate.h in the earlier
> patch that leading to
> initdb problem. Corrected one is attached.

I'm not entirely sure I know what's going on here, but I'm pretty sure
that it makes no sense for the new float8_pl function to reject
non-aggregate callers at the beginning and then have a comment at the
end indicating what it does when not invoked as an aggregate.
Similarly for the other new function.

It would be a lot more clear what this patch was trying to accomplish
if the new functions had header comments explaining their purpose -
not what they do, but why they exist.

float8_regr_pl is labeled in pg_proc.h as an aggregate transition
function, but I'm wondering if it should say combine function.

The changes to pg_aggregate.h include a large number of
whitespace-only changes which are unacceptable.  Please change only
the lines that need to be changed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recently added typedef "string" is a horrid idea

2016-02-07 Thread Tom Lane
Peter Geoghegan  writes:
> On Sat, Feb 6, 2016 at 2:11 PM, Tom Lane  wrote:
>> Please rename this typedef with some less-generic name.  Probably
>> some of the other identifiers added in the same commit should be
>> adjusted to match.

> I suggest "VarString". All the text SortSupport routines were renamed
> to match a pattern of "varstr.*" as part of the commit you mention.

Works for me.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] evaluating target list in parallel workers

2016-02-07 Thread Robert Haas
On Thu, Oct 22, 2015 at 2:49 PM, Robert Haas  wrote:
> On Thu, Oct 22, 2015 at 1:38 PM, Tom Lane  wrote:
>> Um ... why would you not want the projections to happen in the child
>> nodes, where they could be parallelized?  Or am I missing something?
>
> You probably would, but sometimes that might not be possible; for
> example, the tlist might contain a parallel-restricted function (which
> therefore has to run in the leader).

And here's a (rather small) patch to push target-list evaluation down
to the worker whenever possible.  With this, if you do something like
...

SELECT aid + bid FROM pgbench_accounts WHERE aid % 1 = 0;

...then aid + bid can be calculated in the workers rather than in the
leader.  Unfortunately, if you do this...

SELECT sum(aid + bid) FROM pgbench_accounts WHERE aid % 1 = 0;

...then it doesn't help, because make_subplanTargetList() thinks it
may as well just pull aid and bid out of the join nest and then do the
sum during the aggregation stage.  Presumably if we get the parallel
aggregation stuff working then this will get fixed, because there will
be a Partial Aggregate step before the Gather.  But whether that gets
into 9.6 or not, this seems like a useful step forward.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index a3cc274..3252904 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1985,6 +1985,22 @@ grouping_planner(PlannerInfo *root, double tuple_fraction)
 	 * the desired tlist.
 	 */
 	result_plan->targetlist = sub_tlist;
+
+	/*
+	 * If the plan happens to be a Gather plan, and the
+	 * subplan is projection-capable, then insert the tlist
+	 * there, too, so that the workers can help with
+	 * projection.  But if there is something that is not
+	 * parallel-safe in the tlist, then we can't.
+	 */
+	if (IsA(result_plan, Gather))
+	{
+		Plan	   *outer_plan = outerPlan(result_plan);
+
+		if (is_projection_capable_plan(outer_plan) &&
+			!has_parallel_hazard((Node *) sub_tlist, false))
+			outer_plan->targetlist = sub_tlist;
+	}
 }
 
 /*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] remove wal_level archive

2016-02-07 Thread Peter Eisentraut
On 1/26/16 10:56 AM, Simon Riggs wrote:
> Removing one of "archive" or "hot standby" will just cause confusion and
> breakage, so neither is a good choice for removal.
> 
> What we should do is 
> 1. Map "archive" and "hot_standby" to one level with a new name that
> indicates that it can be used for both/either backup or replication.
>   (My suggested name for the new level is "replica"...)
> 2. Deprecate "archive" and "hot_standby" so that those will be removed
> in a later release.

Updated patch to reflect these suggestions.

From a1df946bc77cb51ca149a52276175a001942d8d0 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sun, 7 Feb 2016 10:09:05 -0500
Subject: [PATCH v2] Merge wal_level "archive" and "hot_standby" into new name
 "replica"

The distinction between "archive" and "hot_standby" existed only because
at the time "hot_standby" was added, there was some uncertainty about
stability.  This is now a long time ago.  We would like to move forward
with simplifying the replication configuration, but this distinction is
in the way, because a primary server cannot tell (without asking a
standby or predicting the future) which one of these would be the
appropriate level.

Pick a new name for the combined setting to make it clearer that it
covers all (non-logical) backup and replication uses.  The old values
are still accepted but are converted internally.
---
 doc/src/sgml/backup.sgml  |  2 +-
 doc/src/sgml/config.sgml  | 30 +++
 doc/src/sgml/high-availability.sgml   |  2 +-
 doc/src/sgml/ref/alter_system.sgml|  2 +-
 doc/src/sgml/ref/pgupgrade.sgml   |  2 +-
 src/backend/access/rmgrdesc/xlogdesc.c|  5 +++--
 src/backend/access/transam/xact.c |  2 +-
 src/backend/access/transam/xlog.c | 20 --
 src/backend/access/transam/xlogfuncs.c|  2 +-
 src/backend/postmaster/postmaster.c   |  2 +-
 src/backend/replication/slot.c|  5 -
 src/backend/utils/misc/postgresql.conf.sample |  2 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |  2 +-
 src/bin/pg_controldata/pg_controldata.c   |  6 ++
 src/bin/pg_rewind/RewindTest.pm   |  2 +-
 src/include/access/xlog.h | 11 +-
 src/include/catalog/pg_control.h  |  2 +-
 17 files changed, 42 insertions(+), 57 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 7413666..46017a6 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1285,7 +1285,7 @@ Standalone Hot Backups
   If more flexibility in copying the backup files is needed, a lower
   level process can be used for standalone hot backups as well.
   To prepare for low level standalone hot backups, set wal_level to
-  archive or higher, archive_mode to
+  replica or higher, archive_mode to
   on, and set up an archive_command that performs
   archiving only when a switch file exists.  For example:
 
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 392eb70..bed7436 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1971,9 +1971,9 @@ Settings
 wal_level determines how much information is written
 to the WAL. The default value is minimal, which writes
 only the information needed to recover from a crash or immediate
-shutdown. archive adds logging required for WAL archiving;
-hot_standby further adds information required to run
-read-only queries on a standby server; and, finally
+shutdown. replica adds logging required for WAL
+archiving as well as information required to run
+read-only queries on a standby server.  Finally,
 logical adds information necessary to support logical
 decoding.  Each level includes the information logged at all lower
 levels.  This parameter can only be set at server start.
@@ -1991,30 +1991,24 @@ Settings
  transaction
 
 But minimal WAL does not contain enough information to reconstruct the
-data from a base backup and the WAL logs, so archive or
+data from a base backup and the WAL logs, so replica or
 higher must be used to enable WAL archiving
 () and streaming replication.


-In hot_standby level, the same information is logged as
-with archive, plus information needed to reconstruct
-the status of running transactions from the WAL. To enable read-only
-queries on a standby server, wal_level must be set to
-hot_standby or higher on the primary, and
- must be enabled in the standby. It is
-thought that there is little measurable difference in performance
-between using hot_standby and archive levels,
-so feedback is welcome if any production impacts are noticeable.
- 

Re: [HACKERS] Parallel Aggregate

2016-02-07 Thread Robert Haas
 On Thu, Jan 21, 2016 at 11:25 PM, Haribabu Kommi
 wrote:
>  [ new patch ]

This patch contains a number of irrelevant hunks that really ought not
to be here and make the patch harder to understand, like this:

-* Generate appropriate target list for
scan/join subplan; may be
-* different from tlist if grouping or
aggregation is needed.
+* Generate appropriate target list for
subplan; may be different from
+* tlist if grouping or aggregation is needed.

Please make a habit of getting rid of that sort of thing before submitting.

Generally, I'm not quite sure I understand the code here.  It seems to
me that what we ought to be doing is that grouping_planner, right
after considering using a presorted path (that is, just after the if
(sorted_path) block between lines 1822-1850), ought to then consider
using a partial path.  For the moment, it need not consider the
possibility that there may be a presorted partial path, because we
don't have any way to generate those yet.  (I have plans to fix that,
but not in time for 9.6.)  So it can just consider doing a Partial
Aggregate on the cheapest partial path using an explicit sort, or
hashing; then, above the Gather, it can finalize either by hashing or
by sorting and grouping.

The trick is that there's no path representation of an aggregate, and
there won't be until Tom finishes his upper planner path-ification
work.  But it seems to me we can work around that.  Set best_path to
the cheapest partial path, add a partial aggregate rather than a
regular one around where it says "Insert AGG or GROUP node if needed,
plus an explicit sort step if necessary", and then push a Gather node
and a Finalize Aggregate onto the result.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2016-02-07 Thread Greg Stark
On Sun, Feb 7, 2016 at 8:21 PM, Robert Haas  wrote:
> On Sun, Feb 7, 2016 at 11:00 AM, Peter Geoghegan  wrote:
> > It was my understanding, based on your emphasis on producing only a
> > single run, as well as your recent remarks on this thread about the
> > first run being special, that you are really only interested in the
> > presorted case, where one run is produced. That is, you are basically
> > not interested in preserving the general ability of replacement
> > selection to double run size in the event of a uniform distribution.
>...
> > You don't want to change the behavior of the current patch for the
> > second or subsequent run; that should remain a quicksort, pure and
> > simple. Do I have that right?
>
> Yes.

I'm not even sure this is necessary. The idea of missing out on
producing a single sorted run sounds bad but in practice since we
normally do the final merge on the fly there doesn't seem like there's
really any difference between reading one tape or reading two or three
tapes when outputing the final results. There will be the same amount
of I/O happening and a 2-way or 3-way merge for most data types should
be basically free.



On Sun, Feb 7, 2016 at 8:21 PM, Robert Haas  wrote:
> 3. At this point, we have one sorted tape per worker.  Perform a final
> merge pass to get the final result.

I don't even think you have to merge until you get one tape per
worker. You can statically decide how many tapes you can buffer in
memory based on work_mem and merge until you get N/workers tapes so
that a single merge in the gather node suffices. I would expect that
to nearly always mean the workers are only responsible for generating
the initial sorted runs and the single merge pass is done in the
gather node on the fly as the tuples are read.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Does plpython support threading?

2016-02-07 Thread Tom Lane
Jinhua Luo  writes:
> In my plpython function, I create new thread to do some routine works.
> But I found the thread doesn't work, it would run a while, but
> suddenly pause and stop working. For example, I write a test dead loop
> to print something to file, but it would stop printing after some
> time.

> I am wondering whether plpython supports threading or not.

Creating multiple threads inside a Postgres backend is playing with fire.
You can make it work if you're very very careful, but what's more likely
to happen is you get burned.  None of the backend code is multithread
safe, so you must absolutely ensure that control never gets out of
libpython except in the main thread.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgcommitfest reply having corrupted subject line

2016-02-07 Thread Noah Misch
On Sun, Feb 07, 2016 at 03:49:47PM +0100, Magnus Hagander wrote:
> On Fri, Feb 5, 2016 at 3:44 AM, Noah Misch  wrote:
> > On Thu, Feb 04, 2016 at 09:19:19AM +0100, Magnus Hagander wrote:

> > So it is.  The problem happens when email.mime.text.MIMEText wraps the
> > subject line; see attached test program.
> 
> Ouch. So that's basically a bug in the python standard libraries :( It
> looks a lot like this one, doesn't it? https://bugs.python.org/issue25257
> 
> > > Have you
> > > seen this with any other messages, that you can recall, or just this one?
> >
> > Just this one.  However, no other hackers subject line since 2015-01-01
> > contains a comma followed by something other than a space.
> 
> I wonder in that case if it's worth trying to do something about that in
> our code, or just leave it be.

In light of those facts, I would leave it.  Thanks for tracking down the
Python bug.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-07 Thread Robert Haas
On Thu, Feb 4, 2016 at 11:34 PM, Kouhei Kaigai  wrote:
> I can agree that smgr hooks shall be primarily designed to make storage
> systems pluggable, even if we can use this hooks for suspend & resume of
> write i/o stuff.
> In addition, "pluggable storage" is a long-standing feature, even though
> it is not certain whether existing smgr hooks are good starting point.
> It may be a risk if we implement a grand feature on top of the hooks
> but out of its primary purpose.
>
> So, my preference is a mechanism to hook buffer write to implement this
> feature. (Or, maybe a built-in write i/o suspend / resume stuff if it
> has nearly zero cost when no extension activate the feature.)
> One downside of this approach is larger number of hook points.
> We have to deploy the hook nearby existing smgrwrite of LocalBufferAlloc
> and FlushRelationBuffers, in addition to FlushBuffer, at least.

I don't understand what you're hoping to achieve by introducing
pluggability at the smgr layer.  I mean, md.c is pretty much good for
read and writing from anything that looks like a directory of files.
Another smgr layer would make sense if we wanted to read and write via
some kind of network protocol, or if we wanted to have some kind of
storage substrate that did internally to itself some of the tasks for
which we are currently relying on the filesystem - e.g. if we wanted
to be able to use a raw device, or perhaps more plausibly if we wanted
to reduce the number of separate files we need, or provide a substrate
that can clip an unused extent out of the middle of a relation
efficiently.  But I don't understand what this has to do with what
you're trying to do here.  The subject of this thread is about whether
you can check for the presence of a block in shared_buffers, and as
discussed upthread, you can.  I don't quite follow how we made the
jump from there to smgr pluggability.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-07 Thread Robert Haas
On Sat, Feb 6, 2016 at 12:47 PM, Tom Lane  wrote:
> So I'm of the opinion that a great deal more work is needed here.
> But it's not something we're going to be able to get done for 9.5.1,
> or realistically 9.5.anything.  Whereas adding additional klugery to
> ExecHashRemoveNextSkewBucket probably is doable over the weekend.

Tom, are you taking care of this, or should I be getting involved here?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-07 Thread Tom Lane
Robert Haas  writes:
> On Sat, Feb 6, 2016 at 12:47 PM, Tom Lane  wrote:
>> So I'm of the opinion that a great deal more work is needed here.
>> But it's not something we're going to be able to get done for 9.5.1,
>> or realistically 9.5.anything.  Whereas adding additional klugery to
>> ExecHashRemoveNextSkewBucket probably is doable over the weekend.

> Tom, are you taking care of this, or should I be getting involved here?

I'm on it, so far as fixing the stop-ship bug is concerned.  I was not
volunteering to do the larger rewrite.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2016-02-07 Thread Robert Haas
On Sun, Feb 7, 2016 at 11:00 AM, Peter Geoghegan  wrote:
> Right. Let me give you the executive summary first: I continue to
> believe, following thinking about the matter in detail, that this is a
> sensible compromise, that weighs everyone's concerns. It is pretty
> close to a win-win. I just need you to confirm what I say here in
> turn, so we're sure that we understand each other perfectly.

Makes sense to me.

>> The basic idea is that we will add a new GUC with a name like
>> replacement_sort_mem that will have a default value in the range of
>> 20-30MB; or possibly we will hardcode this value, but for purposes of
>> this email I'm going to assume it's a GUC.  If the value of work_mem
>> or maintenance_work_mem, whichever applies, is smaller than the value
>> of replacement_sort_mem, then the latter has no effect.
>
> By "no effect", you must mean that we always use a heap for the entire
> first run (albeit for the tail, with a hybrid quicksort/heap
> approach), but still use quicksort for every subsequent run, when it's
> clearly established that we aren't going to get one huge run. Is that
> correct?

Yes.

> It was my understanding, based on your emphasis on producing only a
> single run, as well as your recent remarks on this thread about the
> first run being special, that you are really only interested in the
> presorted case, where one run is produced. That is, you are basically
> not interested in preserving the general ability of replacement
> selection to double run size in the event of a uniform distribution.
> (That particular doubling property of replacement selection is now
> technically lost by virtue of using this new hybrid model *anyway*,
> although it will still make runs longer in general).
>
> You don't want to change the behavior of the current patch for the
> second or subsequent run; that should remain a quicksort, pure and
> simple. Do I have that right?

Yes.

> BTW, parallel sort should probably never use a heap anyway (ISTM that
> that will almost certainly be based on external sorts in the end). A
> heap is not really compatible with the parallel heap scan model.

I don't think I agree with this part, though I think it's unimportant
as far as the current patch is concerned.  My initial thought is that
parallel sort should work like this:

1. Each worker reads and sorts its input tuples just as it would in
non-parallel mode.

2. If, at the conclusion of the sort, the input tuples are still in
memory (quicksort) or partially in memory (quicksort with spillover),
then write them all to a tape.  If they are on multiple tapes, merge
those to a single tape.  If they are on a single tape, do nothing else
at this step.

3. At this point, we have one sorted tape per worker.  Perform a final
merge pass to get the final result.

The major disadvantage of this is that if the input hasn't been
relatively evenly partitioned across the workers, the work of sorting
will fall disproportionately on those that got more input.  We could,
in the future, make the logic more sophisticated.  For example, if
worker A is still reading the input and dumping sorted runs, worker B
could start merging those runs.  Or worker A could read tuples into a
DSM instead of backend-private memory, and worker B could then sort
them to produce a run.  While such optimizations are clearly
beneficial, I would not try to put them into a first parallel sort
patch.  It's too complicated.

>> One thing I just thought of (after the call) is that it might be
>> better for this GUC to be in units of tuples rather than in units of
>> memory; it's not clear to me why the optimal heap size should be
>> dependent on the tuple size, so we could have a threshold like 300,000
>> tuples or whatever.
>
> I think you're right that a number of tuples is the logical way to
> express the heap size (as a GUC unit). I think that the ideal setting
> for the GUC is large enough to recognize significant correlations in
> input data, which may be clustered, but no larger (at least while
> things don't all fit in L1 cache, or maybe L2 cache). We should "go
> for broke" with replacement selection -- we don't aim for anything
> less than ending up with 1 run by using the heap (merging 2 or 3 runs
> rather than 4 or 6 is far less useful, maybe harmful, when one of them
> is much larger). Therefore, I don't expect that we'll be practically
> disadvantaged by having fewer "hands to juggle" tuples here (we'll
> simply almost always have enough in practice -- more on that later).
> FWIW I don't think that any benchmark we've seen so far justifies
> doing less than "going for broke" with RS, even if you happen to have
> a very conservative perspective.
>
> One advantage of a GUC is that you can set it to zero, and always get
> a simple hybrid sort-merge strategy if that's desirable. I think that
> it might not matter much with multi-gigabyte work_mem settings anyway,
> though; you'll just see a small blip. Big 

Re: [HACKERS] First-draft release notes for next week's back-branch releases

2016-02-07 Thread Tom Lane
Michael Paquier  writes:
> Bruce is the main author of this patch. I used what he did as a base
> to build a version correct for MSVC.

Fixed, thanks.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2016-02-07 Thread Kouhei Kaigai
> On Wed, Feb 3, 2016 at 11:57 PM, Kouhei Kaigai  wrote:
> > At this moment, I tried to write up description at nodes/nodes.h.
> > The amount of description is about 100lines. It is on a borderline
> > whether we split off this chunk into another header file, in my sense.
> >
> >
> > On the other hands, I noticed that we cannot omit checks for individual
> > callbacks on Custom node type, ExtensibleNodeMethods is embedded in
> > the CustomMethods structure, thus we may have Custom node with
> > no extensible feature.
> > This manner is beneficial because extension does not need to register
> > the library and symbol name for serialization. So, CustomScan related
> > code still checks existence of individual callbacks.
> 
> I was looking over this patch yesterday, and something was bugging me
> about it, but I couldn't quite put my finger on what it was.  But now
> I think I know.
> 
> I think of an extensible node type facility as something that ought to
> be defined to allow a user to create new types of nodes.  But this is
> not that.  What this does is allows you to have a CustomScan or
> ForeignScan node - that is, the existing node type - which is actually
> larger than a normal node of that type and has some extra data that is
> part of it.  I'm having a really hard time being comfortable with that
> concept.  Somehow, it seems like the node type should determine the
> size of the node.  I can stretch my brain to the point of being able
> to say, well, maybe if the node tag is T_ExtensibleNode, then you can
> look at char *extnodename to figure out what type of node it is
> really, and then from there get the size.  But what this does is:
> every time you see a T_CustomScan or T_ForeignScan node, it might not
> really be that kind of node but something else else, and tomorrow
> there might be another half-dozen node types with a similar property.
> And every one of those node types will have char *extnodename in a
> different place in the structure, so a hypothetical piece of code that
> wanted to find the extension methods for a node, or the size of a
> node, would need a switch that knows about all of those node types.
> It feels very awkward.
> 
> So I have a slightly different proposal.  Let's forget about letting
> T_CustomScan or T_ForeignScan or any other built-in node type vary in
> size.  Instead, let's add T_ExtensibleNode which acts as a completely
> user-defined node.  It has read/out/copy/equalfuncs support along the
> lines of what this patch implements, and that's it.  It's not a scan
> or a path or anything like that: it's just an opaque data container
> that the system can input, output, copy, and test for equality, and
> nothing else.  Isn't that useless?  Not at all.  If you're creating an
> FDW or custom scan provider and want to store some extra data, you can
> create a new type of extensible node and stick it in the fdw_private
> or custom_private field!  The data won't be part of the CustomScan or
> ForeignScan structure itself, as in this patch, but who cares? The
> only objection I can see is that you still need several pointer
> deferences to get to the data since fdw_private is a List, but if
> that's actually a real performance problem we could probably fix it by
> just changing fdw_private to a Node instead.  You'd still need one
> pointer dereference, but that doesn't seem too bad.
> 
> Thoughts?
>
The new callbacks of T_ExtensibleNode will replace the necessity to
form and deform process of private values, like as:
  https://github.com/pg-strom/devel/blob/master/src/gpujoin.c#L114

It transforms a bunch of internal data of CustomScan (similar to the
extended fields in T_ExtensibleNode) to/from the node functions
understandable forms for copy, input and output support.
I think it implies you proposition is workable.

I'd like to follow this proposition basically.
On the other hands, I have two points I want to pay attention.

1. At this moment, it is allowed to define a larger structure that
embeds CustomPath and CustomScanState by extension. How do we treat
this coding manner in this case? Especially, CustomScanState has no
private pointer dereference because it assumes an extension define
a larger structure. Of course, we don't need to care about node
operations on Path and PlanState nodes, but right now.

2. I intended to replace LibraryName and SymbolName fields from the
CustomScanMethods structure by integration of extensible node type.
We had to give a pair of these identifiers because custom scan provider
has no registration points at this moment. A little concern is extension
has to assume a particular filename of itself.
But, probably, it shall be a separated discussion. My preference is
preliminary registration of custom scan provider by its name, as well
as extensible node.


Towards the last question; whether *_private shall be void * or List *,
I want to keep fdw_private and custom_private as List * pointer, because
a new 

Re: [HACKERS] GIN pending list clean up exposure to SQL

2016-02-07 Thread Jeff Janes
On Wed, Jan 27, 2016 at 12:54 PM, Julien Rouhaud
 wrote:
> On 27/01/2016 10:27, Fujii Masao wrote:
>>
>> Thanks for updating the patch! It looks good to me.
>>
>> Based on your patch, I just improved the doc. For example, I added
>> the following note into the doc.
>>
>> +These functions cannot be executed during recovery. +Use
>> of these functions is restricted to superusers and the owner +
>> of the given index.
>>
>> If there is no problem, I'm thinking to commit this version.
>>
>
> Just a detail:
>
> +Note that the cleanup does not happen and the return value is 0
> +if the argument is the GIN index built with fastupdate
> +option disabled because it doesn't have a pending list.
>
> It should be "if the argument is *a* GIN index"
>
> I find this sentence a little confusing, maybe rephrase like this would
> be better:
>
> -Note that the cleanup does not happen and the return value is 0
> -if the argument is the GIN index built with fastupdate
> -option disabled because it doesn't have a pending list.
> +Note that if the argument is a GIN index built with
> fastupdate
> +option disabled, the cleanup does not happen and the return value
> is 0
> +because the index doesn't have a pending list.
>
> Otherwise, I don't see any problem on this version.

This is a corner case that probably does not need to be in the docs,
but I wanted to clarify it here in case you disagree: If the index
ever had fastupdate turned on, when it is turned off the index will
keep whatever pending_list it had until something cleans it up one
last time.

Thanks for the review and changes and commit.

Cheers,

Jeff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Combining Aggregates

2016-02-07 Thread Haribabu Kommi
On Thu, Jan 21, 2016 at 3:42 PM, David Rowley
 wrote:
> On 21 January 2016 at 08:06, Robert Haas  wrote:
>>
>> On Wed, Jan 20, 2016 at 7:38 AM, David Rowley
>>  wrote:
>> > Agreed. So I've attached a version of the patch which does not have any
>> > of
>> > the serialise/deserialise stuff in it.
>>
>> I re-reviewed this and have committed most of it with only minor
>> kibitizing.  A few notes:
>
>
> I've attached the re-based remainder, which includes the serial/deserial
> again.
>
> I'll submit this part to March 'fest, where hopefully we'll also have
> something to utilise it.
>

While testing parallel aggregate with float4 and float8 types based on
the latest patch,
I found the following problems,

+ /*
+ * For partial aggregation we must adjust the return types of
+ * the Aggrefs
+ */
+ if (!aggplan->finalizeAggs)
+ set_partialagg_aggref_types(root, plan,
+ aggplan->serialStates);

[...]

+ aggref->aggtype = aggform->aggserialtype;
+ else
+ aggref->aggtype = aggform->aggtranstype;

Changing the aggref->aggtype with aggtranstype or aggserialtype will
only gets it changed in
partial aggregate plan, as set_upper_references starts from the top
plan and goes
further. Because of this reason, the targetlist contains for the node
below finalize
aggregate are still points to original type only.

To fix this problem, I tried updating the targetlist aggref->aggtype
with transtype during
aggregate plan itself, that leads to a problem in setting upper plan
references. This is
because, while fixing the aggregate reference of upper plans after
partial aggregate,
the aggref at upper plan nodes doesn't match with aggref that is
coming from partial
aggregate node because of aggtype difference in _equalAggref function.

COMPARE_SCALAR_FIELD(aggtype);

Temporarily i corrected it compare it against aggtranstype and
aggserialtype also then
it works fine. I don't see that change as correct approach. Do you
have any better idea
to solve this problem?

Regards,
Hari Babu
Fujitsu Australia


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: make NOTIFY list de-duplication optional

2016-02-07 Thread Filip Rembiałkowski
On Sun, Feb 7, 2016 at 4:37 PM, Vik Fearing  wrote:

>>> There is also no mention in the documentation about what happens if I do:
>>>
>>> NOTIFY ALL chan, 'msg';
>>> NOTIFY ALL chan, 'msg';
>>> NOTIFY DISTINCT chan, 'msg';
>>> NOTIFY ALL chan, 'msg';
>>>
>>> Without testing, I'd say I'd get two messages, but it should be
>>> explicitly mentioned somewhere.
>>
>> If it's four separate transactions, LISTEN'er should get four.
>
> The question was for one transaction, I should have been clearer about that.
>
>> If it's in one transaction,  LISTEN'er should get three.
>
> This is surprising to me, I would think it would get only two.  What is
> your rationale for three?
>

It is a single transaction, but four separate commands.

>>> NOTIFY ALL chan, 'msg';
-- send the message, save in the list/hash
>>> NOTIFY ALL chan, 'msg';
-- ALL was specified, send the message even if it is on the list/hash
>>> NOTIFY DISTINCT chan, 'msg';
-- default mode, skip message because it's in the list/hash
>>> NOTIFY ALL chan, 'msg';
-- ALL was specified, send the message even if it is hashed/saved


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2016-02-07 Thread Amit Langote

Hi Vinayak,

Thanks for updating the patch, a couple of comments:

On 2016/02/05 17:15, poku...@pm.nttdata.co.jp wrote:
> Hello,
> 
> Please find attached updated patch.
>> The point of having pgstat_report_progress_update_counter() is so that 
>> you can efficiently update a single counter without having to update 
>> everything, when only one counter has changed.  But here you are 
>> calling this function a whole bunch of times in a row, which 
>> completely misses the point - if you are updating all the counters, 
>> it's more efficient to use an interface that does them all at once 
>> instead of one at a time.
> 
> The pgstat_report_progress_update_counter() is called at appropriate places 
> in the attached patch.

+   charprogress_message[N_PROGRESS_PARAM][PROGRESS_MESSAGE_LENGTH];

[ ... ]

+   snprintf(progress_message[0], PROGRESS_MESSAGE_LENGTH, "%s", phase1);
+   pgstat_report_progress_update_message(0, progress_message);

[ ... ]

+   snprintf(progress_message[0], PROGRESS_MESSAGE_LENGTH, 
"%s", phase2);
+   pgstat_report_progress_update_message(0, 
progress_message);

Instead of passing the array of char *'s, why not just pass a single char
*, because that's what it's doing - updating a single message. So,
something like:

+ char progress_message[PROGRESS_MESSAGE_LENGTH];

[ ... ]

+ snprintf(progress_message, PROGRESS_MESSAGE_LENGTH, "%s", phase1);
+ pgstat_report_progress_update_message(0, progress_message);

[ ... ]

+ snprintf(progress_message, PROGRESS_MESSAGE_LENGTH, "%s", phase2);
+ pgstat_report_progress_update_message(0, progress_message);

And also:

+/*---
+ * pgstat_report_progress_update_message()-
+ *
+ *Called to update phase of VACUUM progress
+ *---
+ */
+void
+pgstat_report_progress_update_message(int index, char *msg)
+{

[ ... ]

+   pgstat_increment_changecount_before(beentry);
+   strncpy((char *)beentry->st_progress_message[index], msg,
PROGRESS_MESSAGE_LENGTH);
+   pgstat_increment_changecount_after(beentry);


One more comment:

@@ -1120,14 +1157,23 @@ lazy_scan_heap(Relation onerel, LVRelStats
*vacrelstats,
/* Log cleanup info before we touch indexes */
vacuum_log_cleanup_info(onerel, vacrelstats);

+   snprintf(progress_message[0], PROGRESS_MESSAGE_LENGTH, "%s", 
phase2);
+   pgstat_report_progress_update_message(0, progress_message);
/* Remove index entries */
for (i = 0; i < nindexes; i++)
+   {
lazy_vacuum_index(Irel[i],
  [i],
  vacrelstats);
+   scanned_index_pages += 
RelationGetNumberOfBlocks(Irel[i]);
+   /* Update the scanned index pages and number of index 
scan */
+   pgstat_report_progress_update_counter(3, 
scanned_index_pages);
+   pgstat_report_progress_update_counter(4, 
vacrelstats->num_index_scans
+ 1);
+   }
/* Remove tuples from heap */
lazy_vacuum_heap(onerel, vacrelstats);
vacrelstats->num_index_scans++;
+   scanned_index_pages = 0;

I guess num_index_scans could better be reported after all the indexes are
done, that is, after the for loop ends.

Thanks,
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2016-02-07 Thread Peter Geoghegan
On Sun, Feb 7, 2016 at 4:50 PM, Peter Geoghegan  wrote:
>> I'm not even sure this is necessary. The idea of missing out on
>> producing a single sorted run sounds bad but in practice since we
>> normally do the final merge on the fly there doesn't seem like there's
>> really any difference between reading one tape or reading two or three
>> tapes when outputing the final results. There will be the same amount
>> of I/O happening and a 2-way or 3-way merge for most data types should
>> be basically free.
>
> I basically agree with you, but it seems possible to fix the
> regression (generally misguided though those regressed cases are).
> It's probably easiest to just fix it.

On a related note, we should probably come up with a way of totally
supplanting the work_mem model with something smarter in the next
couple of years. Something that treats memory as a shared resource
even when it's allocated privately, per-process. This external sort
stuff really smooths out the cost function of sorts. ISTM that that
makes the idea of dynamic memory budgets (in place of a one size fits
all work_mem) seem desirable for the first time. That said, I really
don't have a good sense of how to go about moving in that direction at
this point. It seems less than ideal that DBAs have to be so
conservative in sizing work_mem.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Does plpython support threading?

2016-02-07 Thread Jinhua Luo
The new thread do not call any pg API or access any pg specific data.
The pg backend (main thread) puts string message into the python
Queue, and the new thread gets from the Queue and write to file, it's
simple and clear logic.

I test the same codes standalone, it works, but when I move them into
the plpython function. The new thread would be pause and seems freeze
(as said above, I even try to put a dead loop in the thread, it
doesn't run a dead loop actually).

So I guess it's somehow plpython specific limit?

Regards,
Jinhua Luo


2016-02-08 0:45 GMT+08:00 Tom Lane :
> Jinhua Luo  writes:
>> In my plpython function, I create new thread to do some routine works.
>> But I found the thread doesn't work, it would run a while, but
>> suddenly pause and stop working. For example, I write a test dead loop
>> to print something to file, but it would stop printing after some
>> time.
>
>> I am wondering whether plpython supports threading or not.
>
> Creating multiple threads inside a Postgres backend is playing with fire.
> You can make it work if you're very very careful, but what's more likely
> to happen is you get burned.  None of the backend code is multithread
> safe, so you must absolutely ensure that control never gets out of
> libpython except in the main thread.
>
> regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using quicksort for every external sort run

2016-02-07 Thread Peter Geoghegan
On Sun, Feb 7, 2016 at 10:51 AM, Greg Stark  wrote:
>> > You don't want to change the behavior of the current patch for the
>> > second or subsequent run; that should remain a quicksort, pure and
>> > simple. Do I have that right?
>>
>> Yes.
>
> I'm not even sure this is necessary. The idea of missing out on
> producing a single sorted run sounds bad but in practice since we
> normally do the final merge on the fly there doesn't seem like there's
> really any difference between reading one tape or reading two or three
> tapes when outputing the final results. There will be the same amount
> of I/O happening and a 2-way or 3-way merge for most data types should
> be basically free.

I basically agree with you, but it seems possible to fix the
regression (generally misguided though those regressed cases are).
It's probably easiest to just fix it.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Aggregate

2016-02-07 Thread Haribabu Kommi
On Mon, Feb 8, 2016 at 2:00 AM, Robert Haas  wrote:
> On Sun, Jan 24, 2016 at 7:56 PM, Haribabu Kommi
>  wrote:
>> On Sat, Jan 23, 2016 at 12:59 PM, Haribabu Kommi
>>  wrote:
>>> Here I attached updated patch with additional combine function for
>>> two stage aggregates also.
>>
>> A wrong combine function was added in pg_aggregate.h in the earlier
>> patch that leading to
>> initdb problem. Corrected one is attached.
>
> I'm not entirely sure I know what's going on here, but I'm pretty sure
> that it makes no sense for the new float8_pl function to reject
> non-aggregate callers at the beginning and then have a comment at the
> end indicating what it does when not invoked as an aggregate.
> Similarly for the other new function.
>
> It would be a lot more clear what this patch was trying to accomplish
> if the new functions had header comments explaining their purpose -
> not what they do, but why they exist.

I added some header comments explaining the need of these functions
and when they will be used? These combine functions are necessary
to float4 and float8 for parallel aggregation.

> float8_regr_pl is labeled in pg_proc.h as an aggregate transition
> function, but I'm wondering if it should say combine function.

corrected.

> The changes to pg_aggregate.h include a large number of
> whitespace-only changes which are unacceptable.  Please change only
> the lines that need to be changed.

I try to align the other rows according to the new combine function addition,
that leads to a white space problem, I will take care of such things in future.
Here I attached updated patch with the corrections.

Regards,
Hari Babu
Fujitsu Australia


additional_combine_fns_v3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Way to check whether a particular block is on the shared_buffer?

2016-02-07 Thread Kouhei Kaigai
> -Original Message-
> From: Robert Haas [mailto:robertmh...@gmail.com]
> Sent: Monday, February 08, 2016 1:52 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Jim Nasby; pgsql-hackers@postgresql.org; Amit Langote
> Subject: Re: [HACKERS] Way to check whether a particular block is
> on the shared_buffer?
> 
> On Thu, Feb 4, 2016 at 11:34 PM, Kouhei Kaigai  wrote:
> > I can agree that smgr hooks shall be primarily designed to make storage
> > systems pluggable, even if we can use this hooks for suspend & resume of
> > write i/o stuff.
> > In addition, "pluggable storage" is a long-standing feature, even though
> > it is not certain whether existing smgr hooks are good starting point.
> > It may be a risk if we implement a grand feature on top of the hooks
> > but out of its primary purpose.
> >
> > So, my preference is a mechanism to hook buffer write to implement this
> > feature. (Or, maybe a built-in write i/o suspend / resume stuff if it
> > has nearly zero cost when no extension activate the feature.)
> > One downside of this approach is larger number of hook points.
> > We have to deploy the hook nearby existing smgrwrite of LocalBufferAlloc
> > and FlushRelationBuffers, in addition to FlushBuffer, at least.
> 
> I don't understand what you're hoping to achieve by introducing
> pluggability at the smgr layer.  I mean, md.c is pretty much good for
> read and writing from anything that looks like a directory of files.
> Another smgr layer would make sense if we wanted to read and write via
> some kind of network protocol, or if we wanted to have some kind of
> storage substrate that did internally to itself some of the tasks for
> which we are currently relying on the filesystem - e.g. if we wanted
> to be able to use a raw device, or perhaps more plausibly if we wanted
> to reduce the number of separate files we need, or provide a substrate
> that can clip an unused extent out of the middle of a relation
> efficiently.  But I don't understand what this has to do with what
> you're trying to do here.  The subject of this thread is about whether
> you can check for the presence of a block in shared_buffers, and as
> discussed upthread, you can.  I don't quite follow how we made the
> jump from there to smgr pluggability.
>
Yes. smgr pluggability is not what I want to investigate in this thread.
It is not a purpose of discussion, but one potential "idea to implement".

Through the discussion, it became clear that extension can check existence
of buffer of a particular block, using existing infrastructure.

On the other hands, it also became clear we have to guarantee OS buffer
or storage block must not be updated partially during the P2P DMA.
My motivation is a potential utilization of P2P DMA of SSD-to-GPU to
filter out unnecessary rows and columns prior to loading to CPU/RAM.
It needs to ensure PostgreSQL does not write out buffers to OS buffers
to avoid unexpected data corruption.

What I want to achieve is suspend of buffer write towards a particular
(relnode, forknum, blocknum) pair for a short time, by completion of
data processing by GPU (or other external devices).
In addition, it is preferable being workable regardless of the choice
of storage manager, even if we may have multiple options on top of the
pluggable smgr in the future.

The data processing close to the storage needs OS buffer should not be
updated under the P2P DMA, concurrently. So, I want the feature below.
1. An extension (that controls GPU and P2P DMA) can register a particular
   (relnode, forknum, blocknum) pair as suspended block for write.
2. Once a particular block gets suspended, smgrwrite (or its caller) shall
   be blocked unless the above suspended block is not unregistered.
3. The extension will unregister when P2P DMA from the blocks get completed,
   then suspended concurrent backend shall be resumed to write i/o.
4. On the other hands, the extension cannot register the block if some
   other concurrent executes smgrwrite, to avoid potential data flaw.

One idea was injection of a thin layer on top of the smgr mechanism, to
implement the above mechanism.
However, I'm also uncertain whether injection to entire smgr hooks is
a straightforward approach to achieve it.

The minimum stuff I want is a facility to get a control at the head and tail
of smgrwrite() - to suspend the concurrent write prior to smgr_write, and to
notify the concurrent smgr_write gets completed for the mechanism.

It does not need pluggability of smgr, but entrypoint shall be located around
smgr functions.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GIN pending list clean up exposure to SQL

2016-02-07 Thread Fujii Masao
On Mon, Feb 8, 2016 at 9:53 AM, Jeff Janes  wrote:
> On Wed, Jan 27, 2016 at 12:54 PM, Julien Rouhaud
>  wrote:
>> On 27/01/2016 10:27, Fujii Masao wrote:
>>>
>>> Thanks for updating the patch! It looks good to me.
>>>
>>> Based on your patch, I just improved the doc. For example, I added
>>> the following note into the doc.
>>>
>>> +These functions cannot be executed during recovery. +Use
>>> of these functions is restricted to superusers and the owner +
>>> of the given index.
>>>
>>> If there is no problem, I'm thinking to commit this version.
>>>
>>
>> Just a detail:
>>
>> +Note that the cleanup does not happen and the return value is 0
>> +if the argument is the GIN index built with fastupdate
>> +option disabled because it doesn't have a pending list.
>>
>> It should be "if the argument is *a* GIN index"
>>
>> I find this sentence a little confusing, maybe rephrase like this would
>> be better:
>>
>> -Note that the cleanup does not happen and the return value is 0
>> -if the argument is the GIN index built with fastupdate
>> -option disabled because it doesn't have a pending list.
>> +Note that if the argument is a GIN index built with
>> fastupdate
>> +option disabled, the cleanup does not happen and the return value
>> is 0
>> +because the index doesn't have a pending list.
>>
>> Otherwise, I don't see any problem on this version.
>
> This is a corner case that probably does not need to be in the docs,
> but I wanted to clarify it here in case you disagree: If the index
> ever had fastupdate turned on, when it is turned off the index will
> keep whatever pending_list it had until something cleans it up one
> last time.

I agree to add that note to the doc. Or we should remove the above
description that I added?

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

2016-02-07 Thread Michael Paquier
On Sun, Feb 7, 2016 at 2:49 AM, Andres Freund  wrote:
> On 2016-02-06 22:03:15 +0900, Michael Paquier wrote:
>> The patch attached will apply on master, on 9.5 there is one minor
>> conflict. For older versions we will need another reworked patch.
>
> FWIW, I don't think we should backpatch this. It'd look noticeably
> different in the back branches, and this isn't really a critical
> issue. I think it makes sense to see this as an optimization.

The original bug report of this thread is based on 9.4, which is the
first version where the standby snapshot in the bgwriter has been
introduced. Knowing that most of the systems in the world are actually
let idle. If those are running Postgres, I'd like to think that it is
a waste. Now it is true that this is not a critical issue.

>> + /*
>> +  * Update the progress LSN positions. At least one WAL insertion lock
>> +  * is already taken appropriately before doing that, and it is just 
>> more
>> +  * simple to do that here where WAL record data and type is at hand.
>> +  * The progress is set at the start position of the record tracked that
>> +  * is being added, making easier checkpoint progress tracking as the
>> +  * control file already saves the start LSN position of the last
>> +  * checkpoint run.
>> +  */
>> + if (!isStandbySnapshot)
>> + {
>
> I don't like isStandbySnapshot much, it seems better to do this more
> generally, via a flag passed down by the inserter.

Hm, OK. I think that the best way to deal with that is to introduce
XLogInsertExtended which wraps the existing XLogInsert(). By default
the flag is true. This will reduce the footprint of this patch on the
code base and will ease future backpatches on those code paths at
least down to 9.5 where WAL refactoring has been introduced.

>> + if (holdingAllLocks)
>> + {
>> + int i;
>> +
>> + for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
>> + WALInsertLocks[i].l.progressAt = StartPos;
>
> Why update all?

For consistency. I agree that updating one, say the first one is
enough. I have added a comment explaining that as well.

>>  /*
>> + * GetProgressRecPtr -- Returns the newest WAL activity position, aimed
>> + * at the last significant WAL activity, or in other words any activity
>> + * not referring to standby logging as of now. Finding the last activity
>> + * position is done by scanning each WAL insertion lock by taking directly
>> + * the light-weight lock associated to it.
>> + */
>> +XLogRecPtr
>> +GetProgressRecPtr(void)
>> +{
>> + XLogRecPtr  res = InvalidXLogRecPtr;
>> + int i;
>> +
>> + /*
>> +  * Look at the latest LSN position referring to the activity done by
>> +  * WAL insertion.
>> +  */
>> + for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
>> + {
>> + XLogRecPtr  progress_lsn;
>> +
>> + LWLockAcquire([i].l.lock, LW_EXCLUSIVE);
>> + progress_lsn = WALInsertLocks[i].l.progressAt;
>> + LWLockRelease([i].l.lock);
>
> Add a comment explaining that we a) need a lock because of the potential
> for "torn reads" on some platforms. b) need an exclusive one, because
> the insert lock logic currently only expects exclusive locks.

Check.

>>   /*
>> +  * Fetch the progress position before taking any WAL insert lock. This
>> +  * is normally an operation that does not take long, but leaving this
>> +  * lookup out of the section taken an exclusive lock saves a couple
>> +  * of instructions.
>> +  */
>> + progress_lsn = GetProgressRecPtr();
>
> too long for my taste. How about:
> /* get progress, before acuiring insert locks to shorten locked section */

Check.

> Looks much better now.

Thanks. Let's wrap that! An updated patch is attached.
-- 
Michael


hs-checkpoints-v5.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL Re-Writes

2016-02-07 Thread Amit Kapila
On Wed, Feb 3, 2016 at 7:12 PM, Robert Haas  wrote:
>
> On Wed, Feb 3, 2016 at 7:28 AM, Amit Kapila 
wrote:
> > On further testing, it has been observed that misaligned writes could
> > cause reads even when blocks related to file are not in-memory, so
> > I think what Jan is describing is right.  The case where there is
> > absolutely zero chance of reads is when we write in OS-page boundary
> > which is generally 4K.  However I still think it is okay to provide an
> > option for  WAL writing in smaller chunks (512 bytes , 1024 bytes, etc)
> > for the cases when these are beneficial like when wal_level is
> > greater than equal to Archive and keep default as OS-page size if
> > the same is smaller than 8K.
>
> Hmm, a little research seems to suggest that 4kB pages are standard on
> almost every system we might care about: x86_64, x86, Power, Itanium,
> ARMv7.  Sparc uses 8kB, though, and a search through the Linux kernel
> sources (grep for PAGE_SHIFT) suggests that there are other obscure
> architectures that can at least optionally use larger pages, plus a
> few that can use smaller ones.
>
> I'd like this to be something that users don't have to configure, and
> it seems like that should be possible.  We can detect the page size on
> non-Windows systems using sysctl(_SC_PAGESIZE), and on Windows by
> using GetSystemInfo.  And I think it's safe to make this decision at
> configure time, because the page size is a function of the hardware
> architecture (it seems there are obscure systems that support multiple
> page sizes, but I don't care about them particularly).  So what I
> think we should do is set an XLOG_WRITESZ along with XLOG_BLCKSZ and
> set it to the smaller of XLOG_BLCKSZ and the system page size.  If we
> can't determine the system page size, assume 4kB.
>

I think deciding it automatically without user require to configure it,
certainly has merits, but what about some cases where user can get
benefits by configuring themselves like the cases where we use
PG_O_DIRECT flag for WAL (with o_direct, it will by bypass OS
buffers and won't cause misaligned writes even for smaller chunk sizes
like 512 bytes or so).  Some googling [1] reveals that other databases
also provides user with option to configure wal block/chunk size (as
BLOCKSIZE), although they seem to decide chunk size based on
disk-sector size.

An additional thought, which is not necessarily related to this patch is,
if user chooses and or we decide to write in 512 bytes sized chunks,
which is usually a disk sector size, then can't we think of avoiding
CRC for each record for such cases, because each WAL write in
it-self will be atomic.  While reading, if we process in wal-chunk-sized
units, then I think it should be possible to detect end-of-wal based
on data read.

[1] -
http://docs.oracle.com/cd/E11882_01/server.112/e41084/clauses004.htm#SQLRF52268

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-02-07 Thread kharagesuraj
hello,

I have tested v7 patch.
but i think you forgot to remove some debug points in patch from
src/backend/replication/syncrep.c file.

for (i = 0; i < num_sync; i++)
+   {
+   elog(WARNING, "sync_standbys[%d] = %d", i, sync_standbys[i]);
+   }
+   elog(WARNING, "num_sync = %d, s_s_num = %d", num_sync,
synchronous_standby_num);

Please correct my understanding if i am wrong.

Regards
Suraj Kharage 





--
View this message in context: 
http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5886259.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: multiple psql option -c

2016-02-07 Thread Pavel Stehule
2016-02-05 2:35 GMT+01:00 Peter Eisentraut :

> On 12/29/15 10:38 AM, Bruce Momjian wrote:
> > On Thu, Dec 10, 2015 at 11:10:55AM -0500, Robert Haas wrote:
> >> On Wed, Dec 9, 2015 at 12:15 AM, Pavel Stehule 
> wrote:
>  On Wed, Dec 9, 2015 at 2:07 PM, Pavel Stehule <
> pavel.steh...@gmail.com>
>  wrote:
> > should be noted, recorded somewhere so this introduce possible
> > compatibility
> > issue - due default processing .psqlrc.
> 
>  That's written in the commit log, so I guess that's fine.
> >>>
> >>> ook
> >>
> >> Bruce uses the commit log to prepare the release notes, so I guess
> >> he'll make mention of this.
> >
> > Yes, I will certainly see it.
>
> I generally use the master branch psql for normal work, and this change
> has caused massive breakage for me.  It's straightforward to fix, but in
> some cases the breakage is silent, for example if you do
> something=$(psql -c ...) and the .psqlrc processing causes additional
> output.  I'm not sure what to make of it yet, but I want to mention it,
> because I fear there will be heartache.
>

We can still introduce some system environment, that disable psqlrc globally

PGNOPSQLRC=1 psql ...

Regards

Pavel