Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-06 Thread Tom Lane
Tomas Vondra  writes:
> I believe the attached patch should fix this by actually copying the 
> tuples into the densely allocated chunks. Haven't tested it though, will 
> do in a few hours.

BTW, I confirmed that this patch fixes the wrong-number-of-output-tuples
issue in the test case from bug #13908.  So that shows that the diagnosis
is correct.  We still need to clean up the patch, but this way does work
to fix the problem.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-06 Thread Tomas Vondra



On 02/06/2016 08:39 PM, Andres Freund wrote:

On 2016-02-06 20:34:07 +0100, Tomas Vondra wrote:

On 02/06/2016 06:47 PM, Tom Lane wrote:

* It incorporates a bespoke reimplementation of palloc into hash
joins. This is not a maintainable/scalable way to go about reducing
memory consumption. It should have been done with an arm's-length API
to a new type of memory context, IMO (probably one that supports
palloc but not pfree, repalloc, or any chunk-header-dependent
operations).


Hmmm, interesting idea. I've been thinking about doing this using a memory
context when writing the dense allocation, but was stuck in the "must
support all operations" mode, making it impossible. Disallowing some of the
operations would make it a viable approach, I guess.


FWIW, I've done that at some point. Noticeable speedups (that's what
I cared about), but a bit annoying to use. There's many random
pfree()s around, and then there's MemoryContextContains(),
GetMemoryChunkContext(), GetMemoryChunkSpace() - which all are
pretty fundamentally incompatible with such an allocator. I ended up
having a full header when assertions are enabled, to be able to
detect usage of these functions and assert out.

I didn't concentrate on improving memory usage, but IIRC it was even
noticeable for some simpler things.


I think the hassle is not that bad when most of the fragments have the 
same life cycle. With hashjoin that's almost exactly the case, except 
when we realize we need to increase the number of buckets - in that case 
we need to split the set of accumulated tuples in two.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Recently added typedef "string" is a horrid idea

2016-02-06 Thread Tom Lane
I see that commit b47b4dbf6 added this to varlena.c:

typedef struct varlena string;

This is a remarkably bad idea.  It will cause pgindent to do strange
things anywhere it sees a variable or field named "string", of which
we have quite a few.  Remember that the effects of typedef names are
*global*, so far as pgindent is concerned; not only varlena.c will
be affected.

Please rename this typedef with some less-generic name.  Probably
some of the other identifiers added in the same commit should be
adjusted to match.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-06 Thread Tom Lane
Tomas Vondra  writes:
> On 02/06/2016 08:39 PM, Andres Freund wrote:
>> FWIW, I've done that at some point. Noticeable speedups (that's what
>> I cared about), but a bit annoying to use. There's many random
>> pfree()s around, and then there's MemoryContextContains(),
>> GetMemoryChunkContext(), GetMemoryChunkSpace() - which all are
>> pretty fundamentally incompatible with such an allocator. I ended up
>> having a full header when assertions are enabled, to be able to
>> detect usage of these functions and assert out.
>> 
>> I didn't concentrate on improving memory usage, but IIRC it was even
>> noticeable for some simpler things.

> I think the hassle is not that bad when most of the fragments have the 
> same life cycle. With hashjoin that's almost exactly the case, except 
> when we realize we need to increase the number of buckets - in that case 
> we need to split the set of accumulated tuples in two.

Yeah, I think that a context type that just admits "we'll crash if you try
to pfree" would only be usable for allocations that are managed by just
a very small amount of code --- but the hashjoin tuple table qualifies,
and I think there would be other use-cases, perhaps tuplesort/tuplestore.

Andres' idea of adding a chunk header only in assert builds isn't a bad
one, perhaps; though I think the near-certainty of a core dump if you try
to use the header for anything might be good enough.  pfree and repalloc
are an ironclad certainty to crash in a pretty obvious way, and we could
likely add some assert checks to MemoryContextContains and friends to make
them 99.99% certain to fail without paying the price of a chunk header.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-06 Thread Tom Lane
Tomas Vondra  writes:
> On 02/06/2016 06:47 PM, Tom Lane wrote:
>> I note also that while the idea of ExecHashRemoveNextSkewBucket is
>> to reduce memory consumed by the skew table to make it available to
>> the main hash table, in point of fact it's unlikely that the freed
>> space will be of any use at all, since it will be in tuple-sized
>> chunks far too small for dense_alloc's requests. So the spaceUsed
>> bookkeeping being done there is an academic exercise unconnected to
>> reality, and we need to rethink what the space management plan is.

> I don't follow. Why would these three things (sizes of allocations in 
> skew buckets, chunks in dense allocator and accounting) be related?

Well, what we're trying to do is ensure that the total amount of space
used by the hashjoin table doesn't exceed spaceAllowed.  My point is
that it's kind of cheating to ignore space used-and-then-freed if your
usage pattern is such that that space isn't likely to be reusable.
A freed skew tuple represents space that would be reusable for another
skew tuple, but is probably *not* reusable for the main hash table;
so treating that space as interchangeable is wrong I think.

I'm not entirely sure where to go with that thought, but maybe the
answer is that we should just treat the skew table and main table
storage pools as entirely separate with independent limits.  That's
not what's happening right now, though.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-06 Thread Tomas Vondra



On 02/06/2016 09:55 PM, Tom Lane wrote:

Tomas Vondra  writes:

On 02/06/2016 06:47 PM, Tom Lane wrote:

I note also that while the idea of ExecHashRemoveNextSkewBucket is
to reduce memory consumed by the skew table to make it available to
the main hash table, in point of fact it's unlikely that the freed
space will be of any use at all, since it will be in tuple-sized
chunks far too small for dense_alloc's requests. So the spaceUsed
bookkeeping being done there is an academic exercise unconnected to
reality, and we need to rethink what the space management plan is.



I don't follow. Why would these three things (sizes of allocations in
skew buckets, chunks in dense allocator and accounting) be related?


Well, what we're trying to do is ensure that the total amount of
space used by the hashjoin table doesn't exceed spaceAllowed. My
point is that it's kind of cheating to ignore space
used-and-then-freed if your usage pattern is such that that space
isn't likely to be reusable. A freed skew tuple represents space that
would be reusable for another skew tuple, but is probably *not*
reusable for the main hash table; so treating that space as
interchangeable is wrong I think.


Ah, I see. And I agree that treating those areas as equal is wrong.


I'm not entirely sure where to go with that thought, but maybe the
answer is that we should just treat the skew table and main table
storage pools as entirely separate with independent limits. That's
not what's happening right now, though.


What about using the dense allocation even for the skew buckets, but not 
one context for all skew buckets but one per bucket? Then when we delete 
a bucket, we simply destroy the context (and free the chunks, just like 
we do with the current dense allocator).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimization- Check the set of conditionals on a WHERE clause against CHECK constraints.

2016-02-06 Thread Tom Lane
Shubham Barai  writes:
> I was searching for project ideas and found this

> 1.Optimization- Check the set of conditionals on a WHERE clause against
> CHECK constraints  on the table being queried and remove any conditionals
> which *must* be true due to the CHECK constraints.

> Is it expensive for  simple queries?

> will it optimize  general queries ?

TBH, it sounds like it would almost always be a waste of planning cycles.
I doubt that typical applications would issue many queries in which such
an optimization would apply.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-06 Thread Tom Lane
Tomas Vondra  writes:
> What about using the dense allocation even for the skew buckets, but not 
> one context for all skew buckets but one per bucket? Then when we delete 
> a bucket, we simply destroy the context (and free the chunks, just like 
> we do with the current dense allocator).

Yeah, I was wondering about that too, but it only works if you have quite
a few tuples per skew bucket, else you waste a lot of space.  And you were
right upthread that what we're collecting is keys expected to be common in
the outer table, not the inner table.  So it's entirely likely that the
typical case is just one inner tuple per skew bucket.  (Should check that
out though ...)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

2016-02-06 Thread Michael Paquier
On Sun, Feb 7, 2016 at 2:49 AM, Andres Freund  wrote:
> On 2016-02-06 22:03:15 +0900, Michael Paquier wrote:
>> + /*
>> +  * Update the progress LSN positions. At least one WAL insertion lock
>> +  * is already taken appropriately before doing that, and it is just 
>> more
>> +  * simple to do that here where WAL record data and type is at hand.
>> +  * The progress is set at the start position of the record tracked that
>> +  * is being added, making easier checkpoint progress tracking as the
>> +  * control file already saves the start LSN position of the last
>> +  * checkpoint run.
>> +  */
>> + if (!isStandbySnapshot)
>> + {
>
> I don't like isStandbySnapshot much, it seems better to do this more
> generally, via a flag passed down by the inserter.

Instead of updating every single call of XLogInsert() in the system,
what do you think about introducing a new routine XLogInsertExtended()
that would have this optional flag? This would wrap the existing
XLogInsert() and pass the flag to XLogInsertRecord().
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [Reveiw] Idle In Transaction Session Timeout, revived

2016-02-06 Thread Stéphane Schildknecht
On 31/01/2016 14:33, Vik Fearing wrote:
> Attached is a rebased and revised version of my
> idle_in_transaction_session_timeout patch from last year.
> 
> This version does not suffer the problems the old one did where it would
> jump out of SSL code thanks to Andres' patch in commit
> 4f85fde8eb860f263384fffdca660e16e77c7f76.
> 
> The basic idea is if a session remains idle in a transaction for longer
> than the configured time, that connection will be dropped thus releasing
> the connection slot and any locks that may have been held by the broken
> client.
> 
> Added to the March commitfest.
> 
> 
> 
> 
Hello,

I've looked at this patch, which I'd be able to review as a user, probably not
at a code level.
It seems to me this is a need in a huge number of badly handled idle in
transaction sessions (at application level).

This feature works as I expected it to.
My question would be regarding the value 0 assigned to the GUC parameter to
disable it. Wouldn't be -1 a better value, similar to
log_min_duration_statement or similar GUC parameter?

(I understand you can't put a 0ms timeout duration, but -1 seems more
understandable).

Best regards,
-- 
Stéphane Schildknecht
Contact régional PostgreSQL pour l'Europe francophone
Loxodata - Conseil, expertise et formations
06.17.11.37.42


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: make NOTIFY list de-duplication optional

2016-02-06 Thread Tom Lane
Brendan Jurd  writes:
> On Sat, 6 Feb 2016 at 12:50 Tom Lane  wrote:
>> Yeah, I agree that a GUC for this is quite unappetizing.

> How would you feel about a variant for calling NOTIFY?

If we decide that this ought to be user-visible, then an extra NOTIFY
parameter would be the way to do it.  I'd much rather it "just works"
though.  In particular, if we do start advertising user control of
de-duplication, we are likely to start getting bug reports about every
case where it's inexact, eg the no-checks-across-subxact-boundaries
business.

> Optimising the remove-duplicates path is still probably a worthwhile
> endeavour, but if the user really doesn't care at all about duplication, it
> seems silly to force them to pay any performance price for a behaviour they
> didn't want, no?

I would only be impressed with that argument if it could be shown that
de-duplication was a significant fraction of the total cost of a typical
NOTIFY cycle.  Obviously, you can make the O(N^2) term dominate if you
try, but I really doubt that it's significant for reasonable numbers of
notify events per transaction.  One should also keep in mind that
duplicate events are going to cost extra processing on the
client-application side, too.  In my experience with using NOTIFY, that
cost probably dwarfs the cost of emitting the messages.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] silent data loss with ext4 / all current versions

2016-02-06 Thread Andres Freund
On 2016-02-06 17:43:48 +0100, Tomas Vondra wrote:
> >Still the data is here... But well. I won't insist.
> 
> Huh? This thread started by an example how to cause loss of committed
> transactions. That fits my definition of "data loss" quite well.

Agreed, that view doesn't seem to make much sense. This clearly is a
data loss issue.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

2016-02-06 Thread Andres Freund
On 2016-02-06 22:03:15 +0900, Michael Paquier wrote:
> The patch attached will apply on master, on 9.5 there is one minor
> conflict. For older versions we will need another reworked patch.

FWIW, I don't think we should backpatch this. It'd look noticeably
different in the back branches, and this isn't really a critical
issue. I think it makes sense to see this as an optimization.

> + /*
> +  * Update the progress LSN positions. At least one WAL insertion lock
> +  * is already taken appropriately before doing that, and it is just more
> +  * simple to do that here where WAL record data and type is at hand.
> +  * The progress is set at the start position of the record tracked that
> +  * is being added, making easier checkpoint progress tracking as the
> +  * control file already saves the start LSN position of the last
> +  * checkpoint run.
> +  */
> + if (!isStandbySnapshot)
> + {

I don't like isStandbySnapshot much, it seems better to do this more
generally, via a flag passed down by the inserter.

> + if (holdingAllLocks)
> + {
> + int i;
> +
> + for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
> + WALInsertLocks[i].l.progressAt = StartPos;

Why update all?

>  /*
> + * GetProgressRecPtr -- Returns the newest WAL activity position, aimed
> + * at the last significant WAL activity, or in other words any activity
> + * not referring to standby logging as of now. Finding the last activity
> + * position is done by scanning each WAL insertion lock by taking directly
> + * the light-weight lock associated to it.
> + */
> +XLogRecPtr
> +GetProgressRecPtr(void)
> +{
> + XLogRecPtr  res = InvalidXLogRecPtr;
> + int i;
> +
> + /*
> +  * Look at the latest LSN position referring to the activity done by
> +  * WAL insertion.
> +  */
> + for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
> + {
> + XLogRecPtr  progress_lsn;
> +
> + LWLockAcquire([i].l.lock, LW_EXCLUSIVE);
> + progress_lsn = WALInsertLocks[i].l.progressAt;
> + LWLockRelease([i].l.lock);

Add a comment explaining that we a) need a lock because of the potential
for "torn reads" on some platforms. b) need an exclusive one, because
the insert lock logic currently only expects exclusive locks.

>   /*
> +  * Fetch the progress position before taking any WAL insert lock. This
> +  * is normally an operation that does not take long, but leaving this
> +  * lookup out of the section taken an exclusive lock saves a couple
> +  * of instructions.
> +  */
> + progress_lsn = GetProgressRecPtr();

too long for my taste. How about:
/* get progress, before acuiring insert locks to shorten locked section */


Looks much better now.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-06 Thread Tom Lane
Tomas Vondra  writes:
> On 02/06/2016 02:34 AM, Tom Lane wrote:
>> I have sussed what's happening in bug #13908. Basically, commit
>> 45f6240a8fa9d355 ("Pack tuples in a hash join batch densely, to save
>> memory") broke things for the case where a hash join is using a skew
>> table.

> Damn, that's an embarrassing oversight :-/

> I believe the attached patch should fix this by actually copying the 
> tuples into the densely allocated chunks. Haven't tested it though, will 
> do in a few hours.

Yeah, that's one fix approach I was contemplating last night.  (I think
the patch as written leaks memory and doesn't account for space usage
properly either, but certainly this is a direction we could take.)

The other answer I was thinking about was to get rid of the assumption
that iterating over the chunk storage is a valid thing to do, and
instead scan the hashbucket chains when we need to visit all tuples.
I really do not like the patch as designed, for several reasons:

* It incorporates a bespoke reimplementation of palloc into hash joins.
This is not a maintainable/scalable way to go about reducing memory
consumption.  It should have been done with an arm's-length API to a
new type of memory context, IMO (probably one that supports palloc but
not pfree, repalloc, or any chunk-header-dependent operations).

* No arm's-length API would conceivably allow remote callers to iterate
over all allocated chunks in the way this code does, which is why we need
to get rid of that behavior.

* There's no way to delete single tuples from the hash table given this
coding, which no doubt is why you didn't migrate the skew tuples into
this representation; but it doesn't seem like a very future-proof data
structure.

* Not doing anything for the skew tuples doesn't seem very good either,
considering the whole point of that sub-module is that there are likely
to be a lot of them.  I note also that while the idea of
ExecHashRemoveNextSkewBucket is to reduce memory consumed by the skew
table to make it available to the main hash table, in point of fact it's
unlikely that the freed space will be of any use at all, since it will be
in tuple-sized chunks far too small for dense_alloc's requests.  So the
spaceUsed bookkeeping being done there is an academic exercise unconnected
to reality, and we need to rethink what the space management plan is.

So I'm of the opinion that a great deal more work is needed here.
But it's not something we're going to be able to get done for 9.5.1,
or realistically 9.5.anything.  Whereas adding additional klugery to
ExecHashRemoveNextSkewBucket probably is doable over the weekend.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: make NOTIFY list de-duplication optional

2016-02-06 Thread Filip Rembiałkowski
On Sat, Feb 6, 2016 at 5:52 PM, Tom Lane  wrote:
> Brendan Jurd  writes:
>> On Sat, 6 Feb 2016 at 12:50 Tom Lane  wrote:
>>> Yeah, I agree that a GUC for this is quite unappetizing.
>
>> How would you feel about a variant for calling NOTIFY?
>
> If we decide that this ought to be user-visible, then an extra NOTIFY
> parameter would be the way to do it.  I'd much rather it "just works"
> though.  In particular, if we do start advertising user control of
> de-duplication, we are likely to start getting bug reports about every
> case where it's inexact, eg the no-checks-across-subxact-boundaries
> business.

It is not enough to say "database server can decide to deliver a
single notification only." - which is already said in the docs?

The ALL keyword would be a clearly separated "do-nothing" version.

>
>> Optimising the remove-duplicates path is still probably a worthwhile
>> endeavour, but if the user really doesn't care at all about duplication, it
>> seems silly to force them to pay any performance price for a behaviour they
>> didn't want, no?
>
> I would only be impressed with that argument if it could be shown that
> de-duplication was a significant fraction of the total cost of a typical
> NOTIFY cycle.

Even if a typical NOTIFY cycle excludes processing 10k or 100k
messages, why penalize users who have bigger transactions?

> Obviously, you can make the O(N^2) term dominate if you
> try, but I really doubt that it's significant for reasonable numbers of
> notify events per transaction.

Yes, it is hard to observe for less than few thousands messages in one
transaction.
But big data happens. And then the numbers get really bad.
In my test for 40k messages, it is 400 ms versus 9 seconds. 22 times
slower. For 200k messages, it is 2 seconds  versus 250 seconds. 125
times slower.
And I tested with very short payload strings, so strcmp() had not much to do.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-06 Thread Tomas Vondra

On 02/06/2016 06:47 PM, Tom Lane wrote:

Tomas Vondra  writes:

On 02/06/2016 02:34 AM, Tom Lane wrote:

I have sussed what's happening in bug #13908. Basically, commit
45f6240a8fa9d355 ("Pack tuples in a hash join batch densely, to save
memory") broke things for the case where a hash join is using a skew
table.



Damn, that's an embarrassing oversight :-/



I believe the attached patch should fix this by actually copying
the tuples into the densely allocated chunks. Haven't tested it
though, will do in a few hours.


Yeah, that's one fix approach I was contemplating last night. (I
think the patch as written leaks memory and doesn't account for
space usage properly either, but certainly this is a direction we
could take.)


Yes, it definitely needs more work (to free the original tuple copy 
after moving it into the dense_alloc chunk).



The other answer I was thinking about was to get rid of the
assumption that iterating over the chunk storage is a valid thing to
do, and instead scan the hashbucket chains when we need to visit all
tuples. I really do not like the patch as designed, for several
reasons:

* It incorporates a bespoke reimplementation of palloc into hash
joins. This is not a maintainable/scalable way to go about reducing
memory consumption. It should have been done with an arm's-length API
to a new type of memory context, IMO (probably one that supports
palloc but not pfree, repalloc, or any chunk-header-dependent
operations).


Hmmm, interesting idea. I've been thinking about doing this using a 
memory context when writing the dense allocation, but was stuck in the 
"must support all operations" mode, making it impossible. Disallowing 
some of the operations would make it a viable approach, I guess.



* No arm's-length API would conceivably allow remote callers to
iterate over all allocated chunks in the way this code does, which is
why we need to get rid of that behavior.


I'm not convinced we should throw away the idea of walking the chunks. I 
think it's kinda neat and I've been playing with postponing constructing 
the buckets until the very end of Hash build - it didn't work as good as 
expected, but I'm not ready to throw in the towel yet.


But perhaps the new memory context implementation could support some 
sort of iterator ...



* There's no way to delete single tuples from the hash table given
this coding, which no doubt is why you didn't migrate the skew tuples
into this representation; but it doesn't seem like a very
future-proof data structure.


I don't recall, it may be one of the reasons why the skew buckets use 
regular allocation. But I don't see how using a new type memory context 
could solve this, as it won't support pfree either. Maybe using a 
separate context for each skew bucket?



* Not doing anything for the skew tuples doesn't seem very good
either, considering the whole point of that sub-module is that there
are likely to be a lot of them.


Maybe I'm missing something, but I thought that the values tracked in 
skew buckets are the MCVs from the outer table, in the hope that we'll 
reduce the amount of data that needs to be spilled to disk when batching 
the outer relation.


I don't see why there should be a lot of them in the inner relation 
(well, I can imagine cases like that, but in my experience those are 
rare cases).



I note also that while the idea of ExecHashRemoveNextSkewBucket is
to reduce memory consumed by the skew table to make it available to
the main hash table, in point of fact it's unlikely that the freed
space will be of any use at all, since it will be in tuple-sized
chunks far too small for dense_alloc's requests. So the spaceUsed
bookkeeping being done there is an academic exercise unconnected to
reality, and we need to rethink what the space management plan is.


I don't follow. Why would these three things (sizes of allocations in 
skew buckets, chunks in dense allocator and accounting) be related?


FWIW the dense allocator actually made the memory accounting way more 
accurate, actually, as it eliminates most of the overhead that was not 
included in spaceUsed before.



So I'm of the opinion that a great deal more work is needed here. But
it's not something we're going to be able to get done for 9.5.1, or
realistically 9.5.anything. Whereas adding additional klugery to
ExecHashRemoveNextSkewBucket probably is doable over the weekend.


Agreed.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] silent data loss with ext4 / all current versions

2016-02-06 Thread Tomas Vondra

Hi,

On 02/06/2016 01:16 PM, Michael Paquier wrote:

On Sat, Feb 6, 2016 at 2:11 AM, Tomas Vondra
 wrote:

On 02/04/2016 09:59 AM, Michael Paquier wrote:


On Tue, Feb 2, 2016 at 9:59 AM, Andres Freund  wrote:


On 2016-02-02 09:56:40 +0900, Michael Paquier wrote:


And there is no actual risk of data loss



Huh?



More precise: what I mean here is that should an OS crash or a
power failure happen, we would fall back to recovery at next
restart, so we would not actually *lose* data.



Except that we actually can't perform the recovery properly
because we may not have the last WAL segment (or multiple
segments), so we can't replay the last batch of transactions. And
we don't even notice that.


Still the data is here... But well. I won't insist.


Huh? This thread started by an example how to cause loss of committed 
transactions. That fits my definition of "data loss" quite well.



Tomas, could you have a look at the latest patch I wrote? It would be
good to get fresh eyes on it. We could work on a version for ~9.4
once we have a clean approach for master/9.5.


Yep, I'll take a look - I've been out of office for the past 2 weeks, 
but I've been following the discussion and I agree with the changes 
discussed there (e.g. adding safe_rename and such).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: make NOTIFY list de-duplication optional

2016-02-06 Thread Andrew Dunstan



On 02/05/2016 08:49 PM, Tom Lane wrote:


Yeah, I agree that a GUC for this is quite unappetizing.


Agreed.



One idea would be to build a hashtable to aid with duplicate detection
(perhaps only once the pending-notify list gets long).

Another thought is that it's already the case that duplicate detection is
something of a "best effort" activity; note for example the comment in
AsyncExistsPendingNotify pointing out that we don't collapse duplicates
across subtransactions.  Would it be acceptable to relax the standards
a bit further?  For example, if we only checked for duplicates among the
last N notification list entries (for N say around 100), we'd probably
cover just about all the useful cases, and the runtime would stay linear.
The data structure isn't tremendously conducive to that, but it could be
done.





I like the hashtable idea if it can be made workable.

cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Optimization- Check the set of conditionals on a WHERE clause against CHECK constraints.

2016-02-06 Thread Shubham Barai
Hello hackers,

I was searching for project ideas and found this

1.Optimization- Check the set of conditionals on a WHERE clause against
CHECK constraints  on the table being queried and remove any conditionals
which *must* be true due to the CHECK constraints.

Is it expensive for  simple queries?

will it optimize  general queries ?

Any thoughts will be helpful.
Thanks.


Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-06 Thread Andres Freund
On 2016-02-06 20:34:07 +0100, Tomas Vondra wrote:
> On 02/06/2016 06:47 PM, Tom Lane wrote:
> >* It incorporates a bespoke reimplementation of palloc into hash
> >joins. This is not a maintainable/scalable way to go about reducing
> >memory consumption. It should have been done with an arm's-length API
> >to a new type of memory context, IMO (probably one that supports
> >palloc but not pfree, repalloc, or any chunk-header-dependent
> >operations).
> 
> Hmmm, interesting idea. I've been thinking about doing this using a memory
> context when writing the dense allocation, but was stuck in the "must
> support all operations" mode, making it impossible. Disallowing some of the
> operations would make it a viable approach, I guess.

FWIW, I've done that at some point. Noticeable speedups (that's what I
cared about), but a bit annoying to use. There's many random pfree()s
around, and then there's MemoryContextContains(),
GetMemoryChunkContext(), GetMemoryChunkSpace() - which all are pretty
fundamentally incompatible with such an allocator. I ended up having a
full header when assertions are enabled, to be able to detect usage of
these functions and assert out.

I didn't concentrate on improving memory usage, but IIRC it was even
noticeable for some simpler things.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: make NOTIFY list de-duplication optional

2016-02-06 Thread Filip Rembiałkowski
+1

... and a patch (only adding ALL keyword, no hash table implemented yet).



On Sat, Feb 6, 2016 at 2:35 PM, Brendan Jurd  wrote:
> On Sat, 6 Feb 2016 at 12:50 Tom Lane  wrote:
>>
>> Robert Haas  writes:
>> > I agree with what Merlin said about this:
>> >
>> > http://www.postgresql.org/message-id/CAHyXU0yoHe8Qc=yc10ahu1nfia1tbhsg+35ds-oeueuapo7...@mail.gmail.com
>>
>> Yeah, I agree that a GUC for this is quite unappetizing.
>
>
> How would you feel about a variant for calling NOTIFY?
>
> The SQL syntax could be something like "NOTIFY [ALL] channel, payload" where
> the ALL means "just send the notification already, nobody cares whether
> there's an identical one in the queue".
>
> Likewise we could introduce a three-argument form of pg_notify(text, text,
> bool) where the final argument is whether you are interested in removing
> duplicates.
>
> Optimising the remove-duplicates path is still probably a worthwhile
> endeavour, but if the user really doesn't care at all about duplication, it
> seems silly to force them to pay any performance price for a behaviour they
> didn't want, no?
>
> Cheers,
> BJ
diff --git a/doc/src/sgml/ref/notify.sgml b/doc/src/sgml/ref/notify.sgml
index 4dd5608..c148859 100644
--- a/doc/src/sgml/ref/notify.sgml
+++ b/doc/src/sgml/ref/notify.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  
 
-NOTIFY channel [ , payload ]
+NOTIFY [ ALL | DISTINCT ] channel [ , payload ]
 
  
 
@@ -105,6 +105,8 @@ NOTIFY channel [ , ALL is specified (contrary to DISTINCT, the
+   default), the server will deliver all notifications, including duplicates.
   
 
   
@@ -184,11 +186,14 @@ NOTIFY channel [ , 
 To send a notification you can also use the function
-pg_notify(text,
-text). The function takes the channel name as the
-first argument and the payload as the second. The function is much easier
-to use than the NOTIFY command if you need to work with
-non-constant channel names and payloads.
+pg_notify(channel text,
+payload text ,
+use_all boolean).
+The function takes the channel name as the first argument and the payload
+as the second. The third argument, false by default, represents
+the ALL keyword. The function is much easier to use than the
+NOTIFY command if you need to work with non-constant
+channel names and payloads.

   
  
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 923fe58..9df5301 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -965,3 +965,10 @@ RETURNS jsonb
 LANGUAGE INTERNAL
 STRICT IMMUTABLE
 AS 'jsonb_set';
+
+CREATE OR REPLACE FUNCTION
+  pg_notify(channel text, payload text, use_all boolean DEFAULT false)
+RETURNS void
+LANGUAGE INTERNAL
+VOLATILE
+AS 'pg_notify';
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index c39ac3a..d374a00 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -510,6 +510,7 @@ pg_notify(PG_FUNCTION_ARGS)
 {
 	const char *channel;
 	const char *payload;
+	bool use_all;
 
 	if (PG_ARGISNULL(0))
 		channel = "";
@@ -521,10 +522,12 @@ pg_notify(PG_FUNCTION_ARGS)
 	else
 		payload = text_to_cstring(PG_GETARG_TEXT_PP(1));
 
+	use_all = PG_GETARG_BOOL(2);
+
 	/* For NOTIFY as a statement, this is checked in ProcessUtility */
 	PreventCommandDuringRecovery("NOTIFY");
 
-	Async_Notify(channel, payload);
+	Async_Notify(channel, payload, use_all);
 
 	PG_RETURN_VOID();
 }
@@ -540,7 +543,7 @@ pg_notify(PG_FUNCTION_ARGS)
  *		^^
  */
 void
-Async_Notify(const char *channel, const char *payload)
+Async_Notify(const char *channel, const char *payload, bool use_all)
 {
 	Notification *n;
 	MemoryContext oldcontext;
@@ -570,9 +573,10 @@ Async_Notify(const char *channel, const char *payload)
 	 errmsg("payload string too long")));
 	}
 
-	/* no point in making duplicate entries in the list ... */
-	if (AsyncExistsPendingNotify(channel, payload))
-		return;
+	if (!use_all)
+		/* remove duplicate entries in the list */
+		if (AsyncExistsPendingNotify(channel, payload))
+			return;
 
 	/*
 	 * The notification list needs to live until end of transaction, so store
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index b307b48..7203f4a 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -8528,11 +8528,12 @@ DropRuleStmt:
  *
  */
 
-NotifyStmt: NOTIFY ColId notify_payload
+NotifyStmt: NOTIFY all_or_distinct ColId notify_payload
 {
 	NotifyStmt *n = makeNode(NotifyStmt);
-	n->conditionname = $2;
-	n->payload = $3;
+	n->use_all = $2;
+	n->conditionname = $3;
+	n->payload = $4;
 	$$ = (Node *)n;
 }
 		;
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 045f7f0..0e50561 

Re: [HACKERS] Recently added typedef "string" is a horrid idea

2016-02-06 Thread Robert Haas
On Sat, Feb 6, 2016 at 5:11 PM, Tom Lane  wrote:
> I see that commit b47b4dbf6 added this to varlena.c:
>
> typedef struct varlena string;
>
> This is a remarkably bad idea.  It will cause pgindent to do strange
> things anywhere it sees a variable or field named "string", of which
> we have quite a few.  Remember that the effects of typedef names are
> *global*, so far as pgindent is concerned; not only varlena.c will
> be affected.
>
> Please rename this typedef with some less-generic name.  Probably
> some of the other identifiers added in the same commit should be
> adjusted to match.

Oops.  I didn't foresee that outcome.  I'm not sure offhand what else
to call it, but I suppose we can come up with something.
"charactertype", maybe?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex

2016-02-06 Thread Robert Haas
On Thu, Jan 21, 2016 at 9:03 AM, Anastasia Lubennikova
 wrote:
> First of all, why not merge both patches into one? They aren't too big
> anyway.

So looking over this patch, it institutes a new coding rule that all
shared hash tables must have the same number of partitions, and that
number is hard-coded at 128.  The number of freelist partitions is
also hardcoded at 128.  I find this design surprising.  First, I would
not have expected that we'd need 128 freelist partitions.  We get by
just fine with only 16 lock manager partitions, at least AFAIK, and
there's no reason some future hashtable might not have little enough
contention that 16 partitions works just fine.  Even for the buffer
manager, which does have 128 partitions, I wouldn't assume that we
need 128 partitions for the free list just because we need 128
partitions for the locks themselves.  It's possible we do, but I'd
expect that existing buffer mapping locking might dissipate some of
that contention - e.g. if somebody's doing a buffer lookup on a
particular partition, they have a shared lock on that partition, so
nobody has an exclusive lock to be able to hit the freelist.  So: do
we have clear evidence that we need 128 partitions here, or might,
say, 16 work just fine?

Second, I don't see any reason why the number of free list partitions
needs to be a constant, or why it needs to be equal to the number of
hash bucket partitions.  That just seems like a completely unnecessary
constraint.  You can take the hash value, or whatever else you use to
pick the partition, modulo any number you want.  I don't really want
to increase the number of lock manager partition locks to 128 just for
the convenience of this patch, and even less do I want to impose the
requirement that every future shared hash table must use an equal
number of partitions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "using previous checkpoint record at" maybe not the greatest idea?

2016-02-06 Thread Amit Kapila
On Tue, Feb 2, 2016 at 5:28 AM, Andres Freund  wrote:
>
> Hi,
>
> currently if, when not in standby mode, we can't read a checkpoint
> record, we automatically fall back to the previous checkpoint, and start
> replay from there.
>
> Doing so without user intervention doesn't actually seem like a good
> idea. While not super likely, it's entirely possible that doing so can
> wreck a cluster, that'd otherwise easily recoverable. Imagine e.g. a
> tablespace being dropped - going back to the previous checkpoint very
> well could lead to replay not finishing, as the directory to create
> files in doesn't even exist.
>

I think there are similar hazards for deletion of relation when
relfilenode gets reused.  Basically, it can delete the data
for one of the newer relations which is created after the
last checkpoint.

> As there's, afaics, really no "legitimate" reasons for needing to go
> back to the previous checkpoint I don't think we should do so in an
> automated fashion.
>

I have tried to find out why at the first place such a mechanism has
been introduced and it seems to me that commit
4d14fe0048cf80052a3ba2053560f8aab1bb1b22 has introduced it, but
the reason is not apparent.  Then I digged through the archives
and found mail chain which I think has lead to this commit.
Refer [1][2].

If we want to do something for fallback-to-previous-checkpoint
mechanism, then I think it is worth considering whether we want
to retain xlog files from two checkpoints as that also seems to
have been introduced in the same commit.


> All the cases where I could find logs containing "using previous
> checkpoint record at" were when something else had already gone pretty
> badly wrong. Now that obviously doesn't have a very large significance,
> because in the situations where it "just worked" are unlikely to be
> reported...
>
> Am I missing a reason for doing this by default?
>

I am not sure, but may be such hazards won't exist at the time
fallback-to-previous-checkpoint mechanism has been introduced.

I think even if we want to make it non-default, it will be very
difficult for users to decide whether to turn it on or not.  Basically,
I think if such a situation occurs, what ever solution we try to
provide to user, it might not be full-proof, but OTOH we should
provide some way to allow user to start database and dump the
existing contents. Some of the options that comes to mind are
provide some way to get the last checkpoint record from WAL
or provide a way to compute max-lsn from data-pages and use
that with pg_resetxlog utility to allow user to start database.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2016-02-06 Thread Robert Haas
On Sat, Feb 6, 2016 at 12:46 AM, Ashutosh Bapat
 wrote:
> Here it is rebased. Thanks for the pgindent run and committing core changes.
> I have to manage only one patch now :)
>
> pgindent is giving trouble with following two comments
>
> 2213 /* Run time cost includes:
> 2214  * 1. Run time cost (total_cost - startup_cost) of
> relations being
> 2215  *joined
> 2216  * 2. Run time cost of applying join clauses on the cross
> product of
> 2217  *the joining relations.
> 2218  * 3. Run time cost of applying pushed down other clauses
> on the
> 2219  *result of join
> 2220  * 4. Run time cost of applying nonpushable other clauses
> locally
> 2221  *on the result fetched from the foreign server.
>   */
>
> which I want itemized with each item starting on separate line. pgindent
> just bunches everything together.

The thing to do here is leave a blank line between each one.  You can
also put a line of dashes before and after the comment (see many
examples elsewhere in the source tree) to force pgindent to leave that
section completely untouched, but I think that this sort of list looks
better with blank lines anyway, so I'd go for that solution.

> 1159 /*
> 1160  * For a join relation FROM clause entry is deparsed as
> 1161  * ((outer relation)  (inner relation) ON
> (joinclauses)
> 1162  */
> where I want the second line as a separate line, but pgindent puts those two
> line together breaking the continuity of second line content.
>
> How do I make pgindent respect those changes as they are?

Same idea here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] "using previous checkpoint record at" maybe not the greatest idea?

2016-02-06 Thread Amit Kapila
On Sun, Feb 7, 2016 at 10:54 AM, Amit Kapila 
wrote:
>
> On Tue, Feb 2, 2016 at 5:28 AM, Andres Freund  wrote:
> >
> > Hi,
> >
> > currently if, when not in standby mode, we can't read a checkpoint
> > record, we automatically fall back to the previous checkpoint, and start
> > replay from there.
> >
> > Doing so without user intervention doesn't actually seem like a good
> > idea. While not super likely, it's entirely possible that doing so can
> > wreck a cluster, that'd otherwise easily recoverable. Imagine e.g. a
> > tablespace being dropped - going back to the previous checkpoint very
> > well could lead to replay not finishing, as the directory to create
> > files in doesn't even exist.
> >
>
> I think there are similar hazards for deletion of relation when
> relfilenode gets reused.  Basically, it can delete the data
> for one of the newer relations which is created after the
> last checkpoint.
>
> > As there's, afaics, really no "legitimate" reasons for needing to go
> > back to the previous checkpoint I don't think we should do so in an
> > automated fashion.
> >
>
> I have tried to find out why at the first place such a mechanism has
> been introduced and it seems to me that commit
> 4d14fe0048cf80052a3ba2053560f8aab1bb1b22 has introduced it, but
> the reason is not apparent.  Then I digged through the archives
> and found mail chain which I think has lead to this commit.
> Refer [1][2].
>

oops, forgot to provide the links, providing them now.

[1] - http://www.postgresql.org/message-id/21559.983467...@sss.pgh.pa.us
[2] - http://www.postgresql.org/message-id/17254.984448...@sss.pgh.pa.us

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] pgbench small bug fix

2016-02-06 Thread Robert Haas
On Wed, Jan 27, 2016 at 2:31 PM, Fabien COELHO  wrote:
>
> While testing for something else I encountered two small bugs under very low
> rate (--rate=0.1). The attached patches fixes these.
>
>  - when a duration (-T) is specified, ensure that pgbench ends at that
>time (i.e. do not wait for a transaction beyond the end of the run).

Why does this use INSTR_TIME_GET_DOUBLE() and not INSTR_TIME_GET_MICROSEC()?

Also, why do we really need this change?  Won't the timer expiration
stop the thread at the right time anyway?  I mean, sure, in theory
it's wasteful for the thread to sit around doing nothing waiting for
the timer to expire, but it's not evident to me that hurts anything,
really.

>  - when there is a progress (-P) report, ensure that all progress
>reports are shown even if no more transactions are schedule.

That's pretty ugly - it would be easy for the test at the top of the
loop to be left out of sync with the similar test inside the loop by
some future patch.  And again, I wonder why this is really a bug.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Make timestamptz_out less slow.

2016-02-06 Thread Tom Lane
David Rowley  writes:
[ timestamp_out_speedup_2015-11-05.patch ]

Pushed with a bunch of cosmetic tweaks.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: CustomScan in a larger structure (RE: [HACKERS] CustomScan support on readfuncs.c)

2016-02-06 Thread Robert Haas
On Wed, Feb 3, 2016 at 11:57 PM, Kouhei Kaigai  wrote:
> At this moment, I tried to write up description at nodes/nodes.h.
> The amount of description is about 100lines. It is on a borderline
> whether we split off this chunk into another header file, in my sense.
>
>
> On the other hands, I noticed that we cannot omit checks for individual
> callbacks on Custom node type, ExtensibleNodeMethods is embedded in
> the CustomMethods structure, thus we may have Custom node with
> no extensible feature.
> This manner is beneficial because extension does not need to register
> the library and symbol name for serialization. So, CustomScan related
> code still checks existence of individual callbacks.

I was looking over this patch yesterday, and something was bugging me
about it, but I couldn't quite put my finger on what it was.  But now
I think I know.

I think of an extensible node type facility as something that ought to
be defined to allow a user to create new types of nodes.  But this is
not that.  What this does is allows you to have a CustomScan or
ForeignScan node - that is, the existing node type - which is actually
larger than a normal node of that type and has some extra data that is
part of it.  I'm having a really hard time being comfortable with that
concept.  Somehow, it seems like the node type should determine the
size of the node.  I can stretch my brain to the point of being able
to say, well, maybe if the node tag is T_ExtensibleNode, then you can
look at char *extnodename to figure out what type of node it is
really, and then from there get the size.  But what this does is:
every time you see a T_CustomScan or T_ForeignScan node, it might not
really be that kind of node but something else else, and tomorrow
there might be another half-dozen node types with a similar property.
And every one of those node types will have char *extnodename in a
different place in the structure, so a hypothetical piece of code that
wanted to find the extension methods for a node, or the size of a
node, would need a switch that knows about all of those node types.
It feels very awkward.

So I have a slightly different proposal.  Let's forget about letting
T_CustomScan or T_ForeignScan or any other built-in node type vary in
size.  Instead, let's add T_ExtensibleNode which acts as a completely
user-defined node.  It has read/out/copy/equalfuncs support along the
lines of what this patch implements, and that's it.  It's not a scan
or a path or anything like that: it's just an opaque data container
that the system can input, output, copy, and test for equality, and
nothing else.  Isn't that useless?  Not at all.  If you're creating an
FDW or custom scan provider and want to store some extra data, you can
create a new type of extensible node and stick it in the fdw_private
or custom_private field!  The data won't be part of the CustomScan or
ForeignScan structure itself, as in this patch, but who cares?  The
only objection I can see is that you still need several pointer
deferences to get to the data since fdw_private is a List, but if
that's actually a real performance problem we could probably fix it by
just changing fdw_private to a Node instead.  You'd still need one
pointer dereference, but that doesn't seem too bad.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Explanation for bug #13908: hash joins are badly broken

2016-02-06 Thread Tomas Vondra

On 02/06/2016 02:34 AM, Tom Lane wrote:

I have sussed what's happening in bug #13908. Basically, commit
45f6240a8fa9d355 ("Pack tuples in a hash join batch densely, to save
memory") broke things for the case where a hash join is using a skew
table. The reason is that that commit only changed the storage of
tuples going into the main hash table; tuples going into the skew
table are still allocated with a palloc apiece, without being put
into the "chunk" storage. Now, if we're loading the hash table and we
find that we've exceeded the storage allowed for skew tuples,
ExecHashRemoveNextSkewBucket wants to push some skew tuples back into
the main hash table; and it believes that linking such tuples into
the appropriate main hashbucket chain is sufficient for that. Which
it was before the aforesaid commit, and still is in simple cases.
However, if we later try to do ExecHashIncreaseNumBatches, that
function contains new code that assumes that it can find all tuples
in the main hashtable by scanning the "chunk" storage directly. Thus,
the pushed-back tuples are not scanned and are neither re-entered
into the hash table nor dumped into a batch file. So they won't get
joined.


Damn, that's an embarrassing oversight :-/

I believe the attached patch should fix this by actually copying the 
tuples into the densely allocated chunks. Haven't tested it though, will 
do in a few hours.



It looks like ExecHashIncreaseNumBuckets, if it were to run after
some executions of ExecHashRemoveNextSkewBucket, would break things
in the same way. That's not what's happening in this particular test
case, though.


ExecHashIncreaseNumBuckets assumes all the tuples can be reached by 
simply walking the chunks (from dense_alloc). So if removing skew bucket 
only updates pointers in buckets, that gets broken. But I don't think 
that's a bug in ExecHashIncreaseNumBuckets and should be resolved by 
fixing ExecHashRemoveNextSkewBucket.



I'm of the opinion that this is a stop-ship bug for 9.5.1. Barring
somebody producing a fix over the weekend, I will deal with it by
reverting the aforementioned commit.


I think it's not quite possible to revert just the one commit as the 
other hashjoin improvements in 9.5 built on top of that. So the revert 
would either be quite invasive (requiring more code changes than the 
fix), or we'd have to revert all the hashjoin goodies.


FWIW I'm willing to put some time into fixing this over the weekend.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 8a8fdf2..653faf5 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -1576,9 +1576,16 @@ ExecHashRemoveNextSkewBucket(HashJoinTable hashtable)
 		/* Decide whether to put the tuple in the hash table or a temp file */
 		if (batchno == hashtable->curbatch)
 		{
+			/* keep tuple in memory - copy it into the new chunk */
+			HashJoinTuple copyTuple;
+
+			copyTuple = (HashJoinTuple) dense_alloc(hashtable, tupleSize);
+			memcpy(copyTuple, hashTuple, tupleSize);
+
 			/* Move the tuple to the main hash table */
-			hashTuple->next = hashtable->buckets[bucketno];
-			hashtable->buckets[bucketno] = hashTuple;
+			copyTuple->next = hashtable->buckets[bucketno];
+			hashtable->buckets[bucketno] = copyTuple;
+
 			/* We have reduced skew space, but overall space doesn't change */
 			hashtable->spaceUsedSkew -= tupleSize;
 		}

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] silent data loss with ext4 / all current versions

2016-02-06 Thread Michael Paquier
On Sat, Feb 6, 2016 at 2:11 AM, Tomas Vondra
 wrote:
> On 02/04/2016 09:59 AM, Michael Paquier wrote:
>>
>> On Tue, Feb 2, 2016 at 9:59 AM, Andres Freund  wrote:
>>>
>>> On 2016-02-02 09:56:40 +0900, Michael Paquier wrote:

 And there is no actual risk of data loss
>>>
>>>
>>> Huh?
>>
>>
>> More precise: what I mean here is that should an OS crash or a power
>> failure happen, we would fall back to recovery at next restart, so we
>> would not actually *lose* data.
>
>
> Except that we actually can't perform the recovery properly because we may
> not have the last WAL segment (or multiple segments), so we can't replay the
> last batch of transactions. And we don't even notice that.

Still the data is here... But well. I won't insist. Tomas, could you
have a look at the latest patch I wrote? It would be good to get fresh
eyes on it. We could work on a version for ~9.4 once we have a clean
approach for master/9.5.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

2016-02-06 Thread Michael Paquier
On Thu, Feb 4, 2016 at 6:38 PM, Michael Paquier
 wrote:
> On Thu, Feb 4, 2016 at 4:10 PM, Andres Freund  wrote:
>> On 2016-02-04 18:21:41 +0530, Amit Kapila wrote:
>>> I think generally it is good idea, but one thing worth a thought is that
>>> by doing so, we need to acquire all WAL Insertion locks every
>>> LOG_SNAPSHOT_INTERVAL_MS to check the last_insert_pos for
>>> every slot, do you think it is matter of concern in any way for write
>>> workloads or it won't effect as we need to do this periodically?
>>
>> Michael and I just had an in-person discussion, and one of the topics
>> was that. The plan was basically to adapt the patch to:
>> 1) Store the progress lsn inside the wal insert lock
>> 2) Change the HasActivity API to return an the last LSN at which there
>>was activity, instead of a boolean.
>> 3) Individually acquire each insert locks's lwlock to get it's progress
>>LSN, but not the exclusive insert lock. We need the lwllock to avoid
>>a torn 8byte read on some platforms.
>
> 4) Switch the condition to decide if a checkpoint should be skipped
> using the last activity position compared with ProcLastRecPtr in
> CreateCheckpoint to see if any activity has occurred since the
> checkpoint record was inserted, and do not care anymore if the
> previous record and current record are on different segments. This
> would basically work.

OK, attached is a patch that I believe addresses those issues. The
patch still has a couple of LOG entries that we had better remove in
the version that gets pushed, but they have proved to be useful for me
when testing the patch with a low checkpoint_timeout value to see if
checkpoints are properly skipped on an idle system. I found myself
adding another routine called GetLastCheckpointRecPtr() for the
bgwriter because ControlFile is not declared externally even if it is
in shared memory. I think that's better this way.

The original bug report referred to a low archive_timeout causing
standby snapshots to be logged when segments are switched. So we may
want at the end to not update the progress LSN for segment switch
records as well, but I have let that out of the patch for the time
being to address the primary concern of unnecessary checkpoints for
wal_level >= hs. We could address it with a later patch (planning to
do so), let's keep a step-by-step approach.

The patch attached will apply on master, on 9.5 there is one minor
conflict. For older versions we will need another reworked patch. I am
fine to produce those once we are fine with the shape of what gets
into master and 9.5. 9.4 uses WAL insert locks so things are rather
similar. For ~9.3, I think that we are going to need a single variable
in XLogCtl or similar to track the progress, but I have not looked
into that in details yet.
-- 
Michael


hs-checkpoints-v4.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: make NOTIFY list de-duplication optional

2016-02-06 Thread Brendan Jurd
On Sat, 6 Feb 2016 at 12:50 Tom Lane  wrote:

> Robert Haas  writes:
> > I agree with what Merlin said about this:
> >
> http://www.postgresql.org/message-id/CAHyXU0yoHe8Qc=yc10ahu1nfia1tbhsg+35ds-oeueuapo7...@mail.gmail.com
>
> Yeah, I agree that a GUC for this is quite unappetizing.
>

How would you feel about a variant for calling NOTIFY?

The SQL syntax could be something like "NOTIFY [ALL] channel, payload"
where the ALL means "just send the notification already, nobody cares
whether there's an identical one in the queue".

Likewise we could introduce a three-argument form of pg_notify(text, text,
bool) where the final argument is whether you are interested in removing
duplicates.

Optimising the remove-duplicates path is still probably a worthwhile
endeavour, but if the user really doesn't care at all about duplication, it
seems silly to force them to pay any performance price for a behaviour they
didn't want, no?

Cheers,
BJ