Re: [HACKERS] [PATCH] parallel & isolated makefile for plpython

2016-10-01 Thread Pavel Raiskup
Hi Tom,

On Saturday, October 1, 2016 12:23:03 PM CEST Tom Lane wrote:
> Hm, actually that's unnecessary because Makefile.global already
> established 'all' as the default target.  I'm inclined to think
> that the comment in Makefile.shlib is wrong and should be removed
> or at least rewritten, because I doubt it would work at all to
> include Makefile.shlib before Makefile.global; certainly that's
> not the intended usage.

correct analysis, the patch in git fixing this is also correct.  Thanks!

> > ! # gram.y=>gram.c.  Run `info make -n "Empty Recipes"` for more info.
>   
> I'm okay with pointing people at that part of the gmake docs, but don't
> care for insisting on exactly one way of accessing those docs.  How
> about something like "... See "Empty Recipes" in the GNU Make docs for
> more info."?
> 
> [ checks around... ]  Wait, actually there's a bigger problem: this
> advice is apparently gmake-version-specific.  I don't see any
> such section title in RHEL6's version of the gmake docs (make 3.81).
> I do find a section titled "Using Empty Commands", which looks like
> it might be the same thing, but a person looking for "Empty Recipes" is
> not going to find that.  Is it worth the trouble to provide this pointer
> if we need a lot of weasel wording about how the section name is spelled?

Valid points.  Let's forget about that patch.  I like pointing people to
'info' directly instead of googling for a link, but the PostgreSQL docs are
not the right place for that.

To the point: I was confused by two things.  First, (a) the docs claim that
there is no _correct_ (rather than, say, _portable_) way to write one
rule for two "targets", and (b) the comment for ';' ("The semicolon is
important, otherwise make will choose the built-in rule for ...") looked
too magical, without actually saying what it does.  I had wrongly
concentrated only on (b) before.

My first suggestion would be something along these lines:

  -# There is no correct way to write a rule that generates two files.
  +# This is a way to write a rule (for 'gram.c') that generates two files.

And the second:

  -# make, we must chain the dependencies like this.  The semicolon is
  -# important, otherwise make will choose the built-in rule for
  -# gram.y=>gram.c.
  +# make, we must chain the dependencies like this.  The semicolon (Empty
  +# Command/Recipe) is important, otherwise make will choose the built-in
  +# rule for gram.y=>gram.c.

FTR, also note that something like this could work:

  all: xxx.c xxx.h
  xxx%c xxx%h: xxx.txt
  	touch xxx.c xxx.h
  # info make -n "Pattern Intro", grep for 'multiple targets'

But note that we don't have to iterate on this here at all, because it was
probably just my problem (having a look at the make docs resolved it ...).

Thanks again for the fix in git!

Pavel





Re: [HACKERS] PATCH: two slab-like memory allocators

2016-10-01 Thread Tomas Vondra

On 10/02/2016 12:23 AM, John Gorman wrote:

I reproduced the quadratic pfree performance problem and verified that
these patches solved it.

The slab.c data structures and functions contain no quadratic components.

I noticed the sizing loop in SlabContextCreate() and came up with
a similar formula to determine chunksPerBlock that you arrived at.


Firstly, I've realized there's an issue when chunkSize gets too
large - once it exceeds blockSize, the SlabContextCreate() fails
as it's impossible to place a single chunk into the block. In
reorderbuffer, this may happen when the tuples (allocated in
tup_context) get larger than 8MB, as the context uses
SLAB_LARGE_BLOCK_SIZE (which is 8MB).

But maybe there's a simpler solution - we may simply cap the
chunkSize (in GenSlab) to ALLOC_CHUNK_LIMIT. That's fine, because
AllocSet handles those requests in a special way - for example
instead of tracking them in a freelist, those chunks get freed
immediately.


I like this approach because it fixes the performance problems
with smaller allocations and doesn't change how larger
allocations are handled.



One more comment about GenSlab, particularly about unpredictability of 
the repalloc() behavior. It works for "large" chunks allocated in the 
AllocSet part, and mostly does not work for "small" chunks allocated in 
the SlabContext. Moreover, the chunkSize changes over time, so for two 
chunks of the same size, repalloc() may work on one of them and fail on 
the other, depending on time of allocation.


With SlabContext it's perfectly predictable - the repalloc() call fails 
unless the chunk size is exactly the same as before (which is perhaps a 
bit pointless, but if we decide to fail even in this case it'll work 
100% of the time).


But with GenSlabContext it's unclear whether the call fails or not.

I don't like this unpredictability - I'd much rather have consistent 
failures (making sure people don't do repalloc() with GenSlab). But I 
don't see a nice way to achieve that - the repalloc() call does not go 
through GenSlabRealloc() at all, but directly to SlabRealloc() or 
AllocSetRealloc().


The best solution I can think of is adding an alternate version of 
AllocSetMethods, pointing to a different AllocSetRealloc implementation.
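
For illustration only, the idea of "consistent failures" could look
something like the sketch below: a per-context function table whose
realloc entry always errors out.  The struct and names here are made up
for the example, not PostgreSQL's actual MemoryContextMethods
definitions, and abort() stands in for elog(ERROR, ...).

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative function table; field names are placeholders. */
    typedef struct ContextMethods
    {
        void   *(*alloc) (size_t size);
        void   *(*realloc) (void *ptr, size_t size);
        void    (*free_p) (void *ptr);
    } ContextMethods;

    /* A realloc implementation that fails the same way every time. */
    static void *
    failing_realloc(void *ptr, size_t size)
    {
        (void) ptr;
        (void) size;
        fprintf(stderr, "realloc is not supported on this context\n");
        abort();                    /* stand-in for elog(ERROR, ...) */
    }

    /* Alternate methods table that differs only in its realloc entry. */
    static const ContextMethods failing_realloc_methods = {
        malloc,                     /* alloc: placeholder */
        failing_realloc,            /* realloc: always errors */
        free                        /* free_p: placeholder */
    };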


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-01 Thread Tomas Vondra

On 10/02/2016 02:17 AM, Andres Freund wrote:

Hi,

> ...



I think a crucial part of the benchmarking will be identifying and
measuring corner cases, e.g. a lot of rows with the same key, etc.
Although that probably is not a major issue for the two places
switched to the new implementation (e.g. execGrouping merges the
duplicates to a single group, by definition).


Hm. I don't really see a case where that's going to cause issues - all
the execGrouping.c users only store one key, and either ignore later
ones, or add the data to the initial tuple in the group.  I don't really
see any case where repeated keys should cause an issue for a hash table?



Yep, that's pretty much what I suggested. I was wondering more about the
other places where this hash table might be used - I've been thinking about
hashjoins, but now that I think of it - that's a fairly specialized and
tuned code. In any case, let's not complicate the discussion for now.


My point is that I don't really see any potential usecase where
that's an issue.



OK, I think we agree the two places modified by the submitted patch 
series are fine. Let's leave discussion about places modified by future 
patches for the future ;-)





A few comments, after quickly skimming through the first two patches:

1) SH_CREATE does this:

/* round up size to the next power of 2, eases lookups */
if (size < 2)
	size = 2;
else
	size = sh_pow2_int(size);

I very much doubt a hash table with 2 buckets is very useful. What about
using some reasonable lower boundary - IIRC hashjoins use 1024 buckets,
which might be a bit too much, but perhaps 256 would work?
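
(For reference, sh_pow2_int presumably rounds the requested size up to
the next power of two.  A minimal, generic sketch of that kind of
rounding - illustrative only, not the actual simplehash.h code:)

    #include <stdint.h>

    /* Round v up to the next power of 2; v must be at least 1. */
    static uint32_t
    pow2_round_up(uint32_t v)
    {
        v--;
        v |= v >> 1;
        v |= v >> 2;
        v |= v >> 4;
        v |= v >> 8;
        v |= v >> 16;
        return v + 1;
    }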


For some cases that'll waste a good bit of space. Consider
e.g. catcache.c. Callers can (and do) specify more appropriate sizes for
the individual uses.



Hmmm, OK. I have my doubts about those hardcoded constants, but you're 
right, 256 is probably overkill.





5) SH_RESIZE

Do we expect to shrink hash tables?


Not at the moment. Should be doable, but I'm not sure it's worth
developing and testing - I don't see any usecases in pg atm.



OK, sure. Then what about adding this to SH_RESIZE?

Assert(oldsize <= newsize);

I assumed we might shrink the catcache after invalidation, for example 
(but maybe I don't really know how that works).





It's not entirely clear why it is guaranteed that there's always
an element at its optimal position (when load factor < 1). Adding an
explanation or a link to a paper would be helpful, I guess.


As the load factor always has to be below 1, there always has to be
an empty bucket. Every bucket after an empty bucket has to be at
its optimal position.



Hmmm, thanks to moving the entries around after a delete? Adding an assert() 
enforcing this seems like a good idea.
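
(A rough sketch of the kind of check meant here, assuming a plain
hash % nbuckets bucket mapping and an illustrative entry layout - not
the actual simplehash.h structures:)

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct Entry
    {
        uint32_t    hash;       /* illustrative: full hash of the key */
        bool        in_use;     /* illustrative: is the bucket occupied? */
    } Entry;

    /*
     * The first occupied bucket following an empty bucket must hold an
     * entry that hashes directly to that bucket (its optimal position).
     */
    static void
    check_after_empty_invariant(const Entry *buckets, uint32_t nbuckets)
    {
        bool        prev_empty = false;
        uint32_t    i;

        for (i = 0; i < nbuckets; i++)
        {
            if (!buckets[i].in_use)
                prev_empty = true;
            else
            {
                if (prev_empty)
                    assert(buckets[i].hash % nbuckets == i);
                prev_empty = false;
            }
        }
    }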





7) Should hash agg size estimation for the planner consider the fillfactor?

I think we should account for fillfactor - we should account for the
allocated memory, even if there's free space in the hash table. After all,
that's what hashjoins do.


We didn't really do that for dynahash (as it basically assumed a fillfactor of 1
all the time), not sure if changing it won't also cause issues.



That's a case of glass half-full vs. half-empty, I guess. If we assumed 
load factor 1.0, then I see it as accounting for load factor (while you 
see it as not accounting for it).


I don't see why this should cause issues - of course, the hash table may 
be a bit larger, exceed work_mem and make the tidbitmap lossy, or switch 
HashAggregate to GroupAggregate. But I don't think that's a major issue, 
as those decisions depend on estimates anyway, so we can't really 
guarantee them.


Also, with load factor 0.8 the table is only ~20% larger, so even if we 
don't include that into work_mem it's a reasonably small error (easily 
smaller than errors in the pre-9.5 HashJoin accounting, which did not 
include chunk headers etc.).


So I won't fight for this, but I don't see why not to account for it.
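
(To make the accounting concrete, a toy sketch of an estimate that
includes the unused buckets implied by the fill factor; the entry size
and fill factor here are placeholders, not what the planner actually
uses:)

    #include <math.h>
    #include <stddef.h>

    /* Estimated bytes for a hash table holding nentries at a given fill factor. */
    static size_t
    estimate_hashtable_bytes(double nentries, size_t entry_size, double fillfactor)
    {
        double      nbuckets = ceil(nentries / fillfactor);

        /* e.g. 1M entries of 48 bytes at fillfactor 0.8 -> ~60 MB, not ~48 MB */
        return (size_t) (nbuckets * entry_size);
    }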

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] PATCH: two slab-like memory allocators

2016-10-01 Thread Tomas Vondra

On 10/02/2016 12:23 AM, John Gorman wrote:

I reproduced the quadratic pfree performance problem and verified
that these patches solved it.

The slab.c data structures and functions contain no quadratic
components.

I noticed the sizing loop in SlabContextCreate() and came up with a
similar formula to determine chunksPerBlock that you arrived at.



;-)


Firstly, I've realized there's an issue when chunkSize gets too
large - once it exceeds blockSize, the SlabContextCreate() fails
as it's impossible to place a single chunk into the block. In
reorderbuffer, this may happen when the tuples (allocated in
tup_context) get larger than 8MB, as the context uses
SLAB_LARGE_BLOCK_SIZE (which is 8MB).

But maybe there's a simpler solution - we may simply cap the
chunkSize (in GenSlab) to ALLOC_CHUNK_LIMIT. That's fine, because
AllocSet handles those requests in a special way - for example
instead of tracking them in a freelist, those chunks get freed
immediately.


I like this approach because it fixes the performance problems
with smaller allocations and doesn't change how larger
allocations are handled.



Right.


In slab.c it looks like a line in the top comments could be clearer.
Perhaps this is what is meant.

< *  (plus alignment), now wasting memory.
> *  (plus alignment), not wasting memory.


In slab.c some lines that are over 80 characters could be folded.

It would be nice to give each patch version a unique file name.



OK, will fix.


Nice patch, I enjoyed reading it!



;-)

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] PATCH: two slab-like memory allocators

2016-10-01 Thread Tomas Vondra

On 10/02/2016 01:53 AM, Jim Nasby wrote:

On 9/26/16 9:10 PM, Tomas Vondra wrote:

Attached is v2 of the patch, updated based on the review. That means:


+/* make sure the block can store at least one chunk (with 1B for a
bitmap)? */
(and the comment below it)

I find the question to be confusing... I think these would be better as

+/* make sure the block can store at least one chunk (with 1B for a
bitmap) */
and
+/* number of chunks can we fit into a block, including header and
bitmap */



Thanks, will rephrase the comments a bit.


I'm also wondering if a 1B freespace bitmap is actually useful. If
nothing else, there should probably be a #define for the initial
bitmap size.


That's not the point. The point is that we need to store at least one 
entry per block, with one bit in a bitmap. But we can't allocate just a 
single bit - we always allocate at least 1 byte. So it's more an 
explanation for the "1" literal in the check, nothing more.




+/* otherwise add it to the proper freelist bin */
Looks like something went missing... :)



Ummm? The patch contains this:

+   /* otherwise add it to the proper freelist bin */
+   if (set->freelist[block->nfree])
+   set->freelist[block->nfree]->prev = block;
+
+   block->next = set->freelist[block->nfree];
+   set->freelist[block->nfree] = block;

Which does exactly the thing it should do. Or what is missing?



Should zeroing the block in SlabAlloc be optional like it is with
palloc/palloc0?



Good catch. The memset(,0) should not be in SlabAlloc() at all, as the 
zeroing is handled in mctx.c, independently of the implementation.



+/*
+ * If there are no free chunks in any existing block, create a new
block
+ * and put it to the last freelist bucket.
+ */
+if (set->minFreeCount == 0)
Couldn't there be blocks that have a free count > minFreeCount? (In
which case you don't want to just alloc a new block, no?)

Nevermind, after reading further down I understand what's going on. I
got confused by "we've decreased nfree for a block with the minimum
count" until I got to the part that deals with minFreeCount = 0. I think
it'd be helpful if the "we've decreased nfree" comment mentioned that if
nfree is 0 we'll look for the correct value after updating the freelists.



Right. I think it'd be good to add an assert ensuring the minFreeCount 
value matches the block freelist. Or at least SlabCheck() could check this.



+block->bitmapptr[idx/8] |= (0x01 << (idx % 8));
Did you consider using a pre-defined map of values to avoid the shift? I
know I've seen that somewhere in the code...

Patch 2...

Doesn't GenSlabReset() need to actually free something? If the call
magically percolates to the other contexts it'd be nice to note that in
the comment.


No, the other contexts are created as children of GenSlab, so the memory 
context infrastructure resets them automatically. I don't think this 
needs to be mentioned in the comments - it's a pretty basic part of the 
parent-child context relationship.


Thanks!

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-01 Thread Andres Freund
Hi,


On 2016-10-02 01:37:35 +0200, Tomas Vondra wrote:
> On 10/01/2016 09:59 PM, Andres Freund wrote:
> >>What about running a bigger benchmark - say, TPC-DS - and evaluating
> >>the impact?
> >
> >Worthwhile, although obviously the impact will be a lot smaller,
> >since they're not entirely bottlenecked on hash-aggs and bitmap
> >scans.
> >
> 
> Sure, the improvement won't be as significant as on the simple queries, but
> it's interesting IMHO.

Agreed.


> >>I think a crucial part of the benchmarking will be identifying and
> >>measuring corner cases, e.g. a lot of rows with the same key, etc.
> >>Although that probably is not a major issue for the two places
> >>switched to the new implementation (e.g. execGrouping merges the
> >>duplicates to a single group, by definition).
> >
> >Hm. I don't really see a case where that's going to cause issues - all
> >the execGrouping.c users only store one key, and either ignore later
> >ones, or add the data to the initial tuple in the group.  I don't really
> >see any case where repeated keys should cause an issue for a hash table?
> >
> 
> Yep, that's pretty much what I suggested. I was wondering more about the
> other places where this hash table might be used - I've been thinking about
> hashjoins, but now that I think of it - that's a fairly specialized and
> tuned code. In any case, let's not complicate the discussion for now.

My point is that I don't really see any potential usecase where that's
an issue.


> A few comments, after quickly skimming through the first two patches:
> 
> 1) SH_CREATE does this:
> 
>   /* round up size to the next power of 2, eases lookups */
>   if (size < 2)
>   	size = 2;
>   else
>   	size = sh_pow2_int(size);
> 
> I very much doubt a hash table with 2 buckets is very useful. What about
> using some reasonable lower boundary - IIRC hashjoins use 1024 buckets,
> which might be a bit too much, but perhaps 256 would work?

For some cases that'll waste a good bit of space. Consider
e.g. catcache.c. Callers can (and do) specify more appropriate sizes for
the individual uses.


> 2) tb->resize_above
> 
> I'd say 'resize_threshold' is a better name. Also, 0.8 should be defined as
> a constant somewhere (not sure if it makes sense to make it part of SH_TYPE,
> so that hash tables may use different load factors).

Fair point.


> 3) 'size' is a rather poor name for parameter (e.g. in SH_CREATE or
> SH_RESIZE), as it's unclear whether it's 'element size in bytes', 'total
> size in bytes' or 'number of entries'.

Ok.


> 4) SH_CONTAINS sounds more like a function checking whether a hash table
> contains a key, not as a 'type of hash table entry'. What about
> SH_ENTRY_TYPE instead? Also, SH_KEYTYPE should be SH_KEY_TYPE I guess.

Hm, ok.


> 5) SH_RESIZE
> 
> Do we expect to shrink hash tables?

Not at the moment. Should be doable, but I'm not sure it's worth
developing and testing - I don't see any usecases in pg atm.


> It's not entirely clear why it is guaranteed that there's always an element
> at its optimal position (when load factor < 1). Adding an explanation or a
> link to a paper would be helpful, I guess.

As the load factor always has to be below 1, there always has to be an
empty bucket. Every bucket after an empty bucket has to be at its
optimal position.

> 
> 6) SH_STAT
> 
> This bit is a bit suspicious:
> 
> uint32 max_collisions = 0;
> 
> ...
> 
> /* single contained element is not a collision */
> curcoll--;
> total_collisions += curcoll;
> if (curcoll > max_collisions)
> 	max_collisions = curcoll - 1;
> 
> Shouldn't the last line be just "max_collisions = curcoll"?

Hm. That's probably a refactoring artefact. Nice catch.


> 7) Should hash agg size estimation for the planner consider the fillfactor?
> 
> I think we should account for fillfactor - we should account for the
> allocated memory, even if there's free space in the hash table. After all,
> that's what hashjoins do.

We didn't really do that for dynahash (as it basically assumed a fillfactor of 1
all the time), not sure if changing it won't also cause issues.


Thanks!


Greetings,

Andres Freund




Re: [HACKERS] PATCH: two slab-like memory allocators

2016-10-01 Thread Jim Nasby

On 9/26/16 9:10 PM, Tomas Vondra wrote:

Attached is v2 of the patch, updated based on the review. That means:


+	/* make sure the block can store at least one chunk (with 1B for a 
bitmap)? */

(and the comment below it)

I find the question to be confusing... I think these would be better as

+	/* make sure the block can store at least one chunk (with 1B for a 
bitmap) */

and
+	/* number of chunks can we fit into a block, including header and 
bitmap */


I'm also wondering if a 1B freespace bitmap is actually useful. If 
nothing else, there should probably be a #define for the initial bitmap 
size.


+   /* otherwise add it to the proper freelist bin */
Looks like something went missing... :)


Should zeroing the block in SlabAlloc be optional like it is with 
palloc/palloc0?


+   /*
+* If there are no free chunks in any existing block, create a new block
+* and put it to the last freelist bucket.
+*/
+   if (set->minFreeCount == 0)
Couldn't there be blocks that have a free count > minFreeCount? (In 
which case you don't want to just alloc a new block, no?)


Nevermind, after reading further down I understand what's going on. I 
got confused by "we've decreased nfree for a block with the minimum 
count" until I got to the part that deals with minFreeCount = 0. I think 
it'd be helpful if the "we've decreased nfree" comment mentioned that if 
nfree is 0 we'll look for the correct value after updating the freelists.


+   block->bitmapptr[idx/8] |= (0x01 << (idx % 8));
Did you consider using a pre-defined map of values to avoid the shift? I 
know I've seen that somewhere in the code...
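
(The pre-defined map of values mentioned above might look something like
this - a sketch of the idea, not a pointer to the exact table that exists
somewhere in the tree:)

    #include <stdint.h>

    /* One mask per bit position within a byte. */
    static const uint8_t bitmask[8] = {
        0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80
    };

    /* Equivalent to bitmapptr[idx / 8] |= (0x01 << (idx % 8)), without the shift. */
    static inline void
    bitmap_set(uint8_t *bitmapptr, int idx)
    {
        bitmapptr[idx / 8] |= bitmask[idx % 8];
    }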


Patch 2...

Doesn't GenSlabReset() need to actually free something? If the call 
magically percolates to the other contexts it'd be nice to note that in 
the comment.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461




Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-01 Thread Tomas Vondra

On 10/01/2016 09:59 PM, Andres Freund wrote:

Hi,

On 2016-10-01 20:19:21 +0200, Tomas Vondra wrote:

On 10/01/2016 02:44 AM, Andres Freund wrote:

Hi,

On 2016-07-26 17:43:33 -0700, Andres Freund wrote:

In the attached patch I've attached simplehash.h, which can be
customized by a bunch of macros, before being inlined.  There's also a
patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
execGrouping.c.


Attached is a significantly improved version of this series.  The main
changes are:

- The hash table now uses robin-hood style hash collision handling. See the
  commit message of 0002 (or simplehash.h) for an introduction. That
  significantly reduces the impact of "clustering" due to linear
  addressing.


Interesting. Have you looked at cuckoo hashing?


Yes. I don't think it's a good fit for modern CPUs. Because it
doesn't probe linearly you constantly have random uncached memory
accesses.

I've played with a few schemes, and they all seemed to be slower
than linear probing deviations, unless you intentionally used a
wrong hash-distribution, or intentionally "unbalanced" the hash-space
by iterating over either end of the keyspace and moved the entries
around.


OK, understood.




- Significant comment and correctness fixes, both in simplehash.h
  itself, and 0003/0004.
- a lot of other random performance improvements for the hashing code.


Unfortunately I'm running out battery right now, so I don't want to
re-run the benchmarks posted upthread (loading takes a while). But the
last time I ran them all the results after the patches were better than
before.



Well, I have rather bad experience with running benchmarks on laptops anyway
- a lot of noise due to power management, etc.


Well, with factor ~2x improvements that's not really a big detractor.
Using a new CPU makes some forms of analysis easier (better PMUs).



True.

>

For longrunning tests I agree.



What about running a bigger benchmark - say, TPC-DS - and evaluating
the impact?


Worthwhile, although obviously the impact will be a lot smaller,
since they're not entirely bottlenecked on hash-aggs and bitmap
scans.



Sure, the improvement won't be as significant as on the simple queries, 
but it's interesting IMHO.





I think a crucial part of the benchmarking will be identifying and
measuring corner cases, e.g. a lot of rows with the same key, etc.
Although that probably is not a major issue for the two places
switched to the new implementation (e.g. execGrouping merges the
duplicates to a single group, by definition).


Hm. I don't really see a case where that's going to cause issues - all
the execGrouping.c users only store one key, and either ignore later
ones, or add the data to the initial tuple in the group.  I don't really
see any case where repeated keys should cause an issue for a hash table?



Yep, that's pretty much what I suggested. I was wondering more about the 
other places where this hash table might be used - I've been thinking 
about hashjoins, but now that I think of it - that's a fairly 
specialized and tuned code. In any case, let's not complicate the 
discussion for now.





This patch series currently consists out of four patches:
- 0001 - add unlikely/likely() macros. The use here is not entirely
  mandatory, but they do provide a quite consistent improvement. There
  are other threads where they proved to be beneficial, so I see little
  reason not to introduce them.
- 0002 - the customizable hashtable itself. It now contains documentation.
- 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix
  for the issue mentioned in [1], to improve performance in heavily lossy
  scenarios (otherwise we could regress in some extreme cases)
- 0004 - convert execGrouping.c, e.g. used by hash aggregates


While not quite perfect yet, I do think this is at a state where input
is needed.  I don't want to go a lot deeper into this rabbit hole,
before we have some agreement that this is a sensible course of action.



So, is it the right time to do some benchmarking?


That's one thing that's required, yes. The other is whether we can live
with the ugliness of implementing templates via macros. I do think we
can, but...



Hmmm ... not sure. If one of the points is to get rid of function calls 
determined at runtime (which make it impossible for the compiler to 
optimize the code), then I can think of three options:


(a) copy-paste the code for each place
(b) use some templating
(c) use JIT

I think (b) is way better than (a), and I don't think we're using JIT 
anywhere at this point. So while the macro-based templates look a bit 
awkward, I'm not aware of a better C thing.


A few comments, after quickly skimming through the first two patches:

1) SH_CREATE does this:

/* round up size to the next power of 2, eases lookups */
if (size < 2)
	size = 2;
else
	size = sh_pow2_int(size);

I very much doubt a hash table with 2 buckets is very useful.

Re: [HACKERS] PATCH: two slab-like memory allocators

2016-10-01 Thread John Gorman
I reproduced the quadratic pfree performance problem and verified that
these patches solved it.

The slab.c data structures and functions contain no quadratic components.

I noticed the sizing loop in SlabContextCreate() and came up with
a similar formula to determine chunksPerBlock that you arrived at.
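
(For the record, a closed-form version of that sizing could look roughly
like the sketch below: each chunk costs chunkSize bytes plus one bitmap
bit, and the block header takes headerSize bytes.  This is only an
illustration of the idea, not the formula from the patch.)

    #include <stddef.h>

    /* How many chunks fit in a block, counting the per-chunk bitmap bit. */
    static int
    chunks_per_block(size_t blockSize, size_t chunkSize, size_t headerSize)
    {
        size_t      availBits = (blockSize - headerSize) * 8;

        return (int) (availBits / (chunkSize * 8 + 1));
    }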

> Firstly, I've realized there's an issue when chunkSize gets too
> large - once it exceeds blockSize, the SlabContextCreate() fails
> as it's impossible to place a single chunk into the block. In
> reorderbuffer, this may happen when the tuples (allocated in
> tup_context) get larger than 8MB, as the context uses
> SLAB_LARGE_BLOCK_SIZE (which is 8MB).
>
> But maybe there's a simpler solution - we may simply cap the
> chunkSize (in GenSlab) to ALLOC_CHUNK_LIMIT. That's fine, because
> AllocSet handles those requests in a special way - for example
> instead of tracking them in a freelist, those chunks get freed
> immediately.

I like this approach because it fixes the performance problems
with smaller allocations and doesn't change how larger
allocations are handled.

In slab.c it looks like a line in the top comments could be clearer.
Perhaps this is what is meant.

< *  (plus alignment), now wasting memory.
> *  (plus alignment), not wasting memory.

In slab.c some lines that are over 80 characters could be folded.

It would be nice to give each patch version a unique file name.

Nice patch, I enjoyed reading it!

Best, John

John Gorman

On Mon, Sep 26, 2016 at 10:10 PM, Tomas Vondra  wrote:

> Hi,
>
> Attached is v2 of the patch, updated based on the review. That means:
>
> - Better comment explaining how free chunks are tracked in Slab context.
>
> - Removed the unused SlabPointerIsValid macro.
>
> - Modified the comment before SlabChunkData, explaining how it relates
>   to StandardChunkHeader.
>
> - Replaced the two Assert() calls with elog().
>
> - Implemented SlabCheck(). I've ended up with quite a few checks there,
>   checking pointers between the context, block and chunks, checks due
>   to MEMORY_CONTEXT_CHECKING etc. And of course, cross-checking the
>   number of free chunks (bitmap, freelist vs. chunk header).
>
> - I've also modified SlabContextCreate() to compute chunksPerBlock a
>   bit more efficiently (use a simple formula instead of the loop, which
>   might be a bit too expensive for large blocks / small chunks).
>
>
> I haven't done any changes to GenSlab, but I do have a few notes:
>
> Firstly, I've realized there's an issue when chunkSize gets too large -
> once it exceeds blockSize, the SlabContextCreate() fails as it's impossible
> to place a single chunk into the block. In reorderbuffer, this may happen
> when the tuples (allocated in tup_context) get larger than 8MB, as the
> context uses SLAB_LARGE_BLOCK_SIZE (which is 8MB).
>
> For Slab the elog(ERROR) is fine as both parameters are controlled by the
> developer directly, but GenSlab computes the chunkSize on the fly, so we
> must not let it fail like that - that'd result in unpredictable failures,
> which is not very nice.
>
> I see two ways to fix this. We may either increase the block size
> automatically - e.g. instead of specifying chunkSize and
> blockSize when creating the Slab, specify chunkSize and chunksPerBlock (and
> then choose the smallest 2^k block large enough). For example with
> chunkSize=96 and chunksPerBlock=1000, we'd get 128kB blocks, as that's the
> closest 2^k block larger than 96000 bytes.
>
> But maybe there's a simpler solution - we may simply cap the chunkSize (in
> GenSlab) to ALLOC_CHUNK_LIMIT. That's fine, because AllocSet handles those
> requests in a special way - for example instead of tracking them in a
> freelist, those chunks get freed immediately.
>
>
> regards
>
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>
>
>


Re: [HACKERS] VACUUM's ancillary tasks

2016-10-01 Thread Thomas Munro
On Sun, Oct 2, 2016 at 9:34 AM, Vik Fearing  wrote:
> On 10/01/2016 09:28 AM, Thomas Munro wrote:
>> On Mon, Aug 29, 2016 at 1:26 PM, Vik Fearing  wrote:
>>> The attached two patches scratch two itches I've been having for a
>>> while.  I'm attaching them together because the second depends on the first.
>>>
>>> Both deal with the fact that [auto]vacuum has taken on more roles than
>>> its original purpose.
>>>
>>>
>>> Patch One: autovacuum insert-heavy tables
>>>
>>> If you have a table that mostly receives INSERTs, it will never get
>>> vacuumed because there are no (or few) dead rows.  I have added an
>>> "inserts_since_vacuum" field to PgStat_StatTabEntry which works exactly
>>> the same way as "changes_since_analyze" does.
>>>
>>> The reason such a table needs to be vacuumed is currently twofold: the
>>> visibility map is not updated, slowing down index-only scans; and BRIN
>>> indexes are not maintained, rendering them basically useless.
>>
>> I'm aware of those two problems, but not very familiar with the
>> details.  I don't feel qualified to say whether insert counting is the
>> best approach to the problem at this point.  I looked into it a little
>> bit however, and had the following thoughts:
>>
>> About BRIN indexes:  I couldn't find an explanation of why BRIN
>> indexes don't automatically create new summary tuples when you insert
>> a new tuple in an unsummarised page range.  Is it deferred until
>> VACUUM time in order to untangle some otherwise unresolvable
>> interlocking or crash safety problem, or could that one day be done?
>> Assuming that it must be deferred for some technical reason and there
>> is no way around it, then I wonder if there is a more direct and
>> accurate way to figure out when it's necessary than counting inserts.
>> Counting inserts seems slightly bogus because you can't tell whether
>> those were inserts into an existing summarised block which is
>> self-maintaining or not.  At first glance it looks a bit like
>> unsummarised ranges can only appear at the end of the table, is that
>> right?  If so, couldn't you detect the number of unsummarised BRIN
>> blocks just by comparing the highest summarised BRIN block and the
>> current heap size?
>>
>> About visibility maps:  How crazy would it be to estimate the number
>> of not-all-visible pages instead?  It would be less work to count that
>> since it would only increase when the *first* tuple is inserted into a
>> page that is currently all visible (ie when the bit is cleared), not
>> for every tuple inserted into any page like your inserts_since_vacuum
>> counter.  Another difference is that inserts_since_vacuum is reset
>> even if vacuum finds that it *can't* set the all-visible bit for a
>> given page yet because of some concurrent transaction.  In that case
>> the bit is still not set but autovacuum has no reason to be triggered
>> again.
>
> Sure, I could handle each case separately, but the goal of this patch
> (as hinted at in the Subject) is to generalize all the different tasks
> we've been giving to VACUUM.  The only missing piece is what the first
> patch addresses; which is insert-mostly tables never getting vacuumed.

Yeah, that makes sense.  I just wanted to discuss what the ideal
launch conditions would be for those particular ancillary jobs, and
then figure out whether the difference matters.  Generally, I think
changes to autovacuum heuristics need some consensus-building
discussion, especially in light of other related ideas from Jeff
Janes, and from people involved with BRIN and visibility map design,
including Simon who signed up as a reviewer.  Since we're out of time
I'm going to move this to the November CF, and let's hear from them.

-- 
Thomas Munro
http://www.enterprisedb.com




[HACKERS] Contains and is contained by operators of inet datatypes

2016-10-01 Thread Emre Hasegeli
The attached patch adds the <@, @>, <<@, and @>> operator symbols for the
inet datatypes to replace <<=, >>=, <<, and >>.  The <@ and @> symbols are
used for containment by all datatypes except inet, notably the geometric
types and arrays, and the cube, hstore, intarray, and ltree extensions.

The <@ and @> symbols were standardised in version 8.2 by Tom Lane in
2006 [1].  The previous symbols were left in place but deprecated.  The
patch does exactly the same for the inet datatypes.

<< and >> are the standard symbols for the "strictly left of" and
"strictly right of" operators.  Those operators would also make sense
for the inet datatypes.  If we make this change now, we can remove the
old symbols and reuse them for new operators in the distant future.

The patch removes the recently committed SP-GiST index support for the
existing operator symbols, to give users more reason to switch to the
new symbols.  This change will also indirectly deprecate the
undocumented, non-transparent btree index support that sometimes works
for some of the existing operators [2].

The patch includes changes to the regression tests and the
documentation.  I will add it to the 2016-11 Commitfest.

[1] https://www.postgresql.org/message-id/14277.1157304...@sss.pgh.pa.us
[2] https://www.postgresql.org/message-id/389.1042992...@sss.pgh.pa.us


0001-inet-contain-op-v1.patch
Description: Binary data



Re: [HACKERS] VACUUM's ancillary tasks

2016-10-01 Thread Vik Fearing
On 10/01/2016 09:28 AM, Thomas Munro wrote:
> On Mon, Aug 29, 2016 at 1:26 PM, Vik Fearing  wrote:
>> The attached two patches scratch two itches I've been having for a
>> while.  I'm attaching them together because the second depends on the first.
>>
>> Both deal with the fact that [auto]vacuum has taken on more roles than
>> its original purpose.
>>
>>
>> Patch One: autovacuum insert-heavy tables
>>
>> If you have a table that mostly receives INSERTs, it will never get
>> vacuumed because there are no (or few) dead rows.  I have added an
>> "inserts_since_vacuum" field to PgStat_StatTabEntry which works exactly
>> the same way as "changes_since_analyze" does.
>>
>> The reason such a table needs to be vacuumed is currently twofold: the
>> visibility map is not updated, slowing down index-only scans; and BRIN
>> indexes are not maintained, rendering them basically useless.
> 
> I'm aware of those two problems, but not very familiar with the
> details.  I don't feel qualified to say whether insert counting is the
> best approach to the problem at this point.  I looked into it a little
> bit however, and had the following thoughts:
> 
> About BRIN indexes:  I couldn't find an explanation of why BRIN
> indexes don't automatically create new summary tuples when you insert
> a new tuple in an unsummarised page range.  Is it deferred until
> VACUUM time in order to untangle some otherwise unresolvable
> interlocking or crash safety problem, or could that one day be done?
> Assuming that it must be deferred for some technical reason and there
> is no way around it, then I wonder if there is a more direct and
> accurate way to figure out when it's necessary than counting inserts.
> Counting inserts seems slightly bogus because you can't tell whether
> those were inserts into an existing summarised block which is
> self-maintaining or not.  At first glance it looks a bit like
> unsummarised ranges can only appear at the end of the table, is that
> right?  If so, couldn't you detect the number of unsummarised BRIN
> blocks just by comparing the highest summarised BRIN block and the
> current heap size?
> 
> About visibility maps:  How crazy would it be to estimate the number
> of not-all-visible pages instead?  It would be less work to count that
> since it would only increase when the *first* tuple is inserted into a
> page that is currently all visible (ie when the bit is cleared), not
> for every tuple inserted into any page like your inserts_since_vacuum
> counter.  Another difference is that inserts_since_vacuum is reset
> even if vacuum finds that it *can't* set the all-visible bit for a
> given page yet because of some concurrent transaction.  In that case
> the bit is still not set but autovacuum has no reason to be triggered
> again.

Sure, I could handle each case separately, but the goal of this patch
(as hinted at in the Subject) is to generalize all the different tasks
we've been giving to VACUUM.  The only missing piece is what the first
patch addresses; which is insert-mostly tables never getting vacuumed.

As for the second patch, I would like to withdraw it and redesign it,
based on your comments.  The redesign I have in mind will no longer be
dependent on the first patch, so I'll submit it separately.
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support




Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-01 Thread Andres Freund
On 2016-10-01 20:04:18 +0100, Greg Stark wrote:
> On Sat, Oct 1, 2016 at 1:44 AM, Andres Freund  wrote:
> >
> > Unfortunately I'm running out battery right now, so I don't want to
> > re-run the benchmarks posted upthread (loading takes a while). But the
> > last time I ran them all the results after the patches were better than
> > before.

> I have a machine sitting idle now too if you have specific ideas of
> what to benchmark.

Last time I just extracted interesting parts of queries from tpc-h (to
be able to look at bitmapscans/hash-aggs in isolation) and ran tpc-h as
a whole.  During development I also tried to come up with adverse cases
(e.g. huge bitmapscans, with very low work_mem).  I don't really have a
lot better ideas than that.

Andres




Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-01 Thread Andres Freund
Hi,

On 2016-10-01 20:19:21 +0200, Tomas Vondra wrote:
> On 10/01/2016 02:44 AM, Andres Freund wrote:
> > Hi,
> > 
> > On 2016-07-26 17:43:33 -0700, Andres Freund wrote:
> > > In the attached patch I've attached simplehash.h, which can be
> > > customized by a bunch of macros, before being inlined.  There's also a
> > > patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
> > > execGrouping.c.
> > 
> > Attached is a significantly improved version of this series.  The main
> > changes are:
> > 
> > - The hash table now uses robin-hood style hash collision handling. See the
> >   commit message of 0002 (or simplehash.h) for an introduction. That
> >   significantly reduces the impact of "clustering" due to linear
> >   addressing.
> 
> Interesting. Have you looked at cuckoo hashing?

Yes. I don't think it's a good fit for modern CPUs. Because it doesn't
probe linearly you constantly have random uncached memory accesses.

I've played with a few schemes, and they all seemed to be slower than
linear probing deviations, unless you intentionally used a wrong
hash-distribution, or intentionally "unbalanced" the hash-space by
iterating over either end of the keyspace and moved the entries around.


> > - Significant comment and correctness fixes, both in simplehash.h
> >   itself, and 0003/0004.
> > - a lot of other random performance improvements for the hashing code.
> > 
> > 
> > Unfortunately I'm running out battery right now, so I don't want to
> > re-run the benchmarks posted upthread (loading takes a while). But the
> > last time I ran them all the results after the patches were better than
> > before.
> > 
> 
> Well, I have rather bad experience with running benchmarks on laptops anyway
> - a lot of noise due to power management, etc.

Well, with factor ~2x improvements that's not really a big
detractor. Using a new CPU makes some forms of analysis easier (better
PMUs).

For longrunning tests I agree.


> What about running a bigger benchmark - say, TPC-DS - and evaluating
> the impact?

Worthwhile, although obviously the impact will be a lot smaller, since
they're not entirely bottlenecked on hash-aggs and bitmap scans.


> I think a crucial part of the benchmarking will be identifying and measuring
> corner cases, e.g. a lot of rows with the same key, etc.
> Although that probably is not a major issue for the two places
> switched to the new implementation (e.g. execGrouping merges the
> duplicates to a single group, by definition).

Hm. I don't really see a case where that's going to cause issues - all
the execGrouping.c users only store one key, and either ignore later
ones, or add the data to the initial tuple in the group.  I don't really
see any case where repeated keys should cause an issue for a hash table?


> > This patch series currently consists out of four patches:
> > - 0001 - add unlikely/likely() macros. The use here is not entirely
> >   mandatory, but they do provide a quite consistent improvement. There
> >   are other threads where they proved to be beneficial, so I see little
> >   reason not to introduce them.
> > - 0002 - the customizable hashtable itself. It now contains documentation.
> > - 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix
> >   for the issue mentioned in [1], to improve performance in heavily lossy
> >   scenarios (otherwise we could regress in some extreme cases)
> > - 0004 - convert execGrouping.c, e.g. used by hash aggregates
> > 
> > 
> > While not quite perfect yet, I do think this is at a state where input
> > is needed.  I don't want to go a lot deeper into this rabbit hole,
> > before we have some agreement that this is a sensible course of action.
> > 
> 
> So, is it the right time to do some benchmarking?

That's one thing that's required, yes. The other is whether we can live
with the ugliness of implementing templates via macros. I do think we
can, but...

Greetings,

Andres Freund




Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-01 Thread Greg Stark
On Sat, Oct 1, 2016 at 1:44 AM, Andres Freund  wrote:
>
> Unfortunately I'm running out battery right now, so I don't want to
> re-run the benchmarks posted upthread (loading takes a while). But the
> last time I ran them all the results after the patches were better than
> before.


I have a machine sitting idle now too if you have specific ideas of
what to benchmark.

-- 
greg




Re: [HACKERS] Hash Indexes

2016-10-01 Thread Greg Stark
On Fri, Sep 30, 2016 at 2:11 AM, Robert Haas  wrote:
> For one thing, we can stop shipping a totally broken feature in release after 
> release

For what it's worth I'm for any patch that can accomplish that.

In retrospect I think we should have done the hash-over-btree thing
ten years ago but we didn't and if Amit's patch makes hash indexes
recoverable today then go for it.

-- 
greg




Re: [HACKERS] Macro customizable hashtable / bitmapscan & aggregation perf

2016-10-01 Thread Tomas Vondra

On 10/01/2016 02:44 AM, Andres Freund wrote:

Hi,

On 2016-07-26 17:43:33 -0700, Andres Freund wrote:

In the attached patch I've attached simplehash.h, which can be
customized by a bunch of macros, before being inlined.  There's also a
patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
execGrouping.c.


Attached is a significantly improved version of this series.  The main
changes are:

- The hash table now uses robin-hood style hash collision handling. See the
  commit message of 0002 (or simplehash.h) for an introduction. That
  significantly reduces the impact of "clustering" due to linear
  addressing.


Interesting. Have you looked at cuckoo hashing? That seems to be the 
go-to hashing in several database-related papers I've read recently, so 
I guess there's a reason for that (IIRC it has constant worst case for 
lookups, constant amortized time for inserts/deletes). Not sure how it 
compares to robin-hood hashing regarding cache-friendliness though.



- Significant comment and correctness fixes, both in simplehash.h
  itself, and 0003/0004.
- a lot of other random performance improvements for the hashing code.


Unfortunately I'm running out battery right now, so I don't want to
re-run the benchmarks posted upthread (loading takes a while). But the
last time I ran them all the results after the patches were better than
before.



Well, I have rather bad experience with running benchmarks on laptops 
anyway - a lot of noise due to power management, etc. What about running 
a bigger benchmark - say, TPC-DS - and evaluating the impact?


I think a crucial part of the benchmarking will be identifying and 
measuring corner cases, e.g. a lot of rows with the same key, etc. 
Although that probably is not a major issue for the two places switched 
to the new implementation (e.g. execGrouping merges the duplicates to a 
single group, by definition).




This patch series currently consists out of four patches:
- 0001 - add unlikely/likely() macros. The use here is not entirely
  mandatory, but they do provide a quite consistent improvement. There
  are other threads where they proved to be beneficial, so I see little
  reason not to introduce them.
- 0002 - the customizable hashtable itself. It now contains documentation.
- 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix
  for the issue mentioned in [1], to improve performance in heavily lossy
  scenarios (otherwise we could regress in some extreme cases)
- 0004 - convert execGrouping.c, e.g. used by hash aggregates


While not quite perfect yet, I do think this is at a state where input
is needed.  I don't want to go a lot deeper into this rabbit hole,
before we have some agreement that this is a sensible course of action.



So, is it the right time to do some benchmarking?



I think there's several possible additional users of simplehash.h,
e.g. catcache.c - which has an open coded, not particularly fast hash
table and frequently shows up in profiles - but I think the above two
conversions are plenty to start with.



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [GENERAL] pg_upgrade from 9.5 to 9.6 fails with "invalid argument"

2016-10-01 Thread Tom Lane
Andrew Dunstan  writes:
> On 09/30/2016 12:24 PM, Tom Lane wrote:
>> Seems to be some additional prep work needed somewhere ...
>> No upgrade_install_root at 
>> /home/bfarm/bf-scripts/build-farm-4.17/PGBuild/Modules/TestUpgradeXversion.pm
>>  line 51.

> Oh sorry, you also need an entry for that in the config file. Mine has:
> upgrade_install_root => '/home/bf/bfr/root/upgrade',

Yesterday's runs of prairiedog had this turned on.  I'm not sure how
to interpret the output (because there isn't any, ahem), but it does
not appear that the test detected anything wrong.

I was able to demonstrate the problem, and that my patch of yesterday
fixed it, by building a table in 9.5 that was all-visible, deleting a few
widely scattered tuples, and then pg_upgrading.  It wasn't that easy to
exhibit a query failure, though --- my first attempt failed because I'd
done an index-only scan in 9.5 before upgrading, and that had the side
effect of marking the index tuples dead, making the same query in HEAD
produce correct answers despite having invalid vismap state.  If we wanted
a regression test, it seems like we would have to build a custom test
specifically designed to detect this type of problem, and I'm not sure
it's worth it.

regards, tom lane




Re: [HACKERS] Show hash / bitmap sizes in EXPLAIN ANALYZE?

2016-10-01 Thread Tomas Vondra

On 10/01/2016 01:37 AM, Andres Freund wrote:

Hi,

At the moment in-memory sort and hash nodes show their memory usage in
explain:
│   ->  Sort  (cost=59.83..62.33 rows=1000 width=4) (actual time=0.512..0.632 rows=1000 loops=1) │
│         Sort Key: a.a │
│         Sort Method: quicksort  Memory: 71kB │
│         ->  Function Scan on generate_series a  (cost=0.00..10.00 rows=1000 width=4) (actual time=0.165..0.305 rows=1000 loops=1) │
and
│   ->  Hash  (cost=10.00..10.00 rows=1000 width=4) (actual time=0.581..0.581 rows=1000 loops=1) │
│         Buckets: 1024  Batches: 1  Memory Usage: 44kB │

I think we should show something similar for bitmap scans, and for
some execGrouping.c users (at least hash aggregates, subplans and
setop seem good candidates too).



+1 to improve this


For both categories it's useful to see how close to work_mem a
scan ended up being (to understand how high to set it, and how much
the data can grow until work_mem is exceeded), and for execGrouping.c
users it's also very interesting to see the actual memory usage
because the limit is only a very soft one.

Does anybody see a reason not to add that?



Well, the obvious problem with execGrouping.c is that we don't have 
information about memory usage - we don't know how large the aggregate 
state is. It's trivial to compute it for aggregates that use 
fixed-length data types, but for aggregates that use varlena/internal 
state that's not going to work.


This is actually the same problem Jeff Davis ran into when trying to 
implement memory-bounded HashAgg ~2 years ago, which also needs this 
information. Back then there was a lot of discussion about whether the 
~1% measured penalty is an acceptable price for the accounting, which kinda 
killed the whole patch.


I plan to revisit that hashagg patch, or rather a new patch with the 
same goal - now that we have serial/deserial functions for aggregates, 
we should be able to implement much nicer spill-to-disk method. But 
that'll need the memory accounting, so if you want to look into it, 
you're welcome.
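
(A toy sketch of the kind of accounting in question - a running byte
counter maintained by the allocator itself, so asking "how much memory
does this context hold?" is cheap.  This is an illustration only, not
the design from that patch:)

    #include <stdlib.h>

    typedef struct TrackedContext
    {
        size_t      total_bytes;    /* bytes currently allocated here */
    } TrackedContext;

    typedef struct TrackedChunk
    {
        size_t      size;           /* header remembers the request size */
    } TrackedChunk;

    static void *
    tracked_alloc(TrackedContext *ctx, size_t size)
    {
        TrackedChunk *chunk = malloc(sizeof(TrackedChunk) + size);

        if (chunk == NULL)
            return NULL;
        chunk->size = size;
        ctx->total_bytes += size;
        return chunk + 1;           /* user data follows the header */
    }

    static void
    tracked_free(TrackedContext *ctx, void *ptr)
    {
        TrackedChunk *chunk = (TrackedChunk *) ptr - 1;

        ctx->total_bytes -= chunk->size;
        free(chunk);
    }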


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Hash Indexes

2016-10-01 Thread ktm

Andres Freund :


On 2016-09-30 17:39:04 +0100, Peter Geoghegan wrote:

On Fri, Sep 30, 2016 at 4:46 PM, Robert Haas  wrote:
> I would just be very disappointed if, after the amount of work that
> Amit and others have put into this project, the code gets rejected
> because somebody thinks a different project would have been more worth
> doing.

I wouldn't presume to tell anyone else how to spend their time, and am
not concerned about this making the hash index code any less useful
from the user's perspective.


Me neither.

I'm concerned that this is a heck of a lot of work, and I don't think
we've reached the end of it by a good bit. I think it would have been, and
probably still is, a more efficient use of time to go for the
hash-via-btree method, and rip out the current hash indexes.  But that's
just me.

I find it more than a bit odd to be accused of trying to waste others'
time by saying this, and to be told that this is too late because time
has already been invested. Especially the latter has never been a standard
in the community, and while that is excruciatingly painful when one is the
person(s) having invested the time, it probably shouldn't be.



> The fact that we have hash indexes already and cannot remove them
> because too much other code depends on hash opclasses is also, in my
> opinion, a sufficiently good reason to pursue improving them.

I think that Andres was suggesting that hash index opclasses would be
usable with hash-over-btree, so you might still not end up with the
wart of having hash opclasses without hash indexes (an idea that has
been proposed and rejected at least once before now). Andres?


Yes, that was what I was pretty much thinking. I was kind of guessing
that this might be easiest implemented as a separate AM ("hash2" ;))
that's just a layer ontop of nbtree.

Greetings,

Andres Freund


Hi,

There have been benchmarks posted over the years where even the
non-WAL-logged hash outperformed the btree usage variant. You cannot
argue against O(1) algorithm behavior. We need to have a usable hash
index so that others can help improve it.


My 2 cents.

Regards,
Ken






Re: [HACKERS] [PATCH] parallel & isolated makefile for plpython

2016-10-01 Thread Tom Lane
Pavel Raiskup  writes:
> Hi, we observed issues with parallel make during RPM build in plpython,
> seems like the attached patch 0002 should help.  Feel free to reject 0001,
> but comment like that would save some time to me as a "newcomer" into that
> Makefile.

Hi Pavel!

> The submake-generated-headers can't be used as 'all' prerequisite
> at the same level with 'all-lib', because
> 'submake-generated-headers' is actually prerequisite for
> 'all-libs' in this makefile.  So use submake-generated-headers as
> $(OBJS) prerequisite, as several object files depend on it.

D'oh.  This is my fault, IIRC.  Will fix.

> Move the 'all' target before include statement, according to
> documentation in Makefile.shlib.

Hm, actually that's unnecessary because Makefile.global already
established 'all' as the default target.  I'm inclined to think
that the comment in Makefile.shlib is wrong and should be removed
or at least rewritten, because I doubt it would work at all to
include Makefile.shlib before Makefile.global; certainly that's
not the intended usage.

> ! # gram.y=>gram.c.  Run `info make -n "Empty Recipes"` for more info.
  
I'm okay with pointing people at that part of the gmake docs, but don't
care for insisting on exactly one way of accessing those docs.  How
about something like "... See "Empty Recipes" in the GNU Make docs for
more info."?

[ checks around... ]  Wait, actually there's a bigger problem: this
advice is apparently gmake-version-specific.  I don't see any
such section title in RHEL6's version of the gmake docs (make 3.81).
I do find a section titled "Using Empty Commands", which looks like
it might be the same thing, but a person looking for "Empty Recipes" is
not going to find that.  Is it worth the trouble to provide this pointer
if we need a lot of weasel wording about how the section name is spelled?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench more operators & functions

2016-10-01 Thread Fabien COELHO


Hello Stephen,


bitwise: <<, >>, &, |, ^/#, ~
comparisons: =/==, <>/!=, <, <=, >, >=
logical: and/&&, or/||, xor/^^, not, !


I'm not sure that we want to introduce operators '&&', '||' as logical
'and' and 'or' when those have specific meaning in PG which is different
(array overlaps and concatenation, specifically).  I can certainly see
how it could be useful to be able to perform string concatenation in a
\set command in pgbench, for example..

Also, are we sure that we want to have both '=' and '==' for comparison?


The reason I added C operators is that Pg SQL has '!=' and a few others, 
so once some are there I do not see why others should not be there as well 
for consistency.


Now I agree that having operators which behave differently from psql to 
pgbench is a bad idea.


In the attached patch I only included pg operators, plus "xor", which I 
feel is missing and does not seem to do any harm.



[...] I certainly understand how the if() function could be useful


Indeed, some kind of "if" is needed, for instance to implement "tpc-b" 
correctly.


, but I'm not entirely sure that the if(expression, true-result, 
false-result) is the best approach.  In particular, I recall cases with 
other languages where that gets to be quite ugly when there are multiple 
levels of if/else using that approach.


I think that pgbench is not a programming language, but an expression 
language, which means that you have to provide a result, so you can only 
have balanced if/then/else, which simplifies things a bit compared to 
"full" language.


The SQL syntax for CASE is on the very heavy side and would be quite 
complicated to implement in pgbench, so I rejected that and selected the 
simplest possible function for the job.


Maybe the "if" function could be named something fancier to avoid possible 
future clash on an eventual "if" keyword? Note that excel has an IF 
function. Maybe "conditional"... However, as I do not envision pgbench 
growing to a full language, I would be fine keeping "if" as it is.
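
For illustration, with the patch a script can do things like this (sketch
only; the variables and amounts are made up, and "if" is the function
proposed here, not something already in pgbench):

  \set aid random(1, 100000 * :scale)
  \set delta random(-5000, 5000)
  \set bonus if(:delta > 0 and :aid <= 1000, 100, 0)
  UPDATE pgbench_accounts SET abalance = abalance + :delta + :bonus WHERE aid = :aid;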


The table should really have a row per operator, which is what we do in 
pretty much all of the core documentation. [...]


Ok for one line per operator.


The question which was brought up about having a line-continuation
(eg: '\') character would be useful,


:-) I submitted a patch for that, which got rejected and resulted in Tom 
making pgbench share the psql lexer. This is a good thing, although the 
continuation extension was lost in the process. Also this means that now 
such logic would probably be shared with psql, not necessarily a bad thing 
either, but clearly material for another patch.


which really makes me wonder if we shouldn't try to come up with a way 
to create functions in a pgbench script that can later be used in a \set 
command.


I do not think that a pgbench script is a good target to define functions, 
because the script is implicitly in a very large loop which executes 
it over and over again. I do not think that it would make much sense to 
redefine functions on each transaction... So my opinion is that without a 
compelling use case, I would avoid such a feature, which would need some 
careful thinking on the design side, and the implementation of which is 
non-trivial.


With such an approach, we could have proper control structures 
for conditionals (if/else), loops (while/for), etc, and not complicate 
the single-statement set of operators with those constructs.


If you want control, you can already use a DO & PL/pgSQL script, although 
it is interpreted server-side. Yet again, I'm not convinced that pgbench is 
material for such features, and it would take a compelling use case for me 
to add such a thing. Moreover, the currently simple internal data 
structures and executor would have to be profoundly reworked and extended 
to handle a full language and functions.
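
By that I mean simply putting a DO block into the custom script, e.g.
(sketch; the branch update is a made-up example, and everything inside the
block runs server-side):

  DO $$
  BEGIN
    IF random() < 0.5 THEN
      UPDATE pgbench_branches SET bbalance = bbalance + 1 WHERE bid = 1;
    END IF;
  END $$;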



Lastly, we should really add some regression tests to pgbench.  We
already have the start of a TAP test script which it looks like it
wouldn't be too hard to add on to.


I agree that pgbench should be tested systematically. I think that this is 
not limited to these functions and operators but a more general and 
desirable feature, and thus really material for another patch.


Attached version changes:
 - removes C operators not present in psql
 - document operators one per line

--
Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 285608d..c29966f 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -821,9 +821,8 @@ pgbench  options  dbname
   The expression may contain integer constants such as 5432,
   double constants such as 3.14159,
   references to variables :variablename,
-  unary operators (+, -) and binary operators
-  (+, -, *, /,
-  %) with their usual precedence and associativity,
+  operators
+  with their usual precedence and associativity,
   function calls, and
   parentheses.
  
@@ -909,6 

Re: [HACKERS] PoC: Make it possible to disallow WHERE-less UPDATE and DELETE

2016-10-01 Thread Michael Paquier
On Sat, Oct 1, 2016 at 5:08 AM, Thomas Munro
 wrote:
> I guess you need something involving query_tree_walker or some other
> kind of recursive traversal if you want to find DELETE/UPDATE lurking
> in there.
>
> One option would be to document it as working for top level DELETE or
> UPDATE, and leave the recursive version as an improvement for a later
> patch.  That's the most interesting kind to catch because that's what
> people are most likely to type directly into a command line.

That would be a half-baked feature then, and the patch would end up
being refactored anyway if we support more cases in the future,
because those will need a tree walker (think CTAS, INSERT SELECT,
using WITH queries that contain DMLs)... Personally I think that it
would be surprising if subqueries are not restrained. So I am -1 for
the patch as-is, and let's come up with the most generic approach.
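
For instance, something like this (illustration only, not taken from the
patch) would not be caught by a check on the top-level command type,
because the statement is a SELECT even though its CTE deletes every row:

  WITH wipe AS (DELETE FROM foo RETURNING *)
  SELECT count(*) FROM wipe;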

Having more regression tests would be a good idea as well. I am
marking the patch as returned with feedback; this CF has already
ended anyway.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY command with RLS bug

2016-10-01 Thread Michael Paquier
On Sat, Oct 1, 2016 at 3:11 AM, Stephen Frost  wrote:
> Comments and testing welcome, of course, though it's looking pretty good
> to me at this point and I'll likely commit it in another day or two
> unless issues are found.

+* nodes/value.h) that correspond to the column name
"correspondS"?

Except for that, it looks good to me.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On conflict update & hint bits

2016-10-01 Thread Peter Geoghegan
On Fri, Sep 30, 2016 at 5:33 PM, Konstantin Knizhnik
 wrote:
> So the question is whether it is correct that ExecOnConflictUpdate tries to
> access and update tuple without holding lock on the buffer?

You're right -- this is a bug in Postgres.

I'm travelling from Ireland to the USA this weekend, but will work on
this early next week. I don't think it's a particularly tricky fix --
as you say, it is necessary to have at least a shared buffer lock to
call HeapTupleSatisfiesVisibility(), and we quite simply fail to
ensure that. We must have a shared buffer lock in the visibility-check
path for ON CONFLICT DO UPDATE where isolation level > READ COMMITTED
-- a buffer pin is not enough.
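
Roughly speaking, the visibility test needs to run under the buffer's
content lock, something like the sketch below (not the actual patch, and
the helper name is made up):

  /* Sketch only: check tuple visibility under a share lock on the buffer. */
  static bool
  ExecCheckTupleVisibleLocked(EState *estate, HeapTuple tuple, Buffer buffer)
  {
      bool        visible;

      LockBuffer(buffer, BUFFER_LOCK_SHARE);
      visible = HeapTupleSatisfiesVisibility(tuple, estate->es_snapshot, buffer);
      LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

      return visible;
  }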

It also looks like the DO NOTHING variant is similarly affected, even
when the isolation level is READ COMMITTED. :-(

Thanks for the detailed report.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PoC: Make it possible to disallow WHERE-less UPDATE and DELETE

2016-10-01 Thread Michael Paquier
On Sat, Oct 1, 2016 at 5:08 AM, Thomas Munro
 wrote:
> Right.  These cases work because they show up as CMD_DELETE/CMD_UPDATE:
>
> postgres=# set require_where.delete = on;
> SET
> postgres=# with answer as (select 42) delete from foo;
> ERROR:  DELETE requires a WHERE clause when require_where.delete is set to on
> HINT:  To delete all rows, use "WHERE true" or similar.
> postgres=# prepare x as delete from foo;
> ERROR:  DELETE requires a WHERE clause when require_where.delete is set to on
> HINT:  To delete all rows, use "WHERE true" or similar.

Is this patch able to handle the case of DML queries using RETURNING
in COPY? Those are authorized since 9.6, and it has not been mentioned
yet on this thread. Going through the patch quickly, I guess that it
would not work.
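
For example, something along these lines (made-up table, 9.6 syntax):

  COPY (DELETE FROM foo RETURNING *) TO STDOUT;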
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] VACUUM's ancillary tasks

2016-10-01 Thread Thomas Munro
On Mon, Aug 29, 2016 at 1:26 PM, Vik Fearing  wrote:
> The attached two patches scratch two itches I've been having for a
> while.  I'm attaching them together because the second depends on the first.
>
> Both deal with the fact that [auto]vacuum has taken on more roles than
> its original purpose.
>
>
> Patch One: autovacuum insert-heavy tables
>
> If you have a table that mostly receives INSERTs, it will never get
> vacuumed because there are no (or few) dead rows.  I have added an
> "inserts_since_vacuum" field to PgStat_StatTabEntry which works exactly
> the same way as "changes_since_analyze" does.
>
> The reason such a table needs to be vacuumed is currently twofold: the
> visibility map is not updated, slowing down index-only scans; and BRIN
> indexes are not maintained, rendering them basically useless.

I'm aware of those two problems, but not very familiar with the
details.  I don't feel qualified to say whether insert counting is the
best approach to the problem at this point.  I looked into it a little
bit however, and had the following thoughts:

About BRIN indexes:  I couldn't find an explanation of why BRIN
indexes don't automatically create new summary tuples when you insert
a new tuple in an unsummarised page range.  Is it deferred until
VACUUM time in order to untangle some otherwise unresolvable
interlocking or crash safety problem, or could that one day be done?
Assuming that it must be deferred for some technical reason and there
is no way around it, then I wonder if there is a more direct and
accurate way to figure out when it's necessary than counting inserts.
Counting inserts seems slightly bogus because you can't tell whether
those were inserts into an existing summarised block which is
self-maintaining or not.  At first glance it looks a bit like
unsummarised ranges can only appear at the end of the table, is that
right?  If so, couldn't you detect the number of unsummarised BRIN
blocks just by comparing the highest summarised BRIN block and the
current heap size?
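
(As an aside, 9.6 can already summarise any new ranges on demand without
waiting for VACUUM, e.g. -- illustration only, the index name is made up:

  SELECT brin_summarize_new_values('foo_brin_idx'::regclass);

so whatever triggers that work doesn't necessarily have to be VACUUM.)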

About visibility maps:  How crazy would it be to estimate the number
of not-all-visible pages instead?  It would be less work to count that
since it would only increase when the *first* tuple is inserted into a
page that is currently all visible (ie when the bit is cleared), not
for every tuple inserted into any page like your inserts_since_vacuum
counter.  Another difference is that inserts_since_vacuum is reset
even if vacuum finds that it *can't* set the all-visible bit for a
given page yet because of some concurrent transaction.  In that case
the bit is still not set but autovacuum has no reason to be triggered
again.
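
(For reference, the pg_visibility extension that ships with 9.6 can show
the current picture directly; illustration only, the table name is made up:

  CREATE EXTENSION pg_visibility;
  SELECT count(*) FROM pg_visibility_map('foo') WHERE NOT all_visible;

That count of not-all-visible pages is roughly the quantity I mean.)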

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH] parallel & isolated makefile for plpython

2016-10-01 Thread Pavel Raiskup
Hi, we observed issues with parallel make during the RPM build of plpython;
it seems like the attached patch 0002 should help.  Feel free to reject 0001,
but a comment like that would save some time for a "newcomer" to that
Makefile, like me.

Pavel
From b8722c8fb1e3d5f752d75e4d0740d04793577185 Mon Sep 17 00:00:00 2001
From: Pavel Raiskup 
Date: Fri, 30 Sep 2016 14:26:24 +0200
Subject: [PATCH 1/2] Document that "Empty Recipe" is a standard thing in GNU
 make

---
 src/backend/parser/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/parser/Makefile b/src/backend/parser/Makefile
new file mode 100644
index fdd8485..92acc00
*** a/src/backend/parser/Makefile
--- b/src/backend/parser/Makefile
*** endif
*** 31,37 
  # shorthand for two otherwise separate rules.  To be safe for parallel
  # make, we must chain the dependencies like this.  The semicolon is
  # important, otherwise make will choose the built-in rule for
! # gram.y=>gram.c.
  
  gram.h: gram.c ;
  
--- 31,37 
  # shorthand for two otherwise separate rules.  To be safe for parallel
  # make, we must chain the dependencies like this.  The semicolon is
  # important, otherwise make will choose the built-in rule for
! # gram.y=>gram.c.  Run `info make -n "Empty Recipes"` for more info.
  
  gram.h: gram.c ;
  
-- 
2.7.4

From dbd3a57775e3b177d1a9c974d89862ab8a24b9f1 Mon Sep 17 00:00:00 2001
From: Pavel Raiskup 
Date: Sat, 1 Oct 2016 07:53:29 +0200
Subject: [PATCH 2/2] Make makefile for plpython really self-standing

The submake-generated-headers can't be used as 'all' prerequisite
at the same level with 'all-lib', because
'submake-generated-headers' is actually prerequisite for
'all-libs' in this makefile.  So use submake-generated-headers as
$(OBJS) prerequisite, as several object files depend on it.

Move the 'all' target before include statement, according to
documentation in Makefile.shlib.

This is follow-up for 548af97fcec5543603c20b61fec60f8147a05b29.
---
 src/pl/plpython/Makefile | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/src/pl/plpython/Makefile b/src/pl/plpython/Makefile
new file mode 100644
index 647b4b1..3a8b2c6
*** a/src/pl/plpython/Makefile
--- b/src/pl/plpython/Makefile
*** REGRESS = \
*** 93,102 
  
  REGRESS_PLPYTHON3_MANGLE := $(REGRESS)
  
! include $(top_srcdir)/src/Makefile.shlib
  
! all: submake-generated-headers all-lib
  
  
  install: all install-lib install-data
  
--- 93,103 
  
  REGRESS_PLPYTHON3_MANGLE := $(REGRESS)
  
! $(OBJS): | submake-generated-headers
  
! all: all-lib
  
+ include $(top_srcdir)/src/Makefile.shlib
  
  install: all install-lib install-data
  
-- 
2.7.4


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers