Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM

2017-11-03 Thread Tomas Vondra
Hi,

On 11/02/2017 06:45 PM, Alvaro Herrera wrote:
> Tomas Vondra wrote:
> 
>> Unfortunately, I think we still have a problem ... I've been wondering
>> if we end up producing correct indexes, so I've done a simple test.
> 
> Here's a proposed patch that should fix this problem (and does, in my
> testing).  Would you please give it a try?
> 
> This patch changes two things:
> 
> 1. in VACUUM or brin_summarize_new_values, we only process fully loaded
> ranges, and ignore the partial range at end of table.
> 
> 2. when summarization is requested on the partial range at the end of a
> table, we acquire extension lock on the rel, then compute relation size
> and run summarization with the lock held.  This guarantees that we don't
> miss any pages.  This is bad for concurrency though, so it's only done
> in that specific scenario.
> 

FWIW this patch fixes the issue for me - I can no longer reproduce the
bitmapscan vs. seqscan result discrepancies (even with the extra UPDATE
phase).
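
For anyone skimming the thread, the second part of the fix boils down to
roughly this (a sketch of the approach described above, not the actual
patch):

    /* summarization requested on the partial range at the end of the table */
    LockRelationForExtension(heapRel, ExclusiveLock);
    heapNumBlks = RelationGetNumberOfBlocks(heapRel);   /* can't grow while we hold the lock */
    summarize_range(indexInfo, state, heapRel, heapBlk, heapNumBlks);
    UnlockRelationForExtension(heapRel, ExclusiveLock);

while plain VACUUM simply skips the last, partially filled range (the
first part of the patch).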

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM

2017-10-31 Thread Tomas Vondra


On 10/31/2017 11:44 PM, Tomas Vondra wrote:
> ...
> Unfortunately, I think we still have a problem ... I've been wondering
> if we end up producing correct indexes, so I've done a simple test.
> 
> 1) create the table as before
> 
> 2) let the insert + vacuum run for some time, to see if there are
> crashes (result: no crashes after one hour, inserting ~92M rows)
> 
> 3) do a bunch of random updates on the data (while still doing the
> concurrent vacuum in another session)
> 
> 4) run a bunch of simple queries to compare the results, essentially
> 
>-- BRIN index
>SET enable_bitmapscan = on;
>SELECT COUNT(*) FROM brin_test WHERE a = $1;
> 
> 
>-- seq scan
>SET enable_bitmapscan = off;
>SELECT COUNT(*) FROM brin_test WHERE a = $1;
> 
> and unfortunately what I get is not particularly pleasant:
> 
> test=# set enable_bitmapscan = on;
> SET
> test=# select count(*) from brin_test where a = 0;
>  count
> ---
>   9062
> (1 row)
> 
> test=# set enable_bitmapscan = off;
> SET
> test=# select count(*) from brin_test where a = 0;
>  count
> ---
>   9175
> (1 row)
> 
> Attached is a SQL script with commands I used. You'll need to copy the
> commands into multiple psql sessions, though, to simulate concurrent
> activity.
> 

FWIW I can reproduce this on 9.5, and I don't even need to run the
UPDATE part. That is, INSERT + VACUUM running concurrently is enough to
produce broken BRIN indexes :-(


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM

2017-10-31 Thread Tomas Vondra
Hi,

On 10/31/2017 08:46 PM, Tom Lane wrote:
> I wrote:
>> maybe
>> we just have some run-of-the-mill bugs to find, like the off-the-end
>> bug I spotted in brin_doupdate.  There's apparently at least one
>> more, but given the error message it must be something like not
>> checking for a page to have turned into a revmap page.  Shouldn't
>> be too hard to find...
> 
> Actually, I think it might be as simple as the attached.
> brin_getinsertbuffer checks for the old page having turned into revmap,
> but the "samepage" path in brin_doupdate does not :-(
> 
> With this applied, Alvaro's version of the test case has survived
> without error for quite a bit longer than its former MTBF.  There
> might still be some issues though in other code paths.
> 

That does fix the crashes for me - I've been unable to reproduce any
even after one hour (it took a couple of minutes to crash before).

Unfortunately, I think we still have a problem ... I've been wondering
if we end up producing correct indexes, so I've done a simple test.

1) create the table as before

2) let the insert + vacuum run for some time, to see if there are
crashes (result: no crashes after one hour, inserting ~92M rows)

3) do a bunch of random updates on the data (while still doing the
concurrent vacuum in another session)

4) run a bunch of simple queries to compare the results, essentially

   -- BRIN index
   SET enable_bitmapscan = on;
   SELECT COUNT(*) FROM brin_test WHERE a = $1;


   -- seq scan
   SET enable_bitmapscan = off;
   SELECT COUNT(*) FROM brin_test WHERE a = $1;

and unfortunately what I get is not particularly pleasant:

test=# set enable_bitmapscan = on;
SET
test=# select count(*) from brin_test where a = 0;
 count
---
  9062
(1 row)

test=# set enable_bitmapscan = off;
SET
test=# select count(*) from brin_test where a = 0;
 count
---
  9175
(1 row)

Attached is a SQL script with commands I used. You'll need to copy the
commands into multiple psql sessions, though, to simulate concurrent
activity.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


brin-test.sql
Description: application/sql



Re: [HACKERS] PATCH: enabling parallel execution for cursors explicitly (experimental)

2017-10-31 Thread Tomas Vondra
Hi,

On 10/20/2017 03:23 PM, Robert Haas wrote:
>
> ...
> 
> The main points I want to make clearly understood is the current
> design relies on (1) functions being labeled correctly and (2) other
> dangerous code paths being unreachable because there's nothing that
> runs between EnterParallelMode and ExitParallelMode which could invoke
> them, except by calling a mislabeled function.  Your patch expands the
> vulnerability surface from "executor code that can be reached without
> calling a mislabeled function" to "any code that can be reached by
> typing an SQL command".  Just rejecting any queries that are
> parallel-unsafe probably closes a good chunk of the holes, but that
> still leaves a lot of code that's never been run in parallel mode
> before potentially now running in parallel mode - e.g. any DDL command
> you happen to type, transaction control commands, code that only runs
> when the server is idle like idle_in_transaction_timeout, cursor
> operations.  A lot of that stuff is probably fine, but it's got to be
> thought through.  Error handling might be a problem, too: what happens
> if a parallel worker is killed while the query is suspended?  I
> suspect that doesn't work very nicely at all.
> 

OK, understood, and thanks for explaining the possible issues. I do
appreciate that.

I still think it'd be valuable to support this, though, so I'm going to
spend more time on investigating what needs to be handled.

But maybe there's a simpler option - what if, while the PARALLEL cursor
is open, we only allow fetching from it and nothing else? That is, this
would work:

BEGIN;
...
DECLARE x PARALLEL CURSOR FOR SELECT * FROM t2 WHERE ...;
FETCH 1000 FROM x;
FETCH 1000 FROM x;
FETCH 1000 FROM x;
CLOSE x;
...
COMMIT;

but adding any other command between the DECLARE and the CLOSE would fail.
That should close all the holes with parallel-unsafe stuff, right?
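
One way to enforce that might be a check along these lines (just a sketch
- the exact placement and the parsetree variable are assumptions, not an
actual patch):

    /*
     * While a PARALLEL cursor keeps the transaction in parallel mode,
     * allow only FETCH and CLOSE of that cursor.
     */
    if (IsInParallelMode() &&
        !(IsA(parsetree, FetchStmt) || IsA(parsetree, ClosePortalStmt)))
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                 errmsg("only FETCH and CLOSE are allowed while a PARALLEL cursor is open")));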

Of course, this won't solve the issue with error handling / killing
suspended workers (which didn't occur to me before as a possible issue
at all, so thanks for pointing that out). But that's a significantly
more limited issue to fix than all the parallel-unsafe bits.

Now, I agree this is somewhat more limited than I hoped for, but OTOH it
still solves the issue I initially aimed for (processing large query
results in an efficient way).


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Re: Anyone have experience benchmarking very high effective_io_concurrency on NVME's?

2017-10-31 Thread Tomas Vondra
Hi,

On 10/31/2017 04:48 PM, Greg Stark wrote:
> On 31 October 2017 at 07:05, Chris Travers <chris.trav...@adjust.com> wrote:
>> Hi;
>>
>> After Andres's excellent talk at PGConf we tried benchmarking
>> effective_io_concurrency on some of our servers and found that those
>> which have a number of NVME storage volumes could not fill the I/O
>> queue even at the maximum setting (1000).
>
> And was the system still i/o bound? If the cpu was 100% busy then
> perhaps Postgres just can't keep up with the I/O system. It would
> depend on workload though, if you start many very large sequential
> scans you may be able to push the i/o system harder.
>
> Keep in mind effective_io_concurrency only really affects bitmap
> index scans (and to a small degree index scans). It works by issuing
> posix_fadvise() calls for upcoming buffers one by one. That gets
> multiple spindles active but it's not really going to scale to many
> thousands of prefetches (and effective_io_concurrency of 1000
> actually means 7485 prefetches). At some point those i/o are going
> to start completing before Postgres even has a chance to start
> processing the data.
>
Yeah, initiating the prefetches is not expensive, but it's not free
either. So there's a trade-off between time spent on prefetching and
processing the data.

I believe this may actually be illustrated using Amdahl's law - the I/O
is the parallel part, and processing the data is the serial part. And no
matter what you do, the device only has so much bandwidth, which defines
the maximum possible speedup (compared to the "no prefetch" case).
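
To put a rough formula on it: if p is the fraction of time spent waiting
on I/O (the part prefetching can hide) and s is the effective I/O
parallelism, the overall speedup is bounded by

    speedup = 1 / ((1 - p) + p / s)   <=   1 / (1 - p)

so once the I/O wait is mostly hidden, pushing effective_io_concurrency
higher can't buy much more - the serial (processing) part dominates.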

Furthermore, the device does not wait for all the I/O requests to be
submitted - it won't wait for 1000 requests and then go "OMG! There's a
lot of work to do!" It starts processing the requests as they arrive,
and some of them will complete before you're done with submitting the
rest, so you'll never see all the requests in the queue at once.

And of course, iostat and other tools only give you "average queue
length", which is mostly determined by the average throughput.

In my experience (on all types of storage, including SSDs and NVMe), the
performance quickly and significantly improves once you start increasing
the value (say, to 8 or 16, maybe 64). And then the gains become much
more modest - not because the device could not handle more, but because
the prefetch/processing ratio has reached its optimal value.

But all this is actually per-process - so if you can run multiple
backends (particularly ones doing bitmap index scans), I'm sure you'll
see the queues getting fuller.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] WIP: BRIN multi-range indexes

2017-10-30 Thread Tomas Vondra
Hi,

attached is a patch series that includes both the BRIN multi-range
minmax indexes discussed in this thread, and the BRIN bloom indexes
initially posted in [1].

It seems easier to just deal with a single patch series, although we may
end up adding just one of those proposed opclasses.

There are 4 parts:

0001 - Modifies bringetbitmap() to pass all scan keys to the consistent
function at once. This is actually needed by the multi-minmax indexes,
but not really required for the others.

I'm wondering if this is a safe change, considering it affects the BRIN
interface. I.e. custom BRIN opclasses (perhaps in extensions) will be
broken by this change. Maybe we should extend the BRIN API to support
two versions of the consistent function - one that processes scan keys
one by one, and the other one that processes all of them at once.

0002 - Adds BRIN bloom indexes, along with opclasses for all built-in
data types (or at least those that also have regular BRIN opclasses).

0003 - Adds BRIN multi-minmax indexes, but only with float8 opclasses
(which also includes timestamp etc.). That should be good enough for
now, but adding support for other data types will require adding some
sort of "distance" operator which is needed for merging ranges (to pick
the two "closest" ones). For float8 it's simply a subtraction.
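
To illustrate, such a support procedure for float8 could be as trivial as
this (an illustrative sketch - the name and exact signature are
assumptions, not taken from the posted patches):

    Datum
    brin_minmax_multi_distance_float8(PG_FUNCTION_ARGS)
    {
        float8      a = PG_GETARG_FLOAT8(0);
        float8      b = PG_GETARG_FLOAT8(1);

        /* distance between two boundary values, used to decide which
         * neighboring ranges to merge */
        PG_RETURN_FLOAT8(b - a);
    }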

0004 - Moves dealing with IS [NOT] NULL search keys from opclasses to
bringetbitmap(). The code was exactly the same in all opclasses, so
moving it to bringetbitmap() seems right. It also allows some nice
optimizations where we can skip the consistent() function entirely,
although maybe that's useless. Also, maybe there are opclasses that
actually need to deal with the NULL values in the consistent() function?
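
The logic being centralized is essentially what each opclass does today -
roughly this fragment of the per-key loop (based on the existing opclass
code, with illustrative variable names):

    /* inside the loop over scan keys in bringetbitmap() */
    if (key->sk_flags & SK_ISNULL)
    {
        if (key->sk_flags & SK_SEARCHNULL)
            addrange = bval->bv_allnulls || bval->bv_hasnulls;
        else if (key->sk_flags & SK_SEARCHNOTNULL)
            addrange = !bval->bv_allnulls;
        else
            addrange = false;   /* regular key with a NULL argument */

        /* no need to call the opclass consistent function at all */
        continue;
    }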


regards


[1]
https://www.postgresql.org/message-id/5d78b774-7e9c-c94e-12cf-fef51cc89b1a%402ndquadrant.com

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz
Description: application/gzip


0002-BRIN-bloom-indexes.patch.gz
Description: application/gzip


0003-BRIN-multi-range-minmax-indexes.patch.gz
Description: application/gzip


0004-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz
Description: application/gzip



Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM

2017-10-30 Thread Tomas Vondra


On 10/30/2017 09:03 PM, Tom Lane wrote:
> Michael Paquier <michael.paqu...@gmail.com> writes:
>> b1328d78f is close enough, but per what I see that's actually not
>> true. I have been able to reproduce the problem, which shows up within
>> a window of 2-4 minutes. Still sometimes it can take way longer, and I
>> spotted one test where it took 15 minutes to show up... So, bisecting
>> with a test that looks for core files for 20 minutes, I am seeing that
>> the following commit is actually at fault:
> 
>> commit 24992c6db9fd40f342db1f22747ec9e56483796d
>> Author: Tom Lane <t...@sss.pgh.pa.us>
>> Date:   Fri Sep 9 19:00:59 2016 -0400
> 
>> Rewrite PageIndexDeleteNoCompact into a form that only deletes 1 tuple.
> 
> [ scratches head ... ]  Not sure how that could've introduced any
> problems?  Will look.
> 

Not sure, but I can confirm Michael's findings - I've been unable to
reproduce the issue on 1a4be103a5 even after 20 minutes, and on
24992c6db9 it failed after only 2.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM

2017-10-29 Thread Tomas Vondra
FWIW I can reproduce this on REL_10_STABLE, but not on REL9_6_STABLE. So
it seems to be due to something that changed in the last release.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM

2017-10-29 Thread Tomas Vondra
Hi,

while doing some weekend hacking & testing on the BRIN patches I posted
recently, I ran into PANICs in VACUUM, when it summarizes data inserted
concurrently (in another session):

PANIC:  invalid index offnum: 186

Initially I thought it was a bug in my patches, but apparently it's not,
and I can reproduce it easily on current master (846fcc8516).

I'm not sure if/how this is related to the BRIN autosummarization issue
reported by Justin Pryzby about two weeks ago (thread [1]), but I don't
see any segfaults and the messages are always exactly the same.

[1]
https://www.postgresql.org/message-id/flat/20171014035732.GB31726%40telsasoft.com

Reproducing the issue is simple:

create table brin_test (a int, b bigint, c float,
d double precision, e uuid);
create index on brin_test using brin (a);
create index on brin_test using brin (b);
create index on brin_test using brin (c);
create index on brin_test using brin (d);
create index on brin_test using brin (e);

and then run

insert into brin_test select
 mod(i,10)/25,
 mod(i,10)/25,
 mod(i,10)/25,
 mod(i,10)/25,
md5((mod(i,10)/25)::text)::uuid
from generate_series(1,10) s(i) \watch 0.1

vacuum brin_test \watch 1

And sooner or later (for me it only takes a minute or two in most cases)
the VACUUM session should fail with the PANIC message mentioned above.
It always fails with the same message; only the value (offset) changes.

The stack trace always looks exactly the same - see the attachment.

At first it seemed the idxrel was always the index on 'e' (i.e. the UUID
column), but I have seen failures on the other indexes as well.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Core was generated by `postgres: postgres test [local] VACUUM   
 '.
Program terminated with signal SIGABRT, Aborted.
#0  0x7fcadc413167 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x7fcadc413167 in raise () from /lib64/libc.so.6
#1  0x7fcadc4145ea in abort () from /lib64/libc.so.6
#2  0x00a2409f in errfinish (dummy=0) at elog.c:557
#3  0x00a26734 in elog_finish (elevel=20, fmt=0xc2ee3d "invalid index 
offnum: %u") at elog.c:1378
#4  0x008a4e86 in PageIndexTupleDeleteNoCompact (page=0x7fcad4595080 
"\004", offnum=292) at bufpage.c:982
#5  0x00476b0c in brin_doupdate (idxrel=0x7fcad33700d8, 
pagesPerRange=128, revmap=0x21657f0, heapBlk=1676160, oldbuf=1442, oldoff=292, 
origtup=0x2169518, origsz=8, newtup=0x2165dc0, newsz=24, samepage=0 '\000') at 
brin_pageops.c:246
#6  0x00475b34 in summarize_range (indexInfo=0x2165940, 
state=0x21673d0, heapRel=0x7fcad336f050, heapBlk=1676160, heapNumBlks=1676189) 
at brin.c:1199
#7  0x00475ddd in brinsummarize (index=0x7fcad33700d8, 
heapRel=0x7fcad336f050, pageRange=4294967295, numSummarized=0x2166e20, 
numExisting=0x2166e20) at brin.c:1306
#8  0x00474e7d in brinvacuumcleanup (info=0x7ffdf5c28d70, 
stats=0x2166e10) at brin.c:794
#9  0x004f85d2 in index_vacuum_cleanup (info=0x7ffdf5c28d70, stats=0x0) 
at indexam.c:772
#10 0x006c59c1 in lazy_cleanup_index (indrel=0x7fcad33700d8, stats=0x0, 
vacrelstats=0x21656c0) at vacuumlazy.c:1651
#11 0x006c48f7 in lazy_scan_heap (onerel=0x7fcad336f050, options=1, 
vacrelstats=0x21656c0, Irel=0x2165550, nindexes=5, aggressive=0 '\000') at 
vacuumlazy.c:1334
#12 0x006c286e in lazy_vacuum_rel (onerel=0x7fcad336f050, options=1, 
params=0x7ffdf5c29450, bstrategy=0x218c710) at vacuumlazy.c:258
#13 0x006c2326 in vacuum_rel (relid=16385, relation=0x209ecd0, 
options=1, params=0x7ffdf5c29450) at vacuum.c:1543
#14 0x006c0942 in vacuum (options=1, relations=0x218c888, 
params=0x7ffdf5c29450, bstrategy=0x218c710, isTopLevel=1 '\001') at vacuum.c:340
#15 0x006c0455 in ExecVacuum (vacstmt=0x209edc0, isTopLevel=1 '\001') 
at vacuum.c:141
#16 0x008b4cb3 in standard_ProcessUtility (pstmt=0x209f110, 
queryString=0x209e2b0 "vacuum brin_test ", context=PROCESS_UTILITY_TOPLEVEL, 
params=0x0, queryEnv=0x0, dest=0x209f208, completionTag=0x7ffdf5c298d0 "") at 
utility.c:675
#17 0x008b4413 in ProcessUtility (pstmt=0x209f110, 
queryString=0x209e2b0 "vacuum brin_test ", context=PROCESS_UTILITY_TOPLEVEL, 
params=0x0, queryEnv=0x0, dest=0x209f208, completionTag=0x7ffdf5c298d0 "") at 
utility.c:357
#18 0x008b33b4 in PortalRunUtility (portal=0x211b9f0, pstmt=0x209f110, 
isTopLevel=1 '\001', setHoldSnapshot=0 '\000', dest=0x209f208, 
completionTag=0x7ffdf5c298d0 "") at pquery.c:1178
#19 0x008b35cc in PortalRunMulti (portal=0x211b9f0, isTopLevel=1 
'\001', setHoldSnapshot=0 '\000', dest=0x209f208, altdest=0x209f208, 
completionTag=0x7ffdf5c298d0 "") at pquery.c:1324
#20 0x0

Re: [HACKERS] WIP: BRIN bloom indexes

2017-10-27 Thread Tomas Vondra
hi,

On 10/28/2017 02:41 AM, Nico Williams wrote:
> On Fri, Oct 27, 2017 at 10:06:58PM +0200, Tomas Vondra wrote:
>>> + * We use an optimisation that initially we store the uint32 values 
>>> directly,
>>> + * without the extra hashing step. And only later filling the bitmap space,
>>> + * we switch to the regular bloom filter mode.
>>>
>>> I don't think that optimization is worthwhile.  If I'm going to be using
>>> a Bloom filter, it's because I expect to have a lot of rows.
>>>
>>> (Granted, if I CREATE TABLE .. LIKE .. INCLUDING INDEXES, and the new
>>> table just won't have lots of rows, then I might want this optimization,
>>> but I can always just drop the Bloom filter index, or not include
>>> indexes in the LIKE.)
>>
>> I think you're confusing "rows" and "distinct values". Furthermore, it's
>> rather irrelevant how many rows are in the table because BRIN indexes
>> split the table into "ranges" that are 1MB by default. And you can only
>> squash certain number of rows into such range.
>>
>> The idea of the optimization is to efficiently support cases where each
>> range contains only small number of distinct values - which is quite
>> common in the cases I described in my initial message (with bursts of
>> rows with the same IP / client identifier etc.).
> 
> What range?  It's just bits to set.
> 
> The point is that Bloom filters should ideally be about 50% full -- much
> less than that and you are wasting space, much more than than and your
> false positive rate becomes useless.

The range defined by BRIN indexes. BRIN indexes split the table into a
sequence of page ranges (128 pages by default, i.e. 1MB), and the bloom
filters are built on those table "chunks". So it's irrelevant how many
rows are in the table in total; what matters is just the page range.

> 
>>> Filter compression is not worthwhile.  You want to have a fairly uniform
>>> hash distribution, and you want to end up with a Bloom filter that's not
>>> much more than 50% full.  That means that when near 50% full the filter
>>> will not be very sparse at all.  Optimizing for the not common case
>>> doesn't seem worthwhile, and adds complexity.
>>
>> Properly sizing the bloom filter requires knowledge of many variables,
>> in particular the number of distinct values expected to be added to the
>> filter. But we don't really know that number, and we also don't even
>> know many values useful for estimating that (how many rows will fit into
>> a range, number of distinct values in the whole table, etc.)
> 
> This is why Scalable Bloom filters exist: so you can adapt.
> 
>> So the idea was to oversize the bloom filter, and then use the sparse
>> representation to reduce the size.
> 
> A space-efficient representation of sparse bitmaps is not as simple as a
> Scalable Bloom filter.
> 
> And a filter that requires user intervention to size correctly, or which
> requires rebuilding when it gets too full, is also not as convenient as
> a Scalable Bloom filter.
> 

Maybe. For now I'm fine with the simple relopts-based approach and don't
plan to spend much time on these improvements. Which is not to say that
they may not be worthwhile, but it's not the only thing I'm working on.

I also suspect there are practical implications, as some things are only
possible with equally-sized bloom filters. I believe the "union"
(missing in the current patch) will rely on that when merging bloom filters.
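
With equally-sized filters the union step itself is trivial - roughly
this (a sketch under that assumption, not code from the patch):

    static void
    bloom_filter_union(BloomFilter *a, const BloomFilter *b)
    {
        int     nbytes = (a->nbits + 7) / 8;
        int     i;

        Assert(a->nbits == b->nbits && a->nhashes == b->nhashes);

        /* OR-ing the bitmaps yields a filter representing both input sets */
        for (i = 0; i < nbytes; i++)
            a->bitmap[i] |= b->bitmap[i];
    }

With filters of different size (or different hashing) there's no such
shortcut - that's the practical implication I mean.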

>>> + * XXX We can also watch the number of bits set in the bloom filter, and
>>> + * then stop using it (and not store the bitmap, to save space) when the
>>> + * false positive rate gets too high.
>>>
>>> Ah, right, what you want is a "Scalable Bloom Filter".
>>
>> That's not what I had in mind. My idea was to size the bloom filter on
>> "best effort" basis, and then if one range gets a bit too inaccurate
>> then just get rid of the filter. If that happens for many ranges, we
>> should probably allow specifying parameters as relopts for the index.
> 
> Scalable Bloom filters are way more convenient than that.  They're
> always not-too-full, and only the open filter is less-than-full-enough.
> 
> And since you should grow them somewhat exponentially (but not quite as
> much as a doubling in each generation), there should never be too many
> filters to search.  But you can always "vacuum" (rebuild) the filter
> starting with the size of the last filter added prior to the vacuum.
> 
>> I think this is really an over-engineering, and I certainly don't plan
&

Re: [HACKERS] WIP: BRIN bloom indexes

2017-10-27 Thread Tomas Vondra
Hi,

On 10/27/2017 05:22 PM, Sokolov Yura wrote:
> 
> Hi, Tomas
> 
> BRIN bloom index is a really cool feature, that definitely should be in
> core distribution (either in contrib or builtin)!!!
> 
> Small suggestion for algorithm:
> 
> It is well known practice not to calculate whole hash function for every
> point, but use double hashing approach.
> Example of proving double hashing approach for bloom filters:
> https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf
> 
> So, instead of:
> for (i = 0; i < filter->nhashes; i++)
> {
>     /* compute the hashes with a seed, used for the bloom filter */
>     uint32 h = ((uint32)DatumGetInt64(hash_uint32_extended(value,
> i))) % filter->nbits;
> 
>     /* set or check bit h */
> }
> 
> such construction could be used:
> /* compute the hashes, used for the bloom filter */
> uint32 big_h = ((uint32)DatumGetInt64(hash_uint32_extended(value, i)));
> uint32 h = big_h % filter->nbits;
> /* ensures d is never >= filter->nbits */
> uint32 d = big_h % (filter->nbits - 1);
> 
> for (i = 0; i < filter->nhashes; i++)
> {
>     /* set or check bit h */
> 
>     /* compute next bit h */
>     h += d++;
>     if (h >= filter->nbits) h -= filter->nbits;
>     if (d == filter->nbits) d = 0;
> }
> 
> Modulo of one 64bit result by two coprime numbers (`nbits` and `nbits-1`)
> gives two good-quality functions usable for double hashing.
> 

OK, thanks for the suggestion.
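
For reference, a self-contained form of that probe loop might look like
this (illustrative only - the constant seed and the filter field names
are assumptions, not code from the patch):

    uint64  big_h = DatumGetUInt64(hash_uint32_extended(value, BLOOM_SEED));
    uint32  h = big_h % filter->nbits;
    uint32  d = big_h % (filter->nbits - 1);
    int     i;

    for (i = 0; i < filter->nhashes; i++)
    {
        /* set bit h (the lookup variant tests the same bits instead) */
        filter->bitmap[h / 8] |= (1 << (h % 8));

        /* advance to the next probe position (double hashing) */
        h += d++;
        if (h >= filter->nbits)
            h -= filter->nbits;
        if (d == filter->nbits)
            d = 0;
    }

i.e. a single call to the (more expensive) hash function per value,
instead of nhashes calls.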


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] WIP: BRIN bloom indexes

2017-10-27 Thread Tomas Vondra
Hi,

On 10/27/2017 07:17 PM, Nico Williams wrote:
> On Thu, Oct 19, 2017 at 10:15:32PM +0200, Tomas Vondra wrote:
> 
> A bloom filter index would, indeed, be wonderful.
> 
> Comments:
> 
> + * We use an optimisation that initially we store the uint32 values directly,
> + * without the extra hashing step. And only later filling the bitmap space,
> + * we switch to the regular bloom filter mode.
> 
> I don't think that optimization is worthwhile.  If I'm going to be using
> a Bloom filter, it's because I expect to have a lot of rows.
> 
> (Granted, if I CREATE TABLE .. LIKE .. INCLUDING INDEXES, and the new
> table just won't have lots of rows, then I might want this optimization,
> but I can always just drop the Bloom filter index, or not include
> indexes in the LIKE.)
> 

I think you're confusing "rows" and "distinct values". Furthermore, it's
rather irrelevant how many rows are in the table because BRIN indexes
split the table into "ranges" that are 1MB by default. And you can only
squash a certain number of rows into such a range.

The idea of the optimization is to efficiently support cases where each
range contains only a small number of distinct values - which is quite
common in the cases I described in my initial message (with bursts of
rows with the same IP / client identifier etc.).

> + * XXX Perhaps we could save a few bytes by using different data types, but
> + * considering the size of the bitmap, the difference is negligible.
> 
> A bytea is all that's needed.  See below.
> 
> + * XXX We could also implement "sparse" bloom filters, keeping only the
> + * bytes that are not entirely 0. That might make the "sorted" phase
> + * mostly unnecessary.
> 
> Filter compression is not worthwhile.  You want to have a fairly uniform
> hash distribution, and you want to end up with a Bloom filter that's not
> much more than 50% full.  That means that when near 50% full the filter
> will not be very sparse at all.  Optimizing for the not common case
> doesn't seem worthwhile, and adds complexity.
> 

Properly sizing the bloom filter requires knowledge of many variables,
in particular the number of distinct values expected to be added to the
filter. But we don't really know that number, and we also don't even
know many of the values useful for estimating it (how many rows will fit
into a range, the number of distinct values in the whole table, etc.).

So the idea was to oversize the bloom filter, and then use the sparse
representation to reduce the size.
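
For context, the textbook sizing formulas (standard bloom filter math,
nothing specific to this patch) show why the number of distinct values n
is the critical input - for a target false positive rate p, the optimal
filter has

    m = -n * ln(p) / (ln 2)^2     bits
    k = (m / n) * ln 2            hash functions

so without a reasonable estimate of n there's no good way to pick m and k
up front, hence the oversize-and-compress idea.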

> + * XXX We can also watch the number of bits set in the bloom filter, and
> + * then stop using it (and not store the bitmap, to save space) when the
> + * false positive rate gets too high.
> 
> Ah, right, what you want is a "Scalable Bloom Filter".
> 

That's not what I had in mind. My idea was to size the bloom filter on
a "best effort" basis, and then if one range gets a bit too inaccurate
then just get rid of the filter. If that happens for many ranges, we
should probably allow specifying parameters as relopts for the index.

> A Scalable Bloom filter is actually a series of Bloom filters where all
> but the newest filter are closed to additions, and the newest filter is
> where you do all the additions.  You generally want to make each new
> filter bigger than the preceding one because when searching a Scalable
> Bloom filter you have to search *all* of them, so you want to minimize
> the number of filters.
> 
> Eventually, when you have enough sub-filters, you may want to re-write
> the whole thing so that you start with a single sub-filter that is large
> enough.
> 
> The format of the bytea might then be something like:
> 
> [[[...]]
> 
> where the last bitmap is the filter that is open to additions.
>

I think this is really over-engineering, and I certainly don't plan
to extend the patch in this direction.

I do not expect these parameters (particularly the number of distinct
values in a range) to significantly change over time, so the easiest
solution is to provide a reloption to specify that number in
CREATE/ALTER index.

Alternatively, we could allow the summarization process to gather some
statistics (either on the whole range, or perhaps the whole table), and
store them somewhere for later use. For example to compute the number of
distinct values per range (in the existing data), and use that later.

> ...
> 
> Of course, for something like:
> 
>   SELECT a.*, b.* FROM a a JOIN b b USING (some_col);
> 
> a Bloom filter can only be used as an optimization to avoid using a
> slower index (or heap scan) on the inner table source.
> 
> What I'm getting at is that the query planner really needs to understand
> that a Bloom filter is a probabilistic data structure.
> 

It does, and

Re: [HACKERS] WIP: BRIN bloom indexes

2017-10-27 Thread Tomas Vondra
hi,

On 10/27/2017 09:34 AM, Simon Riggs wrote:
> On 27 October 2017 at 07:20, Robert Haas <robertmh...@gmail.com> wrote:
>> On Thu, Oct 19, 2017 at 10:15 PM, Tomas Vondra
>> <tomas.von...@2ndquadrant.com> wrote:
>>> Let's see a query like this:
>>>
>>> select * from bloom_test
>>>  where id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772';
>>>
>>> The minmax index produces this plan
>>>
>>>Heap Blocks: lossy=2061856
>>>  Execution time: 22707.891 ms
>>>
>>> Now, the bloom index:
>>>
>>>Heap Blocks: lossy=25984
>>>  Execution time: 338.946 ms
>>
>> It's neat to see BRIN being extended like this.  Possibly we could
>> consider making it a contrib module rather than including it in core,
>> although I don't have strong feelings about it.
> 
> I see that SP-GIST includes two operator classes in core, one default.
> 
> Makes sense to do the same thing with BRIN and add this new op class
> as a non-default option in core.
> 

Not sure "a number of in-core opclasses" is a good reason to (not) add
new ones. Also, we already have two built-in BRIN opclasses (minmax and
inclusion).

In general, "BRIN bloom" can be packaged as a contrib module (at least I
believe so). That's not the case for "BRIN multi-range", which also
requires some changes to the code in brin.c (but the rest can be moved
to a contrib module, of course).


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Burst in WAL size when UUID is used as PK while full_page_writes are enabled

2017-10-27 Thread Tomas Vondra


On 10/27/2017 07:56 AM, sanyam jain wrote:
> Hi,
> 
> I was reading the
> blog https://blog.2ndquadrant.com/on-the-impact-of-full-page-writes .
> 

For the record, I assume you're referring to this part:

With BIGSERIAL new values are sequential, and so get inserted to the
same leaf pages in the btree index. As only the first modification
to a page triggers the full-page write, only a tiny fraction of the
    WAL records are FPIs. With UUID it’s a completely different case, of
    course – the values are not sequential at all; in fact each insert is
    likely to touch a completely new index leaf page (assuming the
index is large enough).

> My queries:
> 
> How is the randomness of UUID likely to create new leaf pages in the btree index?
> In my understanding, as the size of UUID is 128 bits, i.e. twice that of
> BIGSERIAL, more pages will be required to store the same
> number of rows, and hence there can be an increase in WAL size due to FPW.
> When comparing index sizes in a local setup, the UUID index is ~2x greater
> in size.
> 

Perhaps this is just a poor choice of words on my side, but I wasn't
suggesting new leaf pages will be created but merely that the inserts
will touch a different (possibly existing) leaf page. That's a direct
consequence of the inherent UUID randomness.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] WIP: BRIN multi-range indexes

2017-10-25 Thread Tomas Vondra
Apparently I've managed to botch the git format-patch thing :-( Attached
are both patches (the first one adding BRIN bloom indexes, the other one
adding the BRIN multi-range). Hopefully I got it right this time ;-)

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From c27f02d2839e08ebcee852448ed3838c8932d2ea Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@2ndquadrant.com>
Date: Mon, 23 Oct 2017 22:48:58 +0200
Subject: [PATCH 1/2] brin bloom v1

---
 doc/src/sgml/brin.sgml   | 236 ++
 src/backend/access/brin/Makefile |   2 +-
 src/backend/access/brin/brin_bloom.c | 727 +++
 src/include/catalog/pg_amop.h|  59 +++
 src/include/catalog/pg_amproc.h  | 153 +++
 src/include/catalog/pg_opclass.h |  25 ++
 src/include/catalog/pg_opfamily.h|  20 +
 src/include/catalog/pg_proc.h|  10 +
 src/test/regress/expected/opr_sanity.out |   3 +-
 9 files changed, 1233 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/access/brin/brin_bloom.c

diff --git a/doc/src/sgml/brin.sgml b/doc/src/sgml/brin.sgml
index b7483df..49d53e1 100644
--- a/doc/src/sgml/brin.sgml
+++ b/doc/src/sgml/brin.sgml
@@ -118,6 +118,13 @@


 
+ abstime_bloom_ops
+ abstime
+ 
+  =
+ 
+
+
  abstime_minmax_ops
  abstime
  
@@ -129,6 +136,13 @@
  
 
 
+ int8_bloom_ops
+ bigint
+ 
+  =
+ 
+
+
  int8_minmax_ops
  bigint
  
@@ -180,6 +194,13 @@
  
 
 
+ bytea_bloom_ops
+ bytea
+ 
+  =
+ 
+
+
  bytea_minmax_ops
  bytea
  
@@ -191,6 +212,13 @@
  
 
 
+ bpchar_bloom_ops
+ character
+ 
+  =
+ 
+
+
  bpchar_minmax_ops
  character
  
@@ -202,6 +230,13 @@
  
 
 
+ char_bloom_ops
+ "char"
+ 
+  =
+ 
+
+
  char_minmax_ops
  "char"
  
@@ -213,6 +248,13 @@
  
 
 
+ date_bloom_ops
+ date
+ 
+  =
+ 
+
+
  date_minmax_ops
  date
  
@@ -224,6 +266,13 @@
  
 
 
+ float8_bloom_ops
+ double precision
+ 
+  =
+ 
+
+
  float8_minmax_ops
  double precision
  
@@ -235,6 +284,13 @@
  
 
 
+ inet_bloom_ops
+ inet
+ 
+  =
+ 
+
+
  inet_minmax_ops
  inet
  
@@ -258,6 +314,13 @@
  
 
 
+ int4_bloom_ops
+ integer
+ 
+  =
+ 
+
+
  int4_minmax_ops
  integer
  
@@ -269,6 +332,13 @@
  
 
 
+ interval_bloom_ops
+ interval
+ 
+  =
+ 
+
+
  interval_minmax_ops
  interval
  
@@ -280,6 +350,13 @@
  
 
 
+ macaddr_bloom_ops
+ macaddr
+ 
+  =
+ 
+
+
  macaddr_minmax_ops
  macaddr
  
@@ -291,6 +368,13 @@
  
 
 
+ macaddr8_bloom_ops
+ macaddr8
+ 
+  =
+ 
+
+
  macaddr8_minmax_ops
  macaddr8
  
@@ -302,6 +386,13 @@
  
 
 
+ name_bloom_ops
+ name
+ 
+  =
+ 
+
+
  name_minmax_ops
  name
  
@@ -313,6 +404,13 @@
  
 
 
+ numeric_bloom_ops
+ numeric
+ 
+  =
+ 
+
+
  numeric_minmax_ops
  numeric
  
@@ -324,6 +422,13 @@
  
 
 
+ pg_lsn_bloom_ops
+ pg_lsn
+ 
+  =
+ 
+
+
  pg_lsn_minmax_ops
  pg_lsn
  
@@ -335,6 +440,13 @@
  
 
 
+ oid_bloom_ops
+ oid
+ 
+  =
+ 
+
+
  oid_minmax_ops
  oid
  
@@ -366,6 +478,13 @@
  
 
 
+ float4_bloom_ops
+ real
+ 
+  =
+ 
+
+
  float4_minmax_ops
  real
  
@@ -377,6 +496,13 @@
  
 
 
+ reltime_bloom_ops
+ reltime
+ 
+  =
+ 
+
+
  reltime_minmax_ops
  reltime
  
@@ -388,6 +514,13 @@
  
 
 
+ int2_bloom_ops
+ smallint
+ 
+  =
+ 
+
+
  int2_minmax_ops
  smallint
  
@@ -399,6 +532,13 @@
  
 
 
+ text_bloom_ops
+ text
+ 
+  =
+ 
+
+
  text_minmax_ops
  text
  
@@ -410,6 +550,13 @@
  
 
 
+ tid_bloom_ops
+ tid
+ 
+  =
+ 
+
+
  tid_minmax_ops
  tid
  
@@ -421,6 +568,13 @@
  
 
 
+ timestamp_bloom_ops
+ timestamp without time zone
+ 
+  =
+ 
+
+
  timestamp_minmax_ops
  timestamp without time zone
  
@@ -432,6 +586,13 @@
  
 
 
+ timestamptz_bloom_ops
+ timestamp with time zone
+ 
+  =
+ 
+
+
  timestamptz_minmax_ops
  timestamp with t

[HACKERS] WIP: BRIN multi-range indexes

2017-10-25 Thread Tomas Vondra
 '1'::double precision))
   Rows Removed by Index Recheck: 3361
   Heap Blocks: lossy=64
   ->  Bitmap Index Scan on a_val_idx
 (cost=0.00..830.01 rows=10200 width=0)
 (actual time=9.274..9.274 rows=640 loops=1)
 Index Cond: ((val >= '100'::double precision) AND
  (val <= '1'::double precision))
 Planning time: 0.138 ms
 Execution time: 36.100 ms
(8 rows)

That is, the timings drop from 785ms/871ms to 9ms/36ms. The index is a
bit larger (1700kB instead of 150kB), but it's still orders of
magnitude smaller than the btree index (which is ~214MB in this case).

The index build is slower than the regular BRIN indexes (about
comparable to btree), but I'm sure it can be significantly improved.
Also, I'm sure it's not bug-free.

Two additional notes:

1) The patch does break the current BRIN indexes, because it requires
passing all SearchKeys to the "consistent" BRIN function at once
(otherwise we couldn't eliminate individual intervals in the summary),
while currently BRIN only deals with one SearchKey at a time. And I
haven't modified the existing brin_minmax_consistent() function (yeah,
I'm lazy, but this should be enough for interested people to try it out,
I believe).

2) It only works with float8 (and also timestamp data types) for now,
but it should be straightforward to add support for additional data
types. Most of that will be about adding catalog definitions anyway.


regards


-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/access/brin/brin_minmax_multi.c b/src/backend/access/brin/brin_minmax_multi.c
index 6ec3297..94d696e 100644
--- a/src/backend/access/brin/brin_minmax_multi.c
+++ b/src/backend/access/brin/brin_minmax_multi.c
@@ -29,9 +29,6 @@ typedef struct MinmaxMultiOpaque
 	FmgrInfo	strategy_procinfos[BTMaxStrategyNumber];
 } MinmaxMultiOpaque;
 
-#define		MODE_VALUES			1
-#define		MODE_RANGES			2
-
 #define		MINMAX_MAX_VALUES	64
 
 typedef struct MinmaxMultiRanges
@@ -39,12 +36,13 @@ typedef struct MinmaxMultiRanges
 	/* varlena header (do not touch directly!) */
 	int32	vl_len_;
 
-	int		mode;		/* values or ranges in the data array */
+	/* maxvalues >= (2*nranges + nvalues) */
 	int		maxvalues;	/* maximum number of values in the buffer */
-	int		numvalues;	/* number of values in the data buffer */
+	int		nranges;	/* number of ranges stored in the array */
+	int		nvalues;	/* number of values in the data array */
 
 	/* values stored for this range - either raw values, or ranges */
-	Datum 	data[FLEXIBLE_ARRAY_MEMBER];
+	Datum 	values[FLEXIBLE_ARRAY_MEMBER];
 
 } MinmaxMultiRanges;
 
@@ -68,13 +66,11 @@ minmax_multi_init(int maxvalues)
 	/*
 	 * Allocate the range list with space for the max number of values.
 	 */
-	len = offsetof(MinmaxMultiRanges, data) + maxvalues * sizeof(Datum);
+	len = offsetof(MinmaxMultiRanges, values) + maxvalues * sizeof(Datum);
 
 	ranges = (MinmaxMultiRanges *) palloc0(len);
 
-	ranges->mode = MINMAX_MAX_VALUES;
 	ranges->maxvalues = maxvalues;
-	ranges->mode = MODE_VALUES;	/* start by accumulating values */
 
 	SET_VARSIZE(ranges, len);
 
@@ -87,6 +83,12 @@ typedef struct compare_context
 	Oid			colloid;
 } compare_context;
 
+typedef struct DatumRange {
+	Datum	minval;
+	Datum	maxval;
+	bool	collapsed;
+} DatumRange;
+
 static int
 compare_values(const void *a, const void *b, void *arg)
 {
@@ -109,27 +111,77 @@ compare_values(const void *a, const void *b, void *arg)
 	return 0;
 }
 
+static int
+compare_ranges(const void *a, const void *b, void *arg)
+{
+	DatumRange ra = *(DatumRange *)a;
+	DatumRange rb = *(DatumRange *)b;
+	Datum r;
+
+	compare_context *cxt = (compare_context *)arg;
+
+	r = FunctionCall2Coll(cxt->cmpFn, cxt->colloid, ra.minval, rb.minval);
+
+	if (DatumGetBool(r))
+		return -1;
+
+	r = FunctionCall2Coll(cxt->cmpFn, cxt->colloid, rb.minval, ra.minval);
+
+	if (DatumGetBool(r))
+		return 1;
+
+	return 0;
+}
+
+/*
 static void
 print_range(char * label, int numvalues, Datum *values)
 {
-	int i;
+	int idx;
 	StringInfoData str;
 
 	initStringInfo(&str);
 
-	for (i = 0; i < (numvalues/2); i++)
+	idx = 0;
+	while (idx < 2*ranges->nranges)
+	{
+		if (idx == 0)
+			appendStringInfoString(&str, "RANGES: [");
+		else
+			appendStringInfoString(&str, ", ");
+
+		appendStringInfo(&str, "%d => [%.9f, %.9f]", idx/2, DatumGetFloat8(values[idx]), DatumGetFloat8(values[idx+1]));
+
+		idx += 2;
+	}
+
+	if (ranges->nranges > 0)
+		appendStringInfoString(&str, "]");
+
+	if ((ranges->nranges > 0) && (ranges->nvalues > 0))
+		appendStringInfoString(&str, " ");
+
+	while (idx < 2*ranges->nranges + ranges->nvalues)
 	{
-		if (i > 0)
+		if (idx == 2*ranges->nranges)
+			appendStringInfoString(&str, "VALUES: [");
+		else
 			append

[HACKERS] CUBE seems a bit confused about ORDER BY

2017-10-19 Thread Tomas Vondra
Hi,

I've noticed this suspicious behavior of "cube" data type with ORDER BY,
which I believe is a bug in the extension (or the GiST index support).
The following example comes directly from regression tests added by
33bd250f (so CC Teodor and Stas, who are mentioned in the commit).

This query should produce results with ordering "ascending by 2nd
coordinate or upper right corner". To make it clear, I've added the
"c~>4" expression to the query, otherwise it's right from the test.

test=# SELECT c~>4 "c~>4", * FROM test_cube ORDER BY c~>4 LIMIT 15;
 c~>4 | c
--+---
   50 | (30333, 50),(30273, 6)
   75 | (43301, 75),(43227, 43)
  142 | (19650, 142),(19630, 51)
  160 | (2424, 160),(2424, 81)
  171 | (3449, 171),(3354, 108)
  155 | (18037, 155),(17941, 109)
  208 | (28511, 208),(28479, 114)
  217 | (19946, 217),(19941, 118)
  191 | (16906, 191),(16816, 139)
  187 | (759, 187),(662, 163)
  266 | (22684, 266),(22656, 181)
  255 | (24423, 255),(24360, 213)
  249 | (45989, 249),(45910, 222)
  377 | (11399, 377),(11360, 294)
  389 | (12162, 389),(12103, 309)
(15 rows)

As you can see, it's not actually sorted by the c~>4 coordinate (but by
 c~>2, which is the last number).

Moreover, disabling index scans fixes the ordering:

test=# set enable_indexscan = off;
SET
test=# SELECT c~>4, * FROM test_cube ORDER BY c~>4 LIMIT 15; --
ascending by 2nd coordinate or upper right corner
 ?column? | c
--+---
   50 | (30333, 50),(30273, 6)
   75 | (43301, 75),(43227, 43)
  142 | (19650, 142),(19630, 51)
  155 | (18037, 155),(17941, 109)
  160 | (2424, 160),(2424, 81)
  171 | (3449, 171),(3354, 108)
  187 | (759, 187),(662, 163)
  191 | (16906, 191),(16816, 139)
  208 | (28511, 208),(28479, 114)
  217 | (19946, 217),(19941, 118)
  249 | (45989, 249),(45910, 222)
  255 | (24423, 255),(24360, 213)
  266 | (22684, 266),(22656, 181)
  367 | (31018, 367),(30946, 333)
  377 | (11399, 377),(11360, 294)
(15 rows)


Seems like a bug somewhere in gist_cube_ops, I guess?


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] WIP: BRIN bloom indexes

2017-10-19 Thread Tomas Vondra
Hi,

The BRIN minmax opclasses work well only for data where the column is
somewhat correlated to physical location in a table. So it works great
for timestamps in append-only log tables, for example. When that is not
the case (non-correlated columns) the minmax ranges get very "wide" and
we end up scanning large portions of the tables.

A typical example are columns with types that are inherently random (or
rather non-correlated) like for example UUIDs, IP or MAC addresses, and
so on. But it can just as easily happen even with simple IDs generated
from a sequence.

This WIP patch implements another type of BRIN index, with the "summary"
represented by a bloom filter instead of a [min,max] range for each page
range. The index is of course larger than the
regular "minmax" BRIN index, but it's still orders of magnitude smaller
than a btree index.
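
The per-range summary is then roughly a struct like this (an illustrative
sketch only - the actual struct in the patch may differ in details):

    typedef struct BloomFilter
    {
        int32   vl_len_;        /* varlena header (do not touch directly) */
        uint16  nhashes;        /* number of hash functions */
        uint32  nbits;          /* size of the bitmap, in bits */
        uint32  nbits_set;      /* number of bits currently set */
        char    bitmap[FLEXIBLE_ARRAY_MEMBER];  /* the filter itself */
    } BloomFilter;

stored in the BRIN tuple for each page range instead of the usual
[min,max] pair.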

Note: This is different from the index type implemented in the "bloom"
extension. Those are not related to BRIN at all, and the index builds a
bloom filter on individual rows (values in different columns).

BTW kudos to everyone who worked on the BRIN infrastructure and API. I
found it very intuitive and well-documented. Adding the new opclass was
extremely straightforward.


To demonstrate this on a small example, consider this table:

CREATE TABLE bloom_test (id uuid, padding text);

INSERT INTO bloom_test
SELECT md5((mod(i,100)/100)::text)::uuid, md5(i::text)
  FROM generate_series(1,2) s(i);

VACUUM ANALYZE bloom_test;

 List of relations
 Schema |Name| Type  | Owner | Size  | Description
++---+---+---+-
 public | bloom_test | table | tomas | 16 GB |
(1 row)

The table was built so that each range contains a relatively small number
of distinct UUID values - this is typical e.g. for user activity with
"bursts" of actions from a particular user separated by long periods of
inactivity (with actions from other users).

Now, let's build BRIN "minmax", BRIN "bloom" and BTREE indexes on the id
column.

create index test_brin_idx on bloom_test
 using brin (id);

create index test_bloom_idx on bloom_test
 using brin (id uuid_bloom_ops);

create index test_btree_idx on bloom_test (id);

which gives us this:

  List of relations
 Schema |  Name  | Type  | Owner |   Table|  Size
++---+---++-
 public | test_bloom_idx | index | tomas | bloom_test | 12 MB
 public | test_brin_idx  | index | tomas | bloom_test | 832 kB
 public | test_btree_idx | index | tomas | bloom_test | 6016 MB
(3 rows)

So yeah - the "bloom" index is larger than the default "minmax" index,
but it's still orders of magnitude smaller than the regular btree one.

Now, what about query performance? Unlike the "minmax" indexes, the
"bloom" filter can only handle equality conditions.

Let's see a query like this:

select * from bloom_test
 where id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772';

The minmax index produces this plan

QUERY PLAN

 Bitmap Heap Scan on bloom_test
 (cost=390.31..2130910.82 rows=20593 width=49)
 (actual time=197.974..22707.311 rows=2 loops=1)
   Recheck Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
   Rows Removed by Index Recheck: 19998
   Heap Blocks: lossy=2061856
   ->  Bitmap Index Scan on test_brin_idx
   (cost=0.00..385.16 rows=5493161 width=0)
   (actual time=133.463..133.463 rows=20619520 loops=1)
 Index Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
 Planning time: 0.165 ms
 Execution time: 22707.891 ms
(8 rows)

Which is not that great, and the long duration is a direct consequence
of "wide" ranges - the bitmap heap scan had to scan pretty much the
whole table. (A sequential scan takes only about 15 seconds.)

Now, the bloom index:

QUERY PLAN

 Bitmap Heap Scan on bloom_test
 (cost=5898.31..2136418.82 rows=20593 width=49)
 (actual time=24.229..338.324 rows=2 loops=1)
   Recheck Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
   Rows Removed by Index Recheck: 2500448
   Heap Blocks: lossy=25984
   ->  Bitmap Index Scan on test_bloom_idx
 (cost=0.00..5893.16 rows=5493161 width=0)
 (actual time=22.811..22.811 rows=259840 loops=1)
 Index Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
 Planning time: 0.201 ms
 Execution time: 338.946 ms
(8 rows)

So, way better.

For comparison, a simple index scan / bitmap index scan using the btree
take only about ~10ms in t

Re: [HACKERS] PATCH: enabling parallel execution for cursors explicitly (experimental)

2017-10-17 Thread Tomas Vondra


On 10/17/2017 03:16 PM, Robert Haas wrote:
> On Sat, Oct 14, 2017 at 2:14 PM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> I propose is to add a new cursor option (PARALLEL), which would allow
>> parallel plans for that particular user-defined cursor. Attached is an
>> experimental patch doing this (I'm sure there are some loose ends).
> 
> I think you would need to do a huge amount of additional work in
> order to actually make this robust. I believe that a fair amount of
> what goes on in parallel mode right now is checked with elog() if we
> think that it's unreachable without writing C code -- but this will
> make a lot of those things reachable, which means they would need to
> be checked some other way.

Sure, additional checks may be necessary. I've tried to come up with
examples causing crashes, and haven't succeeded. Of course, that's no
proof of correctness, so if you have an example that will crash and burn
I'd like to see it.

In general, it may be acceptable to rely on the elog() checks - which is
pretty much what the FETCH+INSERT+SHARE example I shared in the first
message of this thread does. I agree it's not particularly convenient,
and perhaps we should replace it with checks while planning the queries.

> Also, I doubt this guarantees that we won't try to call
> parallel-unsafe functions will parallel mode is active, so things
> will just blow up in whatever way they do, maybe crashing the server
> or rendering the database inconsistent or whatever.
> 

Probably. What happens right now is that declaring the cursor switches
the whole transaction into parallel mode (EnterParallelMode), so if you
do FETCH + INSERT + FETCH that will fail with elog(ERROR).

But yeah, this should probably be checked at planning time, and the
whole query should fail if the transaction is in parallel mode and the
query contains unsafe/restricted functions.
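
Something along these lines, perhaps (only a sketch - it assumes a check
early in planning, reusing the existing max_parallel_hazard()
classification; not an actual patch):

    if (IsInParallelMode() &&
        max_parallel_hazard(parse) != PROPARALLEL_SAFE)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                 errmsg("cannot execute parallel-unsafe or parallel-restricted query while in parallel mode")));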

> Possibly I'm overestimating the extent of the danger, but I don't
> think so.  You're trying to take a mechanism that was only ever meant to
> be active during the course of one query and apply it for long
> periods of time during which a user can do anything, with basically no
> upgrade of the infrastructure.  I think something like this could be
> made to work if you put a large amount of energy into it, but I think
> the patch as proposed is about the easiest 3-5% of what would actually
> be required to cover all the holes.
> 

Well, soliciting feedback like this was one of the points of sharing the
experimental patch, so thank you for that. I'm not sure if the estimate
of 3-5% is accurate, but I'm sure the patch is incomplete - which is why
I marked it as experimental, after all.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] SIGSEGV in BRIN autosummarize

2017-10-17 Thread Tomas Vondra
On 10/17/2017 02:29 PM, Justin Pryzby wrote:
> On Tue, Oct 17, 2017 at 12:59:16PM +0200, Alvaro Herrera wrote:
>> Anyway, can give this patch a try?
> 
> I've only compiled postgres once before and this is a production environment
> (althought nothing so important that the crashes are a serious concern 
> either).
> 
> Is it reasonable to wget the postgres tarball, apply the patch, and run the
> compiled postgres binary from the source tree, without running make install or
> similar ?  Otherwise, would it be good enough to copy the postgres binary to
> /usr/pgsql-10/bin (and reinstall the binary package later) ?
> 

You don't have to install the binaries to the same location, i.e. you
can keep both the current and modified binaries.

./configure --prefix=/path/to/alternative/binaries --enable-debug
CFLAGS="..."

To get the same compilation options you can run pg_config, look for
the CONFIGURE line, and then just add or modify the --prefix option.

And after `make install` you can add it to $PATH and start the server
using those binaries.

$ export PATH=/path/to/alternative/binaries/bin:$PATH
$ which pg_ctl
$ pg_ctl -D $DATADIR stop
$ pg_ctl -D $DATADIR start


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] PATCH: enabling parallel execution for cursors explicitly (experimental)

2017-10-16 Thread Tomas Vondra


On 10/16/2017 01:13 PM, Amit Kapila wrote:
> On Sat, Oct 14, 2017 at 11:44 PM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> Hi,
>>
>>
>> I propose is to add a new cursor option (PARALLEL), which would allow
>> parallel plans for that particular user-defined cursor. Attached is an
>> experimental patch doing this (I'm sure there are some loose ends).
>>
> 
> How will this work for backward scans?
> 

It may not work, just like for regular cursors without SCROLL. And if
you specify SCROLL, then I believe a Materialize node will be added
automatically if needed, but I haven't tried it.

> 
>>
>> Regarding (2), if the user suspends the cursor for a long time, bummer.
>> The parallel workers will remain assigned, doing nothing. I don't have
>> any idea how to get around that, but I don't see how we could do better.
>>
> 
> One idea could be that whenever someone uses Parallel option, we can 
> fetch and store all the data locally before returning control to the 
> user (something like WITH HOLD option).
> 

I don't like that very much, as it adds unnecessary overhead. As I said
before, I'm particularly interested in cursors returning large amounts
of data, so the overhead may be quite significant.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] SIGSEGV in BRIN autosummarize

2017-10-15 Thread Tomas Vondra
Hi,

On 10/15/2017 03:56 AM, Justin Pryzby wrote:
> On Fri, Oct 13, 2017 at 10:57:32PM -0500, Justin Pryzby wrote:
...
>> It's a bit difficult to guess what went wrong from this backtrace. For
>> me gdb typically prints a bunch of lines immediately before the frames,
>> explaining what went wrong - not sure why it's missing here.
> 
> Do you mean this ?
> 
> ...
> Loaded symbols for /lib64/libnss_files-2.12.so
> Core was generated by `postgres: autovacuum worker process   gtt 
> '.
> Program terminated with signal 11, Segmentation fault.
> #0  pfree (pointer=0x298c740) at mcxt.c:954
> 954 (*context->methods->free_p) (context, pointer);
> 

Yes. So either 'context' is bogus. Or 'context->methods' is bogus. Or
'context->methods->free_p' is bogus.

>> Perhaps some of those pointers are bogus, the memory was already pfree-d
>> or something like that. You'll have to poke around and try dereferencing
>> the pointers to find what works and what does not.
>>
>> For example what do these gdb commands do in the #0 frame?
>>
>> (gdb) p *(MemoryContext)context
> 
> (gdb) p *(MemoryContext)context
> Cannot access memory at address 0x7474617261763a20
> 

OK, this means the memory context pointer (tracked in the header of a
chunk) is bogus. There are several common ways that can happen:

* Something corrupts memory (typically out-of-bounds write).

* The pointer got allocated in an incorrect memory context (which then
was released, and the memory was reused for new stuff).

* It's a use-after-free.

* ... various other possibilities ...
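
To illustrate why a clobbered header leads to exactly this kind of crash,
here is a simplified sketch of how pfree() locates the owning context. This
is not the actual PostgreSQL definition (the real one lives in memutils.h /
mcxt.c), just the general idea:

#include <stddef.h>

/* Simplified sketch - the pointer to the owning context is stored in a
 * small header immediately preceding the chunk, so anything overwriting
 * those few bytes (out-of-bounds write, use-after-free, ...) makes
 * pfree() chase a garbage pointer. */
typedef struct MemoryContextData *MemoryContext;

typedef struct ChunkHeaderSketch
{
    MemoryContext context;      /* owning context, read back by pfree() */
    size_t        size;         /* requested size (simplified) */
} ChunkHeaderSketch;

static MemoryContext
chunk_get_context(void *pointer)
{
    ChunkHeaderSketch *header = ((ChunkHeaderSketch *) pointer) - 1;

    return header->context;     /* bogus if the header was overwritten */
}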

> 
> I uploaded the corefile:
> http://telsasoft.com/tmp/coredump-postgres-autovacuum-brin-summarize.gz
> 

Thanks, but I'm not sure that'll help, at this point. We already know
what happened (corrupted memory), we don't know "how". And core files
are mostly just "snapshots" so are not very useful in answering that :-(

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SIGSEGV in BRIN autosummarize

2017-10-14 Thread Tomas Vondra
Hi,

On 10/15/2017 12:42 AM, Justin Pryzby wrote:
> On Fri, Oct 13, 2017 at 10:57:32PM -0500, Justin Pryzby wrote:
>> I don't have any reason to believe there's memory issue on the server, So I
>> suppose this is just a "heads up" to early adopters until/in case it happens
>> again and I can at least provide a stack trace.
> 
> I'm back; find stacktrace below.
> 
>> Today I see:
>> < 2017-10-13 17:22:47.839 -04  >LOG:  server process (PID 32127) was 
>> terminated by signal 11: Segmentation fault
>> < 2017-10-13 17:22:47.839 -04  >DETAIL:  Failed process was running: 
>> autovacuum: BRIN summarize public.gtt 747263
> 
> Is it a coincidence the server failed within 45m of yesterday's failure ?
> 

Most likely just a coincidence.

> postmaster[26500] general protection ip:84a177 sp:7ffd9b349b88 error:0 in 
> postgres[40+692000]
> < 2017-10-14 18:05:36.432 -04  >DETAIL:  Failed process was running: 
> autovacuum: BRIN summarize public.gtt 41087
> 
>> It looks like this table was being inserted into simultaneously by a python
>> program using multiprocessing.  It looks like each subprocess was INSERTing
>> into several tables, each of which has one BRIN index on timestamp column.
> 
> I should add:
> These are insert-only child tables in a heirarchy (not PG10 partitions), being
> inserted into directly (not via trigger/rule).
> 
> Also notice the vacuum process was interrupted, same as yesterday (think
> goodness for full logs).  Our INSERT script is using python
> multiprocessing.pool() with "maxtasksperchild=1", which I think means we load
> one file and then exit the subprocess, and pool() creates a new subproc, which
> starts a new PG session and transaction.  Which explains why autovacuum starts
> processing the table only to be immediately interrupted.
> 

I don't follow. Why does it explain that autovacuum gets canceled? I
mean, merely opening a new connection/session should not cancel
autovacuum. That requires a command that takes a table-level lock
conflicting with autovacuum (e.g. an explicit LOCK command, DDL, ...).

> ... 
> Due to a .."behavioral deficiency" in the loader for those tables, the crashed
> backend causes the loader to get stuck, so the tables should be untouched 
> since
> the crash, should it be desirable to inspect them.
> 

It's a bit difficult to guess what went wrong from this backtrace. For
me gdb typically prints a bunch of lines immediately before the frames,
explaining what went wrong - not sure why it's missing here.

Perhaps some of those pointers are bogus, the memory was already pfree-d
or something like that. You'll have to poke around and try dereferencing
the pointers to find what works and what does not.

For example what do these gdb commands do in the #0 frame?

(gdb) p *(MemoryContext)context
(gdb) p *GetMemoryChunkContext(pointer)

> #0  pfree (pointer=0x298c740) at mcxt.c:954
> context = 0x7474617261763a20
> #1  0x006a52e9 in perform_work_item (workitem=0x7f8ad1f94824) at 
> autovacuum.c:2676
> cur_datname = 0x298c740 "no 1 :vartype 1184 :vartypmod -1 :varcollid 
> 0 :varlevelsup 0 :varnoold 1 :varoattno 1 :location 146} {CONST :consttype 
> 1184 :consttypmod -1 :constcollid 0 :constlen 8 :constbyval true :constisnull 
> fal"...
> cur_nspname = 0x298c728 "s ({VAR :varno 1 :varattno 1 :vartype 1184 
> :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 1 :location 
> 146} {CONST :consttype 1184 :consttypmod -1 :constcollid 0 :constlen 8 
> :constbyv"...
> cur_relname = 0x298cd68 
> "cdrs_eric_msc_sms_2017_10_14_startofcharge_idx"
> __func__ = "perform_work_item"
> #2  0x006a6fd9 in do_autovacuum () at autovacuum.c:2533
...

cheers

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extended statistics is not working on Vars hidden under a RelabelType

2017-10-14 Thread Tomas Vondra


On 10/14/2017 07:49 PM, Robert Haas wrote:
> On Fri, Oct 13, 2017 at 4:49 PM, David Rowley
> <david.row...@2ndquadrant.com> wrote:
>> tps = 8282.481310 (including connections establishing)
>> tps = 8282.750821 (excluding connections establishing)
> 
> vs.
> 
>> tps = 8520.822410 (including connections establishing)
>> tps = 8521.132784 (excluding connections establishing)
>>
>> With the patch we are making use of the extended statistics, which
>> we do expect to be more work for the planner. Although, we didn't
>> add extended statistics to speed up the planner.
> 
> Sure, I understand. That's actually a pretty substantial regression
> - I guess that means that it's pretty important to avoid creating 
> extended statistics that are not needed, at least for short-running 
> queries.
> 

Well, it's only about a 3% difference in a single run, which may easily be
due to a slightly different binary layout, random noise etc. So I wouldn't
call that a "substantial regression", at least not based on this one test.

I've done more thorough testing, and what I see is a 1.0-1.2% drop, but on
a test that's rather extreme (statistics on an empty table). So again,
likely well within noise, and on larger tables it'll get even less
significant.

But of course - it's not free. It's a bit more work we need to do. But
if you don't need multi-column statistics, don't create them. If your
queries are already fast, you probably don't need them at all.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] PATCH: enabling parallel execution for cursors explicitly (experimental)

2017-10-14 Thread Tomas Vondra
Hi,

One of the existing limitations of parallel query is that cursors
generally do not benefit from it [1]. Commit 61c2e1a95f [2] improved the
situation for cursors from procedural languages, but unfortunately for
user-defined cursors parallelism is still disabled.

For many use cases that is perfectly fine, but for applications that
need to process large amounts of data this is rather annoying. When the
result sets are large, cursors are extremely efficient - in terms of
memory consumption, for example. So such applications have to choose
between the "cursor" approach (and no parallelism) and parallelism with
uncomfortably large result sets.

I believe there are two main reasons why parallelism is disabled for
user-defined cursors (or queries that might get suspended):

(1) We can't predict what will happen while the query is suspended (and
the transaction is still in "parallel mode"), e.g. the user might run
arbitrary DML which is not allowed.

(2) If the cursor gets suspended, the parallel workers would be still
assigned to it and could not be used for anything else.

Clearly, we can't solve those issues in general, so the default will
probably remain "parallelism disabled".

What I propose is to add a new cursor option (PARALLEL), which would allow
parallel plans for that particular user-defined cursor. Attached is an
experimental patch doing this (I'm sure there are some loose ends).

This does not make any of the issues go away, of course. We still
enforce "no DML while parallel operation in progress" as before, so this
will not work:

BEGIN;
DECLARE x PARALLEL CURSOR FOR SELECT * FROM t2 WHERE ...;
FETCH 1000 FROM x;
INSERT INTO t2 VALUES (1);
FETCH 1000 FROM x;
COMMIT;

but this will

BEGIN;
DECLARE x PARALLEL CURSOR FOR SELECT * FROM t2 WHERE ...;
FETCH 1000 FROM x;
...
FETCH 1000 FROM x;
CLOSE x;
INSERT INTO t2 VALUES (1);
COMMIT;

Regarding (2), if the user suspends the cursor for a long time, bummer.
The parallel workers will remain assigned, doing nothing. I don't have
any idea how to get around that, but I don't see how we could do better.
I don't see either of these limitations as fatal.

Any opinions / obvious flaws that I missed?

regards

[1]
https://www.postgresql.org/docs/9.6/static/when-can-parallel-query-be-used.html

[2]
https://www.postgresql.org/message-id/CAOGQiiMfJ%2B4SQwgG%3D6CVHWoisiU0%2B7jtXSuiyXBM3y%3DA%3DeJzmg%40mail.gmail.com

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/commands/portalcmds.c b/src/backend/commands/portalcmds.c
index 76d6cf1..ffaa096 100644
--- a/src/backend/commands/portalcmds.c
+++ b/src/backend/commands/portalcmds.c
@@ -66,6 +66,12 @@ PerformCursorOpen(DeclareCursorStmt *cstmt, ParamListInfo params,
 		RequireTransactionChain(isTopLevel, "DECLARE CURSOR");
 
 	/*
+	 * Enable parallel plans for cursors that explicitly requested it.
+	 */
+	if (cstmt->options & CURSOR_OPT_PARALLEL)
+		cstmt->options |= CURSOR_OPT_PARALLEL_OK;
+
+	/*
 	 * Parse analysis was done already, but we still have to run the rule
 	 * rewriter.  We do not do AcquireRewriteLocks: we assume the query either
 	 * came straight from the parser, or suitable locks were acquired by
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 9689429..64f8a32 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -423,6 +423,13 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
 	/* This should be run once and only once per Executor instance */
 	Assert(!estate->es_finished);
 
+	/* If this was PARALLEL cursor, do cleanup and exit parallel mode. */
+	if (queryDesc->parallel_cursor)
+	{
+		ExecShutdownNode(queryDesc->planstate);
+		ExitParallelMode();
+	}
+
 	/* Switch into per-query memory context */
 	oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
 
@@ -1085,6 +1092,18 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 
 	queryDesc->tupDesc = tupType;
 	queryDesc->planstate = planstate;
+
+	/* If this was PARALLEL cursor, enter parallel mode, except in EXPLAIN-only. */
+
+	queryDesc->parallel_cursor
+		= (eflags & EXEC_FLAG_PARALLEL) && !(eflags & EXEC_FLAG_EXPLAIN_ONLY);
+
+	/*
+	 * In PARALLEL cursors we have to enter the parallel mode once, at the very
+	 * beginning (and not in ExecutePlan, as we do for execute_once plans).
+	 */
+	if (queryDesc->parallel_cursor)
+		EnterParallelMode();
 }
 
 /*
@@ -1725,7 +1744,8 @@ ExecutePlan(EState *estate,
 		if (TupIsNull(slot))
 		{
 			/* Allow nodes to release or shut down resources. */
-			(void) ExecShutdownNode(planstate);
+			if (execute_once)
+(void) ExecShutdownNode(planstate);
 			break;
 		}
 
@@ -1772,7 +1792,8 @@ ExecutePlan(EState *estate,
 		if (numberTuples && numberTuples == current_tuple_count)
 		{
 			/* Allow nodes to release or shut down re

Re: [HACKERS] Extended statistics is not working on Vars hidden under a RelabelType

2017-10-13 Thread Tomas Vondra

On 10/13/2017 10:04 PM, Robert Haas wrote:
> On Mon, Oct 9, 2017 at 11:03 PM, David Rowley
> <david.row...@2ndquadrant.com> wrote:
>> -- Unpatched
>>  Planning time: 0.184 ms
>>  Execution time: 105.878 ms
>>
>> -- Patched
>>  Planning time: 2.175 ms
>>  Execution time: 106.326 ms
> 
> This might not be the best example to show the advantages of the
> patch, honestly.
> 

I'm not sure what exactly your point is. If you're suggesting this example
is bad because the planning time increased from 0.184 to 2.175 ms, then
consider that the plans were likely generated on an assert-enabled
build and on a laptop (both of which add quite a bit of noise to
occasional timings). The patch has no impact on planning time (at least
I've been unable to measure any).

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Bitmap Heap Scans segfaults due to (tbm->dsa==NULL) on PostgreSQL 10

2017-10-12 Thread Tomas Vondra


On 10/12/2017 02:40 PM, Dilip Kumar wrote:
> On Thu, Oct 12, 2017 at 4:31 PM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> Hi,
>>
>> It seems that Q19 from TPC-H is consistently failing with segfaults due
>> to calling tbm_prepare_shared_iterate() with (tbm->dsa==NULL).
>>
>> I'm not very familiar with how the dsa is initialized and passed around,
>> but I only see the failures when the bitmap is constructed by a mix of
>> BitmapAnd and BitmapOr operations.
>>
> I think I have got the issue, bitmap_subplan_mark_shared is not
> properly pushing the isshared flag to lower level bitmap index node,
> and because of that tbm_create is passing NULL dsa while creating the
> tidbitmap.  So this problem will come in very specific combination of
> BitmapOr and BitmapAnd when BitmapAnd is the first subplan for the
> BitmapOr.  If BitmapIndex is the first subplan under BitmapOr then
> there is no problem because BitmapOr node will create the tbm by
> itself and isshared is set for BitmapOr.
> 
> Attached patch fixing the issue for me.  I will thoroughly test this
> patch with other scenario as well.  Thanks for reporting.
> 

Yep, this fixes the failures for me.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Parallel Bitmap Heap Scans segfaults due to (tbm->dsa==NULL) on PostgreSQL 10

2017-10-12 Thread Tomas Vondra
Hi,

It seems that Q19 from TPC-H is consistently failing with segfaults due
to calling tbm_prepare_shared_iterate() with (tbm->dsa==NULL).

I'm not very familiar with how the dsa is initialized and passed around,
but I only see the failures when the bitmap is constructed by a mix of
BitmapAnd and BitmapOr operations.

Another interesting observation is that setting force_parallel_mode=on
may not be enough - there really need to be multiple parallel workers,
which is why the simple query sets cpu_tuple_cost=1.

Attached is a bunch of files:

1) details for "full" query:

* query.sql
* plan.txt
* backtrace.txt

2) details for the "minimal" query triggering the issue:

* query-minimal.sql
* plan-minimal.txt
* backtrace-minimal.txt



regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Program terminated with signal 6, Aborted.
#0  0x7fe21265d1f7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-196.el7.x86_64
(gdb) bt
#0  0x7fe21265d1f7 in raise () from /lib64/libc.so.6
#1  0x7fe21265e8e8 in abort () from /lib64/libc.so.6
#2  0x008468e7 in ExceptionalCondition 
(conditionName=conditionName@entry=0x9d213a "!(tbm->dsa != ((void *)0))", 
errorType=errorType@entry=0x88fc69 "FailedAssertion", 
fileName=fileName@entry=0x9d2014 "tidbitmap.c", 
lineNumber=lineNumber@entry=800) at assert.c:54
#3  0x0065b04f in tbm_prepare_shared_iterate (tbm=tbm@entry=0x2b244e8) 
at tidbitmap.c:800
#4  0x0062294a in BitmapHeapNext (node=node@entry=0x2adf118) at 
nodeBitmapHeapscan.c:155
#5  0x00616d7a in ExecScanFetch (recheckMtd=0x623050 
, accessMtd=0x622250 , node=0x2adf118) at 
execScan.c:97
#6  ExecScan (node=0x2adf118, accessMtd=0x622250 , 
recheckMtd=0x623050 ) at execScan.c:147
#7  0x00624c75 in ExecProcNode (node=0x2adf118) at 
../../../src/include/executor/executor.h:250
#8  gather_getnext (gatherstate=0x2aded50) at nodeGather.c:281
#9  ExecGather (pstate=0x2aded50) at nodeGather.c:215
#10 0x00610d12 in ExecProcNode (node=0x2aded50) at 
../../../src/include/executor/executor.h:250
#11 ExecutePlan (execute_once=, dest=0x2b09220, 
direction=, numberTuples=0, sendTuples=1 '\001', 
operation=CMD_SELECT, use_parallel_mode=, planstate=0x2aded50, 
estate=0x2adeb00) at execMain.c:1721
#12 standard_ExecutorRun (queryDesc=0x2a3bdf0, direction=, 
count=0, execute_once=) at execMain.c:363
#13 0x0074b50b in PortalRunSelect (portal=portal@entry=0x2a34050, 
forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, 
dest=dest@entry=0x2b09220) at pquery.c:932
#14 0x0074ca18 in PortalRun (portal=portal@entry=0x2a34050, 
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', 
run_once=run_once@entry=1 '\001', dest=dest@entry=0x2b09220, 
altdest=altdest@entry=0x2b09220, 
completionTag=completionTag@entry=0x7ffc8dad21c0 "") at pquery.c:773
#15 0x0074875b in exec_simple_query (
query_string=0x2a96ff0 "select\n*\nfrom\npart\nwhere\n(\n   
 p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')\nand p_size 
between 1 and 5\n)\nor\n(\np_container in ('MED BAG', 'MED 
B"...) at postgres.c:1099
#16 0x00749a03 in PostgresMain (argc=, 
argv=argv@entry=0x2a44048, dbname=0x2a43eb0 "test", username=) 
at postgres.c:4088
#17 0x0047665f in BackendRun (port=0x2a37cc0) at postmaster.c:4357
#18 BackendStartup (port=0x2a37cc0) at postmaster.c:4029
#19 ServerLoop () at postmaster.c:1753
#20 0x006d70d9 in PostmasterMain (argc=argc@entry=3, 
argv=argv@entry=0x2a14b20) at postmaster.c:1361
#21 0x004774c1 in main (argc=3, argv=0x2a14b20) at main.c:228

   QUERY PLAN   



 Gather
   Workers Planned: 2
   ->  Parallel Bitmap Heap Scan on part
 Recheck Cond: (((p_size <= 5) AND (p_size >= 1) AND (p_container = ANY 
('{"SM CASE","SM BOX","SM PACK","SM PKG"}'::bpchar[]))) OR ((p_container = ANY 
('{"MED BAG","MED BOX","MED PKG","MED PACK"}'::bpchar[])) AND (p_size <= 10) 
AND (p_size >= 1)))
 ->  BitmapOr
   ->  BitmapAnd
 ->  Bitmap Index Scan on part_p_size_idx
   Index C

[HACKERS] Re: Extended statistics is not working on Vars hidden under a RelabelType

2017-10-10 Thread Tomas Vondra
On 10/10/2017 05:03 AM, David Rowley wrote:
> Basically, $subject is causing us not to properly find matching
> extended stats in this case.
> 
> The attached patch fixes it.
> 
> The following test cases is an example of the misbehaviour. Note
> rows=1 vs rows=98 in the Gather node.
> 

Thanks for noticing this. The patch seems fine to me.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH : Generational memory allocator (was PATCH: two slab-like memory allocators)

2017-09-24 Thread Tomas Vondra
Hi,

Attached is an updated version of the patch, tweaking the comments.

1) I've added a section at the end of src/backend/utils/mmgr/README,
briefly explaining the alternative memory allocators we have. I don't
think we should get into too much low-level detail here, that's more
appropriate for the .c file for each context.

2) I've slightly reworded a paragraph in generation.c describing what
use cases are suitable for the memory context. It used to say:

   This memory context is based on the assumption that the allocated
   chunks have similar lifespan, i.e. that chunks allocated close from
   each other (by time) will also be freed in close proximity, and
   mostly in the same order. This is typical for various queue-like use
   cases, i.e. when tuples are constructed, processed and then thrown
   away.

and now it says:

   This memory context is based on the assumption that the chunks are
   freed roughly in the same order as they were allocated (FIFO), or in
   groups with similar lifespan (generations - hence the name of the
   context). This is typical for various queue-like use cases, i.e. when
   tuples are constructed, processed and then thrown away.
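
To illustrate the intended usage pattern, here is a minimal sketch based on
the API added by the patch (GenerationContextCreate and the memutils.h
constants come from the patch; the function, loop and tuple sizes are made
up for the example):

#include "postgres.h"
#include "utils/memutils.h"

/* Allocate short-lived tuples from a generation context; they are freed in
 * roughly the same order they were allocated, so whole blocks get recycled
 * cheaply without any per-chunk freelist. */
static void
process_tuples(Size *tuple_sizes, int ntuples)
{
    MemoryContext tup_context;
    int         i;

    tup_context = GenerationContextCreate(CurrentMemoryContext,
                                          "Tuples",
                                          SLAB_LARGE_BLOCK_SIZE);

    for (i = 0; i < ntuples; i++)
    {
        void   *tuple = MemoryContextAlloc(tup_context, tuple_sizes[i]);

        /* ... fill in and process the tuple ... */

        pfree(tuple);           /* freed in ~allocation order (FIFO) */
    }

    MemoryContextDelete(tup_context);
}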

3) I've also added a brief note into reorderbuffer.c mentioning that it
uses SlabContext and GenerationContext. As I explained, I don't think we
should include details about how we tested the patch or whatever here.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 25806c68a05287f3294f2d8508bd45599232f67b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@2ndquadrant.com>
Date: Sun, 24 Sep 2017 22:19:17 +0200
Subject: [PATCH] Generational memory allocator

This memory context is based on the assumption that the allocated chunks
have similar lifespan, i.e. that chunks allocated close from each other
(by time) will also be freed in close proximity, and mostly in the same
order. This is typical for various queue-like use cases, i.e. when
tuples are constructed, processed and then thrown away.

The memory context uses a very simple approach to free space management.
Instead of a complex global freelist, each block tracks a number
of allocated and freed chunks. The space released by freed chunks is not
reused, and once all chunks are freed (i.e. when nallocated == nfreed),
the whole block is thrown away. When the allocated chunks have similar
lifespan, this works very well and is extremely cheap.
---
 src/backend/replication/logical/reorderbuffer.c |  80 +--
 src/backend/utils/mmgr/Makefile |   2 +-
 src/backend/utils/mmgr/README   |  23 +
 src/backend/utils/mmgr/generation.c | 768 
 src/include/nodes/memnodes.h|   4 +-
 src/include/nodes/nodes.h   |   1 +
 src/include/replication/reorderbuffer.h |  15 +-
 src/include/utils/memutils.h|   5 +
 8 files changed, 819 insertions(+), 79 deletions(-)
 create mode 100644 src/backend/utils/mmgr/generation.c

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 0f607ba..dc0ad5b 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -43,6 +43,12 @@
  *	  transaction there will be no other data carrying records between a row's
  *	  toast chunks and the row data itself. See ReorderBufferToast* for
  *	  details.
+ *
+ *	  ReorderBuffer uses two special memory context types - SlabContext for
+ *	  allocations of fixed-length structures (changes and transactions), and
+ *	  GenerationContext for the variable-length transaction data (allocated
+ *	  and freed in groups with similar lifespan).
+ *
  * -
  */
 #include "postgres.h"
@@ -150,15 +156,6 @@ typedef struct ReorderBufferDiskChange
  */
 static const Size max_changes_in_memory = 4096;
 
-/*
- * We use a very simple form of a slab allocator for frequently allocated
- * objects, simply keeping a fixed number in a linked list when unused,
- * instead pfree()ing them. Without that in many workloads aset.c becomes a
- * major bottleneck, especially when spilling to disk while decoding batch
- * workloads.
- */
-static const Size max_cached_tuplebufs = 4096 * 2;	/* ~8MB */
-
 /* ---
  * primary reorderbuffer support routines
  * ---
@@ -248,6 +245,10 @@ ReorderBufferAllocate(void)
 			SLAB_DEFAULT_BLOCK_SIZE,
 			sizeof(ReorderBufferTXN));
 
+	buffer->tup_context = GenerationContextCreate(new_ctx,
+		   "Tuples",
+		   SLAB_LARGE_BLOCK_SIZE);
+
 	hash_ctl.keysize = sizeof(TransactionId);
 	hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt);
 	hash_ctl.hcxt = buffer->context;
@@ -258,15 +259,12 

Re: [HACKERS] VACUUM and ANALYZE disagreeing on what reltuples means

2017-09-24 Thread Tomas Vondra
Hi,

Apologies, I forgot to respond to the second part of your message.

On 09/06/2017 09:45 AM, Haribabu Kommi wrote:
>
> While testing this patch, I found another problem that is not related to
> this patch. When the vacuum command is executed mutiple times on
> a table with no dead rows, the number of reltuples value is slowly
> reducing.
> 
> postgres=# select reltuples, n_live_tup, n_dead_tup
>               from pg_stat_user_tables join pg_class using (relname)
>              where relname = 't';
>  reltuples | n_live_tup | n_dead_tup 
> ---++
>     899674 |     899674 |          0
> (1 row)
> 
> postgres=# vacuum t;
> VACUUM
> postgres=# select reltuples, n_live_tup, n_dead_tup
>               from pg_stat_user_tables join pg_class using (relname)
>              where relname = 't';
>  reltuples | n_live_tup | n_dead_tup 
> ---++
>     899622 |     899622 |          0
> (1 row)
> 
> postgres=# vacuum t;
> VACUUM
> postgres=# select reltuples, n_live_tup, n_dead_tup
>               from pg_stat_user_tables join pg_class using (relname)
>              where relname = 't';
>  reltuples | n_live_tup | n_dead_tup 
> ---++
>     899570 |     899570 |          0
> (1 row)
> 
> 
> In lazy_scan_heap() function, we force to scan the last page of the 
> relation to avoid the access exclusive lock in lazy_truncate_heap if
> there are tuples in the last page. Because of this reason, the 
> scanned_pages value will never be 0, so the vac_estimate_reltuples 
> function will estimate the tuples based on the number of tuples from
> the last page of the relation. This estimation is leading to reduce
> the number of retuples.
> 

Hmmm, that's annoying. Perhaps we should not update the values in
this case, then? I mean, if we only scan the last page, how reliable
are the derived values?

For the record - AFAICS this issue is unrelated to the patch
(i.e. it's not introduced by it).

> I am thinking whether this problem really happen in real world 
> scenarios to produce a fix?
> 

Not sure.

As each vacuum run decrements the value only a little bit, you'd have to
run vacuum many times to actually be bitten by it. For people
relying on autovacuum that won't happen, as it only runs on tables with
a certain number of dead tuples.

So you'd have to be running VACUUM in a loop from a script, or something
like that (but not VACUUM ANALYZE, because that works fine).

That being said, fixing a bug is always a good thing I guess.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] VACUUM and ANALYZE disagreeing on what reltuples means

2017-09-24 Thread Tomas Vondra


On 09/06/2017 09:45 AM, Haribabu Kommi wrote:
> 
> 
> On Tue, Jul 25, 2017 at 9:33 PM, Tomas Vondra
> <tomas.von...@2ndquadrant.com <mailto:tomas.von...@2ndquadrant.com>> wrote:
> 
> On 7/25/17 12:55 AM, Tom Lane wrote:
> 
> Tomas Vondra <tomas.von...@2ndquadrant.com
> <mailto:tomas.von...@2ndquadrant.com>> writes:
> 
> It seems to me that VACUUM and ANALYZE somewhat disagree on what
> exactly reltuples means. VACUUM seems to be thinking that
> reltuples
> = live + dead while ANALYZE apparently believes that reltuples =
> live
> 
> 
> The question is - which of the reltuples definitions is the
> right
> one? I've always assumed that "reltuples = live + dead" but
> perhaps
> not?
> 
> 
> I think the planner basically assumes that reltuples is the live
> tuple count, so maybe we'd better change VACUUM to get in step.
> 
> 
> Attached is a patch that (I think) does just that. The disagreement
> was caused by VACUUM treating recently dead tuples as live, while
> ANALYZE treats both of those as dead.
> 
> At first I was worried that this will negatively affect plans in the
> long-running transaction, as it will get underestimates (due to
> reltuples not including rows it can see). But that's a problem we
> already have anyway, you just need to run ANALYZE in the other session.
> 
> 
> Thanks for the patch.
> From the mail, I understand that this patch tries to improve the
> reltuples value update in the catalog table by the vacuum command
> to consider the proper visible tuples similar like analyze command.
> 
> -num_tuples);
> +num_tuples - nkeep);
> 
> With the above correction, there is a problem in reporting the number
> of live tuples to the stats.
> 
> postgres=# select reltuples, n_live_tup, n_dead_tup
>               from pg_stat_user_tables join pg_class using (relname)
>              where relname = 't';
>  reltuples | n_live_tup | n_dead_tup 
> ---++
>     899818 |     799636 |     100182
> (1 row)
> 
> 
> The live tuples data value is again decremented with dead tuples
> value before sending them to stats in function lazy_vacuum_rel(),
> 
> /* report results to the stats collector, too */
> new_live_tuples = new_rel_tuples - vacrelstats->new_dead_tuples;
> 
> The fix needs a correction here also. Or change the correction in 
> lazy_vacuum_rel() function itself before updating catalog table similar
> like stats.
> 

Ah, I hadn't noticed that for some reason - you're right, we estimate
reltuples based on (num_tuples - nkeep), so it doesn't make much sense
to subtract nkeep again. Attached is v2 of the fix.

I've removed the subtraction from lazy_vacuum_rel(), leaving just

new_live_tuples = new_rel_tuples;

and now it behaves as expected (no second subtraction). That means we
can get rid of new_live_tuples altogether (and the protection against
negative values), and use new_rel_tuples directly.

Which pretty much means that in this case

(pg_class.reltuples == pg_stat_user_tables.n_live_tup)

but I guess that's fine, based on the initial discussion in this thread.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 45b1859..1886f0d 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -198,7 +198,6 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
 	BlockNumber new_rel_pages;
 	double		new_rel_tuples;
 	BlockNumber new_rel_allvisible;
-	double		new_live_tuples;
 	TransactionId new_frozen_xid;
 	MultiXactId new_min_multi;
 
@@ -335,13 +334,9 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
 		false);
 
 	/* report results to the stats collector, too */
-	new_live_tuples = new_rel_tuples - vacrelstats->new_dead_tuples;
-	if (new_live_tuples < 0)
-		new_live_tuples = 0;	/* just in case */
-
 	pgstat_report_vacuum(RelationGetRelid(onerel),
 		 onerel->rd_rel->relisshared,
-		 new_live_tuples,
+		 new_rel_tuples,
 		 vacrelstats->new_dead_tuples);
 	pgstat_progress_end_command();
 
@@ -1267,7 +1262,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 	vacrelstats->new_rel_tuples = vac_estimate_reltuples(onerel, false,
 		 nblocks,
 		 vacrelstats->tupcount_pages,
-		 num_tuples);
+		 num_tuples - nkeep);
 
 	/*
 	 * Release any remaining pin on visibility map page.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] user-defined numeric data types triggering ERROR: unsupported type

2017-09-23 Thread Tomas Vondra
On 09/21/2017 04:24 PM, Tom Lane wrote:
> Tomas Vondra <tomas.von...@2ndquadrant.com> writes:
>> [ scalarineqsel may fall over when used by extension operators ]
> 
> I concur with your thought that we could have
> ineq_histogram_selectivity fall back to a "mid bucket" default if
> it's working with a datatype it is unable to convert_to_scalar. But I
> think if we're going to touch this at all, we ought to have higher
> ambition than that, and try to provide a mechanism whereby an
> extension that's willing to work a bit harder could get that
> additional increment of estimation accuracy. I don't care for this
> way to do that:
> 

The question is - do we need a solution that is back-patchable? That
means we can't really use FIXEDDECIMAL without effectively copying
a lot of the selfuncs.c stuff, only to make some minor fixes.

What about using two-pronged approach:

1) fall back to mid bucket in back branches (9.3 - 10)
2) do something smarter (along the lines you outlined) in PG11

I'm willing to spend some time on both, but (2) alone is not particularly
attractive for us, as it only helps PG11 and we'd have to do the
copy-paste evil anyway to get the data type working on back branches.

>> * Make convert_numeric_to_scalar smarter, so that it checks if
>> there is an implicit cast to numeric, and fail only if it does not
>> find one.
> 
> because it's expensive, and it only works for numeric-category cases,
> and it will fail outright for numbers outside the range of "double".
> (Notice that convert_numeric_to_scalar is *not* calling the regular
> cast function for numeric-to-double.) Moreover, any operator ought to
> know what types it can accept; we shouldn't have to do more catalog
> lookups to figure out what to do.
> 

Ah, I see. I hadn't realized it's not using the regular cast functions,
but now that you point that out it's clear relying on casts would fail.

> Now that scalarltsel and friends are just trivial wrappers for a
> common function, we could imagine exposing scalarineqsel_wrapper as a
> non-static function, with more arguments (and perhaps a better-chosen
> name ;-)). The idea would be for extensions that want to go this
> extra mile to provide their own selectivity estimation functions,
> which again would just be trivial wrappers for the core function, but
> would provide additional knowledge through additional arguments.
> 
> The additional arguments I'm envisioning are a couple of C function 
> pointers, one function that knows how to convert the operator's 
> left-hand input type to scalar, and one function that knows how to
> convert the right-hand type to scalar. (Identical APIs of course.) 
> Passing a NULL would imply that the core code must fall back on its 
> own devices for that input.
> 
> Now the thing about convert_to_scalar is that there are several
> different conversion conventions already (numeric, string, timestamp,
> ...), and there probably could be more once extension types are
> coming to the party. So I'm imagining that the API for these
> conversion functions could be something like
> 
>   bool convert(Datum value, Oid valuetypeid,
>int *conversion_convention, double *converted_value);
> 
> The conversion_convention output would make use of some agreed-on 
> constants, say 1 for numeric, 2 for string, etc etc. In the core 
> code, if either converter fails (returns false) or if they don't 
> return the same conversion_convention code, we give up and use the 
> mid-bucket default. If they both produce results with the same 
> conversion_convention, then we can treat the converted_values as 
> commensurable.
> 

OK, so instead of re-implementing the whole function, we'd essentially
do just something like this:

typedef bool (*convert_callback) (Datum value, Oid valuetypeid,
                                  int *conversion_convention,
                                  double *converted_value);

double
scalarineqsel(PlannerInfo *root, Oid operator,
              bool isgt, bool iseq, VariableStatData *vardata,
              Datum constval, Oid consttype,
              convert_callback convert);

and then, from the extension

double
scalarineqsel_wrapper(PlannerInfo *root, Oid operator,
                      bool isgt, bool iseq, VariableStatData *vardata,
                      Datum constval, Oid consttype)
{
    return scalarineqsel(root, operator, isgt, iseq, vardata,
                         constval, consttype, my_convert_func);
}
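
where my_convert_func might look roughly like this (a sketch only - the
scaled-int64 representation and the cached type OID are assumptions about
the fixeddecimal extension, not code taken from it; the value 1 is the
"numeric" conversion convention from the proposal above):

/* Hypothetical convert callback for fixeddecimal. */
static Oid fixeddecimal_type_oid = InvalidOid;  /* looked up / cached elsewhere */

static bool
my_convert_func(Datum value, Oid valuetypeid,
                int *conversion_convention, double *converted_value)
{
    if (valuetypeid != fixeddecimal_type_oid)
        return false;               /* let the core code use its fallback */

    *conversion_convention = 1;     /* same convention as numeric */
    *converted_value = (double) DatumGetInt64(value) / 100.0;

    return true;
}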

Sounds reasonable to me, I guess - I can't really think of anything
simpler giving us the same flexibility.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Boom filters for hash joins (was: A design for amcheck heapam verification)

2017-09-19 Thread Tomas Vondra
On 09/19/2017 06:03 PM, Peter Geoghegan wrote:
> On Tue, Sep 19, 2017 at 6:28 AM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> The patch is fairly simple, and did not try to push the bloom filters to
>> scan nodes or anything like that. It might be a meaningful first step,
>> though, particularly for selective joins (where only small number of
>> rows from the outer relation has a match in the hash table).
> 
> I believe that parallelism makes the use of Bloom filter a lot more 
> compelling, too. Obviously this is something that wasn't taken into 
> consideration in 2015.
> 

I haven't thought about it from that point of view. Can you elaborate
why that would be the case? Sorry if this was explained earlier in this
thread (I don't see it in the history, though).

I can't quite remember why I didn't pursue the patch in 2015, but it
was probably clear it wouldn't get in in the last CF, and I never got
back to it.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Boom filters for hash joins (was: A design for amcheck heapam verification)

2017-09-19 Thread Tomas Vondra
Hi,

On 09/19/2017 02:55 AM, Robert Haas wrote:
> On Mon, Sep 18, 2017 at 5:13 PM, Peter Geoghegan <p...@bowt.ie> wrote:
>> On Mon, Sep 18, 2017 at 2:07 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>> On Mon, Sep 18, 2017 at 1:29 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>>> Uh, why does the planner need to be involved at all?
>>>
>>> Because it loses if the Bloom filter fails to filter anything.  That's
>>> not at all far-fetched; consider SELECT * FROM a.x, b.x WHERE a.x =
>>> b.x given a foreign key on a.x referencing b(x).
>>
>> Wouldn't a merge join be a lot more likely in this case anyway? Low
>> selectivity hash joins with multiple batches are inherently slow; the
>> wasted overhead of using a bloom filter may not matter.
>>
>> Obviously this is all pretty speculative. I suspect that this could be
>> true, and it seems worth investigating that framing of the problem
>> first.
> 
> ISTR Tomas Vondra doing some experiments with this a few years ago and
> finding that it was, in fact, a problem.
> 

You seem to have better memory than me, but you're right - I did some
experiments with this in 2015; the WIP patch and discussion are here:
  https://www.postgresql.org/message-id/5670946e.8070...@2ndquadrant.com

The whole idea was that with a bloom filter we can reduce the number of
tuples (from the outer relation) written to batches.

The patch is fairly simple, and did not try to push the bloom filters to
scan nodes or anything like that. It might be a meaningful first step,
though, particularly for selective joins (where only a small number of
rows from the outer relation has a match in the hash table).
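
The filter itself is trivial - roughly something like this standalone sketch
(not the actual WIP code): the build side adds the hash of every inner tuple,
and the outer side probes the filter before writing a tuple to a batch file.

#include <stdbool.h>
#include <stdint.h>

#define BLOOM_WORDS   1024          /* 64 kbits, fixed size for the sketch */
#define BLOOM_HASHES  3

typedef struct BloomFilter
{
    uint64_t    bits[BLOOM_WORDS];  /* must be zero-initialized before use */
} BloomFilter;

static void
bloom_add(BloomFilter *bf, uint32_t hashvalue)
{
    int     i;

    for (i = 0; i < BLOOM_HASHES; i++)
    {
        /* derive the i-th position by simple re-mixing of the hash */
        uint32_t    h = hashvalue + (uint32_t) i * 0x9e3779b9u;
        uint32_t    bit = h % (BLOOM_WORDS * 64);

        bf->bits[bit / 64] |= UINT64_C(1) << (bit % 64);
    }
}

static bool
bloom_maybe_match(const BloomFilter *bf, uint32_t hashvalue)
{
    int     i;

    for (i = 0; i < BLOOM_HASHES; i++)
    {
        uint32_t    h = hashvalue + (uint32_t) i * 0x9e3779b9u;
        uint32_t    bit = h % (BLOOM_WORDS * 64);

        if ((bf->bits[bit / 64] & (UINT64_C(1) << (bit % 64))) == 0)
            return false;   /* definitely no match - skip the batch write */
    }

    return true;            /* possible match - the tuple goes to the batch */
}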

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] user-defined numeric data types triggering ERROR: unsupported type

2017-09-19 Thread Tomas Vondra
Hi,

while testing a custom data type FIXEDDECIMAL [1], implementing a
numeric-like data type with limited range, I ran into several issues
that I suspect may not be entirely intentional / expected behavior.

[1] https://github.com/2ndQuadrant/fixeddecimal

Attached is a minimal subset of the extension SQL definition, which may
be more convenient when looking into the issue.

The most important issue is that when planning a simple query, the
estimation of range queries on a column with the custom data type fails
like this:

test=# create table t (a fixeddecimal);
CREATE TABLE
test=# insert into t select random() from generate_series(1,10);
INSERT 0 10
test=# analyze t;
ANALYZE
test=# select * from t where a > 0.9;
ERROR:  unsupported type: 16385

The error message here comes from convert_numeric_to_scalar, which gets
called during histogram processing (ineq_histogram_selectivity) when
approximating the histogram. convert_to_scalar does this:

switch (valuetypeid)
{
  ...
  case NUMERICOID:
  ...
*scaledvalue = convert_numeric_to_scalar(value, valuetypid);
*scaledlobound = convert_numeric_to_scalar(lobound, boundstypid);
*scaledhibound = convert_numeric_to_scalar(hibound, boundstypid);
return true;

  ...
}

The first call works fine, as the constant really is numeric
(valuetypeid=1700). But the histogram boundaries are using the custom
data type, causing the error as convert_numeric_to_scalar expects only a
bunch of hard-coded data types. So it's pretty much guaranteed to fail
with any user-defined data type.

This seems a bit unfortunate :-(

One solution would be to implement a custom estimation function, replacing
scalarltsel/scalargtsel. But that seems rather unnecessary, especially
considering there is an implicit cast from fixeddecimal to numeric.
Another thing is that when there's just an MCV, the estimation works
just fine.

So I see two basic ways to fix this:

* Make convert_numeric_to_scalar smarter, so that it checks if there is
an implicit cast to numeric, and fail only if it does not find one.

* Make convert_to_scalar smarter, so that it returns false for
unexpected data types, letting ineq_histogram_selectivity use the
default estimate of 0.5 (for that one bucket).

Both options seem preferable to what's happening currently. Is
there anything I missed that makes those fixes unacceptable?
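
To make the second option a bit more concrete, the check could be as simple
as this sketch (not a patch; the list of types should mirror whatever
convert_numeric_to_scalar actually handles):

/* Return true only for the built-in types convert_numeric_to_scalar knows
 * how to handle; convert_to_scalar would return false for anything else,
 * so ineq_histogram_selectivity falls back to the 0.5 per-bucket default
 * instead of failing with "unsupported type". */
static bool
numeric_scalar_type_supported(Oid typid)
{
    switch (typid)
    {
        case BOOLOID:
        case INT2OID:
        case INT4OID:
        case INT8OID:
        case FLOAT4OID:
        case FLOAT8OID:
        case NUMERICOID:
        case OIDOID:
        case REGPROCOID:
        case REGPROCEDUREOID:
        case REGOPEROID:
        case REGOPERATOROID:
        case REGCLASSOID:
        case REGTYPEOID:
            return true;

        default:
            return false;
    }
}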

If anything, the fact that MCV estimates work while the histogram does not
makes this somewhat unpredictable - a change in the data distribution
(or perhaps even just a different sample in ANALYZE) may result in
sudden failures.


I ran into one additional strange thing while investigating this. The
attached SQL script defines two operator classes - fixeddecimal_ops and
fixeddecimal_numeric_ops, defining (fixeddecimal,fixeddecimal) and
(fixeddecimal,numeric) operators. Dropping one of those operator classes
changes the estimates in somewhat suspicious ways.

When only fixeddecimal_ops is defined, we get this:

test=# explain select * from t where a > 0.1;
   QUERY PLAN

 Seq Scan on t  (cost=0.00..1943.00 rows=3 width=8)
   Filter: ((a)::numeric > 0.1)
(2 rows)

That is, we get the default estimate for inequality clauses, 33%. But
when only fixeddecimal_numeric_ops, we get this:

test=# explain select * from t where a > 0.1;
   QUERY PLAN

 Seq Scan on t  (cost=0.00..1693.00 rows=5 width=8)
   Filter: (a > 0.1)
(2 rows)

That is, we get a 50% estimate, because that's what scalarineqsel uses
when ineq_histogram_selectivity can't compute the selectivity from a
histogram for some reason.

Wouldn't it make more sense to use the default 33% estimate here?


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


fixeddecimal-minimal.sql
Description: application/sql

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] valgrind vs. shared typmod registry

2017-09-16 Thread Tomas Vondra
Hi,

I've been running some regression tests under valgrind, and it seems
select_parallel triggers some uses of uninitialized values in dshash. If
I'm reading the reports right, it complains about hashtable->size_log2
not being initialized in ensure_valid_bucket_pointers.

I ran tests under valgrind not too long ago and don't recall such
failures, so perhaps something broke this in the past few days.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
==18897== Conditional jump or move depends on uninitialised value(s)
==18897==at 0x6C0B4F: ensure_valid_bucket_pointers (dshash.c:752)
==18897==by 0x6C0123: dshash_find (dshash.c:390)
==18897==by 0x971224: find_or_make_matching_shared_tupledesc (typcache.c:2265)
==18897==by 0x9704E7: assign_record_type_typmod (typcache.c:1602)
==18897==by 0x68879E: BlessTupleDesc (execTuples.c:1036)
==18897==by 0x6721C8: ExecInitExprRec (execExpr.c:1512)
==18897==by 0x671E62: ExecInitExprRec (execExpr.c:1385)
==18897==by 0x66FC87: ExecInitExpr (execExpr.c:130)
==18897==by 0x15AB9E92: exec_eval_simple_expr (pl_exec.c:5584)
==18897==by 0x15AB95E9: exec_eval_expr (pl_exec.c:5202)
==18897==by 0x15AB4388: exec_stmt_return (pl_exec.c:2755)
==18897==by 0x15AB2563: exec_stmt (pl_exec.c:1606)
==18897==by 0x15AB2306: exec_stmts (pl_exec.c:1521)
==18897==by 0x15AB21B1: exec_stmt_block (pl_exec.c:1459)
==18897==by 0x15AB046B: plpgsql_exec_function (pl_exec.c:464)
==18897==by 0x15AAACCD: plpgsql_call_handler (pl_handler.c:258)
==18897==by 0x674C7E: ExecInterpExpr (execExprInterp.c:650)
==18897==by 0x6AA6A0: ExecEvalExprSwitchContext (executor.h:309)
==18897==by 0x6AA709: ExecProject (executor.h:343)
==18897==by 0x6AA878: ExecResult (nodeResult.c:136)
==18897==  Uninitialised value was created by a heap allocation
==18897==at 0x9A837E: palloc (mcxt.c:871)
==18897==by 0x6BFEEA: dshash_attach (dshash.c:269)
==18897==by 0x9707F7: SharedRecordTypmodRegistryAttach (typcache.c:1778)
==18897==by 0x484826: AttachSession (session.c:183)
==18897==by 0x504225: ParallelWorkerMain (parallel.c:1100)
==18897==by 0x7820B5: StartBackgroundWorker (bgworker.c:835)
==18897==by 0x793F9E: do_start_bgworker (postmaster.c:5686)
==18897==by 0x7942D7: maybe_start_bgworkers (postmaster.c:5890)
==18897==by 0x793399: sigusr1_handler (postmaster.c:5079)
==18897==by 0x4E4B5CF: ??? (in /usr/lib64/libpthread-2.24.so)
==18897==by 0x586AC72: __select_nocancel (in /usr/lib64/libc-2.24.so)
==18897==by 0x78F220: ServerLoop (postmaster.c:1720)
==18897==by 0x78E9C5: PostmasterMain (postmaster.c:1364)
==18897==by 0x6D5714: main (main.c:228)
==18897== 
{
   
   Memcheck:Cond
   fun:ensure_valid_bucket_pointers
   fun:dshash_find
   fun:find_or_make_matching_shared_tupledesc
   fun:assign_record_type_typmod
   fun:BlessTupleDesc
   fun:ExecInitExprRec
   fun:ExecInitExprRec
   fun:ExecInitExpr
   fun:exec_eval_simple_expr
   fun:exec_eval_expr
   fun:exec_stmt_return
   fun:exec_stmt
   fun:exec_stmts
   fun:exec_stmt_block
   fun:plpgsql_exec_function
   fun:plpgsql_call_handler
   fun:ExecInterpExpr
   fun:ExecEvalExprSwitchContext
   fun:ExecProject
   fun:ExecResult
}
==18897== Conditional jump or move depends on uninitialised value(s)
==18897==at 0x6C0B4F: ensure_valid_bucket_pointers (dshash.c:752)
==18897==by 0x6C024E: dshash_find_or_insert (dshash.c:440)
==18897==by 0x971310: find_or_make_matching_shared_tupledesc (typcache.c:2294)
==18897==by 0x9704E7: assign_record_type_typmod (typcache.c:1602)
==18897==by 0x68879E: BlessTupleDesc (execTuples.c:1036)
==18897==by 0x6721C8: ExecInitExprRec (execExpr.c:1512)
==18897==by 0x671E62: ExecInitExprRec (execExpr.c:1385)
==18897==by 0x66FC87: ExecInitExpr (execExpr.c:130)
==18897==by 0x15AB9E92: exec_eval_simple_expr (pl_exec.c:5584)
==18897==by 0x15AB95E9: exec_eval_expr (pl_exec.c:5202)
==18897==by 0x15AB4388: exec_stmt_return (pl_exec.c:2755)
==18897==by 0x15AB2563: exec_stmt (pl_exec.c:1606)
==18897==by 0x15AB2306: exec_stmts (pl_exec.c:1521)
==18897==by 0x15AB21B1: exec_stmt_block (pl_exec.c:1459)
==18897==by 0x15AB046B: plpgsql_exec_function (pl_exec.c:464)
==18897==by 0x15AAACCD: plpgsql_call_handler (pl_handler.c:258)
==18897==by 0x674C7E: ExecInterpExpr (execExprInterp.c:650)
==18897==by 0x6AA6A0: ExecEvalExprSwitchContext (executor.h:309)
==18897==by 0x6AA709: ExecProject (executor.h:343)
==18897==by 0x6AA878: ExecResult (nodeResult.c:136)
==18897==  Uninitialised value was created by a heap allocation
==18897==at 0x9A837E: palloc (mcxt.c:871)
==18897==by 0x6BFEEA: dshash_attach (dshash.c:269)
==18897==by 0x970820: SharedRecordTypmodRegistryAttach (typcache.c:1782)
==18897==by 0x484826: AttachSession (session.c:183)

Re: [HACKERS] PATCH : Generational memory allocator (was PATCH: two slab-like memory allocators)

2017-09-14 Thread Tomas Vondra
On 09/14/2017 04:21 PM, Simon Riggs wrote:
> On 14 August 2017 at 01:35, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
>> Hi,
>>
>> Attached is a rebased version of the Generational context, originally
>> submitted with SlabContext (which was already committed into Pg 10).
>>
>> The main change is that I've abandoned the pattern of defining a Data
>> structure and then a pointer typedef, i.e.
>>
>> typedef struct GenerationContextData { ... } GenerationContextData;
>> typedef struct GenerationContextData *GenerationContext;
>>
>> Now it's just
>>
>> typedef struct GenerationContext { ... } GenerationContext;
>>
>> mostly because SlabContext was committed like that, and because Andres was
>> complaining about this code pattern ;-)
>>
>> Otherwise the design is the same as repeatedly discussed before.
>>
>> To show that this is still valuable change (even after SlabContext and
>> adding doubly-linked list to AllocSet), I've repeated the test done by
>> Andres in [1] using the test case described in [2], that is
>>
>>   -- generate data
>>   SELECT COUNT(*) FROM (SELECT test1()
>>   FROM generate_series(1, 5)) foo;
>>
>>   -- benchmark (measure time and VmPeak)
>>   SELECT COUNT(*) FROM (SELECT *
>>   FROM pg_logical_slot_get_changes('test', NULL,
>> NULL, 'include-xids', '0')) foo;
>>
>> with different values passed to the first step (instead of the 5). The
>> VmPeak numbers look like this:
>>
>>  N   masterpatched
>> --
>> 10   1155220 kB  361604 kB
>> 20   2020668 kB  434060 kB
>> 30   2890236 kB  502452 kB
>> 40   3751592 kB  570816 kB
>> 50   4621124 kB  639168 kB
>>
>> and the timing (on assert-enabled build):
>>
>>  N   masterpatched
>> --
>> 10  1103.182 ms 412.734 ms
>> 20  2216.711 ms 820.438 ms
>> 30  3320.095 ms1223.576 ms
>> 40  4584.919 ms1621.261 ms
>> 50  5590.444 ms2113.820 ms
>>
>> So it seems it's still a significant improvement, both in terms of memory
>> usage and timing. Admittedly, this is a single test, so ideas of other
>> useful test cases are welcome.
> 
> This all looks good.
> 
> What I think this needs is changes to
>src/backend/utils/mmgr/README
> which decribe the various options that now exist (normal?, slab) and
> will exist (generational)
> 
> Don't really care about the name, as long as its clear when to use it
> and when not to use it.
> 
> This description of generational seems wrong: "When the allocated
> chunks have similar lifespan, this works very well and is extremely
> cheap."
> They don't need the same lifespan they just need to be freed in groups
> and in the order they were allocated.
> 

Agreed, it should be described differently. What matters is the (mostly)
FIFO pattern of the palloc/pfree requests, which is what allows us to
release the memory.
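
As a rough illustration of why that's cheap, the per-block accounting boils
down to something like this (simplified sketch; the real structs and logic
are in generation.c from the patch):

#include <stdlib.h>

/* Freed space is never reused; a block merely counts how many of its chunks
 * have been pfree'd and is released as a whole once that matches the number
 * of chunks allocated from it. */
typedef struct GenerationBlockSketch
{
    int     nchunks;        /* chunks allocated from this block */
    int     nfreed;         /* chunks already pfree'd */
    char   *freeptr;        /* next allocation position within the block */
    char   *endptr;         /* end of the block */
} GenerationBlockSketch;

static void
sketch_free_chunk(GenerationBlockSketch *block)
{
    block->nfreed++;

    /* no freelist at all - with FIFO-like frees this fires regularly */
    if (block->nfreed == block->nchunks)
        free(block);        /* the struct sits at the start of the block */
}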

>
> For this patch specifically, we need additional comments in 
> reorderbuffer.c to describe the memory allocation pattern in that 
> module so that it is clear that the choice of allocator is useful
> and appropriate, possibly with details of how that testing was
> performed, so it can be re-tested later or tested on a variety of
> platforms.
> 

Including details about the testing in reorderbuffer.c does not seem
very attractive to me. I don't recall any other place describing the
tests in such detail. Why not discuss the details here, and then include a
link to this thread in the commit message?

In any case, I doubt anyone will repeat the testing on a variety of
platforms (and I don't have any such comprehensive test suite anyway).
What will likely happen is someone bumping into a poorly performing
corner case, which we will then analyze and fix as usual.

>
> Particularly in reorderbuffer, surely we will almost immediately
> reuse chunks again, so is it worth issuing free() and then malloc()
> again soon after? Does that cause additional overhead we could also
> avoid? Could we possibly keep the last/few free'd chunks around
> rather than re-malloc?
> 

I haven't seen anything like that in the tests I've done. The malloc/free
overhead is negligible, as our allocators significantly reduce the
number of calls to those functions. If we happen to run into such a case,
it shouldn't be difficult to keep a few empty blocks around. But perhaps
we can leave that as a future optimization.

>
> Seems like we should commit this soon.
> 

Thanks.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hooks to track changed pages for backup purposes

2017-09-13 Thread Tomas Vondra


On 09/13/2017 07:53 AM, Andrey Borodin wrote:
>> * I see there are conditions like this:
>>
>>    if(xlogreader->blocks[nblock].forknum == MAIN_FORKNUM)
>>
>> Why is it enough to restrict the block-tracking code to main fork?
>> Aren't we interested in all relation forks?
> fsm, vm and others are small enough to take them 
> 

That seems like an optimization specific to your backup solution, not
necessarily applicable to others and/or to other possible use cases.

>> I guess you'll have to explain
>> what the implementation of the hooks is supposed to do, and why these
>> locations for hook calls are the right ones. It's damn impossible to
>> validate the patch without that information.
>>
>> Assuming you still plan to use the hook approach ...
> Yes, I still think hooking is good idea, but you are right - I need
> prototype first. I'll mark patch as Returned with feedback before
> prototype implementation.
> 

OK

>>
>>>> There
>>>> are no arguments fed to this hook, so modules would not be able to
>>>> analyze things in this context, except shared memory and process
>>>> state?
>>>
>>>>
>>>> Those hooks are put in hot code paths, and could impact performance of
>>>> WAL insertion itself.
>>> I do not think sending few bytes to cached array is comparable to disk
>> write of XLog record. Checking the func ptr is even cheaper with correct
>> branch prediction.
>>>
>>
>> That seems somewhat suspicious, for two reasons. Firstly, I believe we
>> only insert the XLOG records into WAL buffer here, so why should there
>> be any disk write related? Or do you mean the final commit?
> Yes, I mean finally we will be waiting for disk. Hundred empty ptr
> checks are neglectable in comparision with disk.

Aren't we doing these calls while holding XLog locks? IIRC there was
quite a significant performance improvement after Heikki reduced the
amount of code executed while holding the locks.

>>
>> But more importantly, doesn't this kind of information require some
>> durability guarantees? I mean, if it gets lost during server crashes or
>> restarts, doesn't that mean the incremental backups might miss some
>> buffers? I'd guess the hooks will have to do some sort of I/O, to
>> achieve that, no?
> We need durability only on the level of one segment. If we do not have
> info from segment we can just rescan it.
> If we send a segment to S3 as one file, we are sure of its integrity. But
> this IO can be async.
> 
> PTRACK in its turn switches bits in the fork's buffers, which are written
> by the checkpointer and... well... recovered during recovery. By the usual
> WAL replay of recovery.
> 

But how do you do that from the hooks, if they only store the data into
a buffer in memory? Let's say you insert ~8MB of WAL into a segment, and
then the system crashes and reboots. How do you know you have incomplete
information from the WAL segment?

Although, that's probably what wal_switch_hook() might do - sync the
data whenever the WAL segment is switched. Right?
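
To sketch what I have in mind (entirely hypothetical - the hook
signature, the tracked_blocks array and the file naming are mine, not
taken from the patch), the hook could simply flush and fsync whatever
was collected for the finished segment:

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

/* hypothetical per-segment list of blocks touched by WAL records */
typedef struct TrackedBlock
{
    uint32_t    spcnode;    /* tablespace */
    uint32_t    dbnode;     /* database */
    uint32_t    relnode;    /* relation */
    uint32_t    blkno;      /* block number */
} TrackedBlock;

static TrackedBlock tracked_blocks[8192];
static int  ntracked = 0;

/* hypothetical hook body, called when a WAL segment is switched */
static void
my_wal_switch_hook(const char *finished_segment)
{
    char    path[1024];
    FILE   *f;

    snprintf(path, sizeof(path), "%s.blockmap", finished_segment);

    f = fopen(path, "wb");
    if (f == NULL)
        return;             /* real code would report the error */

    fwrite(tracked_blocks, sizeof(TrackedBlock), ntracked, f);
    fflush(f);
    fsync(fileno(f));       /* this is the durability part */
    fclose(f);

    ntracked = 0;           /* start collecting for the next segment */
}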

> 
>> From this POV, the idea to collect this information on the backup system
>> (WAL archive) by pre-processing the arriving WAL segments seems like the
>> most promising. It moves the work to another system, the backup system
>> can make it as durable as the WAL segments, etc.
> 
> Well, in some not so rare cases users encrypt backups and send to S3.
> And there is no system with CPUs that can handle that WAL parsing.
> Currently, I'm considering mocking up a prototype for wal-g, which works
> exactly like this.
> 

Why couldn't there be a system with enough CPU power? Sure, if you want
to do this, you'll need a more powerful system, but regular CPUs can do
more than 1 GB/s of AES-256-GCM thanks to AES-NI. Or you could do it on
the database server as part of archive_command, before the encryption,
of course.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patches that don't apply or don't compile: 2017-09-12

2017-09-13 Thread Tomas Vondra
Hi Aleksander,

On 09/13/2017 11:49 AM, Aleksander Alekseev wrote:
> Hi Tomas,
> 
> I appreciate your feedback, although it doesn't seem to be completely
> fair. Particularly:
> 
>> You gave everyone about 4 hours to object
> 
> This is not quite accurate since my proposal was sent 2017-09-11
> 09:41:32 and this thread started - 2017-09-12 14:14:55.
> 

Understood. I didn't really consider the first message to be a proposal
with a deadline, as it starts with "here's a crazy idea" and it's not
immediately clear that you intend to proceed with it right away, or
that you expect people to object.

The message I referenced is much clearer on this.

>> You just changed the status of 10-15% open patches. I'd expect
>> things like this to be consulted with the CF manager, yet I don't see
>> any comments from Daniel.
> 
> Agree, that was clearly a mistake, I had to add Daniel to CC. Sorry I
> didn't do that. I've returned all affected patches back to "Needs
> Review". On the bright side while doing this I've noticed that many
> patches were already updated by their authors.
> 

Yeah.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patches that don't apply or don't compile: 2017-09-12

2017-09-12 Thread Tomas Vondra
On 09/12/2017 04:14 PM, Aleksander Alekseev wrote:
> Hello, hackers!
> 
> Thanks to the work of Thomas Munro now we have a CI for the patches on
> the commitfest [1]. Naturally there is still room for improvement, but
> in any case it's much, much better than nothing.
> 
> After a short discussion [2] we agreed (or at least no one objected)
> to determine the patches that don't apply / don't compile / don't
> pass regression tests and have "Needs Review" status, change the
> status of these patches to "Waiting on Author" and write a report
> (this one) with a CC to the authors. As all we know, we are short on
> reviewers and this action will save them a lot of time. Here [3] you
> can find a script I've been using to find such patches.
> 
> I rechecked the list manually and did my best to exclude the patches 
> that were updated recently or that depend on other patches. However 
> there is still a chance that your patch got to the list by mistake.
> In this case please let me know.

With all due respect, it's hard not to see this as a disruption of the
current CF. I agree automating the patch processing is a worthwhile
goal, but we're not there yet and it seems somewhat premature.

Let me explain why I think so:

(1) You just changed the status of 10-15% open patches. I'd expect
things like this to be consulted with the CF manager, yet I don't see
any comments from Daniel. Considering he's been at the Oslo PUG meetup
today, I doubt he was watching hackers very closely.

(2) You gave everyone about 4 hours to object, ending 3PM UTC, which
excludes about one whole hemisphere where it's either too early or too
late for people to respond. I'd say waiting for >24 hours would be more
appropriate.

(3) The claim that "no one objected" is somewhat misleading, I guess. I
myself objected to automating this yesterday, and AFAICS Thomas Munro
shares the opinion that we're not ready to automate it.

(4) You say you rechecked the list manually - can you elaborate on what
you checked? Per reports from others, some patches seem to "not apply"
merely because "git apply" is quite strict. Have you actually tried
applying / compiling the patches yourself?

(5) I doubt "does not apply" is actually sufficient to move the patch to
"waiting on author". For example my statistics patch was failing to
apply merely due to 821fb8cdbf lightly touching the SGML docs, changing
"type" to "kind" on a few places. Does that mean the patch can't get any
reviews until the author fixes it? Hardly. But after switching it to
"waiting on author" that's exactly what's going to happen, as people are
mostly ignoring patches in that state.

(6) It's generally a good idea to send a message to the patch thread
whenever the status is changed, otherwise the patch authors may not
notice the change for a long time. I don't see any such messages,
certainly not in "my" patch thread.

I object to changing the patch status merely based on the script output.
It's a nice goal, but we need to do the legwork first, otherwise it'll
be just annoying and disruptive.

I suggest we inspect the reported patches manually, investigate whether
the failures are legitimate or not, and eliminate as many false
positives as possible. Once we are happy with the accuracy, we can
enable it again.


kind regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2017-09-12 Thread Tomas Vondra
Attached is an updated version of the patch, dealing with fallout of
821fb8cdbf700a8aadbe12d5b46ca4e61be5a8a8 which touched the SGML
documentation for CREATE STATISTICS.

regards

On 09/07/2017 10:07 PM, Tomas Vondra wrote:
> Hi,
> 
> Attached is an updated version of the patch, fixing the issues reported
> by Adrien Nayrat, and also a bunch of issues pointed out by valgrind.
> 
> regards
> 

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


0001-multivariate-MCV-lists.patch.gz
Description: application/gzip


0002-multivariate-histograms.patch.gz
Description: application/gzip

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] The case for removing replacement selection sort

2017-09-11 Thread Tomas Vondra
On 09/11/2017 05:32 PM, Robert Haas wrote:
> On Sun, Sep 10, 2017 at 9:39 PM, Peter Geoghegan <p...@bowt.ie> wrote:
>> To be clear, you'll still need to set replacement_sort_tuples high
>> when testing RS, to make sure that we really use it for at least the
>> first run when we're expected to. (There is no easy way to have
>> testing mechanically verify that we really do only have one run in the
>> end with RS, but I assume that such paranoia is unneeded.)
> 
> I seem to recall that raising replacement_sort_tuples makes
> replacement selection look worse in some cases -- the optimization is
> more likely to apply, sure, but the heap is also bigger, which hurts.
> 

The question is what the optimal replacement_sort_tuples value is. I
assume it's the number of tuples that fits into the CPU caches, at
least that's what our docs suggest. So I think you're right that raising
it to 1B rows may break this assumption and make it perform worse.

But perhaps the fact that we're testing with multiple work_mem values,
and with smaller data sets (100k or 1M rows) makes this a non-issue?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Automatic testing of patches in commit fest

2017-09-11 Thread Tomas Vondra


On 09/11/2017 03:01 PM, Aleksander Alekseev wrote:
> Hi Tomas,
> 
>>> Unless there are any objections to give this idea a try I'm willing to
>>> write and host a corresponding script.
>>>
>> That won't work until (2) is reliable enough. There are patches
>> (for example my "multivariate MCV lists and histograms") which
>> fails to apply only because the tool picks the wrong patch.
>> Possibly because it does not recognize compressed patches, or
>> something.
> 
> Agree. However we could simply add an "Enable autotests" checkbox to
> the commitfest application. In fact we could even left the
> commitfest application as is and provide corresponding checkboxes in
> the web interface of the "autoreviewer" application. Naturally every 
> automatically generated code review will include a link that
> disables autotests for this particular commitfest entry.
> 

I think we should try making it as reliable as possible first, and only
then consider adding some manual on/off switches. Otherwise the patches
may randomly flap between "OK" and "failed" after each new submission,
disrupting the CF process.

Making the testing reliable may even require establishing some sort of
policy (say, always send a complete patch, not just one piece).

>
> I hope this observation will change your mind :)
> 

Not sure. But I'm mostly just a passenger here, not the driver.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove 1MB size limit in tsvector

2017-09-11 Thread Tomas Vondra
On 09/11/2017 01:54 PM, Robert Haas wrote:
> On Mon, Sep 11, 2017 at 5:33 AM, Ildus Kurbangaliev
> <i.kurbangal...@postgrespro.ru> wrote:
>> Moreover, RUM index
>> stores positions + lexemes, so it doesn't need tsvectors for ranked
>> search. As a result, tsvector becomes a storage for
>> building indexes (indexable type), not something that should be used at
>> runtime. And the change of the format doesn't affect index creation
>> time.
> 
> RUM indexes, though, are not in core.
> 

Yeah, but I think Ildus has a point that this should not really matter
for indexed tsvectors. So the question is how realistic that benchmark
actually is. How likely are we to query the tsvectors directly, not
through a GIN/GiST index? Particularly in performance-sensitive cases?


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Automatic testing of patches in commit fest

2017-09-11 Thread Tomas Vondra

On 09/11/2017 11:41 AM, Aleksander Alekseev wrote:
> Hi Thomas,
> 
> Great job!
> 

+1

> Here is a crazy idea. What if we write a script that would automatically
> return the patches that:
> 
> 1) Are not in "Waiting on Author" status
> 2) Don't apply OR don't pass `make installcheck-world`
> 
> ... to the "Waiting on Author" status and describe the problem through
> the "Add review" form on commitfest.postgresql.org? This will save a lot
> of time to the reviewers. Naturally nobody wants to spam pgsql-hackers@
> with automatic messages to often. I believe that sending such messages
> once a day would be enough.
> 
> Unless there are any objections to give this idea a try I'm willing to
> write and host a corresponding script.
> 

That won't work until (2) is reliable enough. There are patches (for
example my "multivariate MCV lists and histograms") which fail to apply
only because the tool picks the wrong patch. Possibly because it does
not recognize compressed patches, or something.

In such cases switching it to "Waiting on Author" automatically would be
damaging, as (a) there's nothing wrong with the patch, and (b) it's not
clear what to do to fix the problem.

So -1 to this for now, until we make this part smart enough.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Polyphase merge is obsolete

2017-09-11 Thread Tomas Vondra
Hi,

I planned to do some benchmarking on this patch, but apparently the
patch no longer applies. Rebase please?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] The case for removing replacement selection sort

2017-09-10 Thread Tomas Vondra
On 09/11/2017 02:22 AM, Peter Geoghegan wrote:
> On Sun, Sep 10, 2017 at 5:07 PM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> I'm currently re-running the benchmarks we did in 2016 for 9.6, but
>> those are all sorts with a single column (see the attached script). But
>> it'd be good to add a few queries testing sorts with multiple keys. We
>> can either tweak some of the existing data sets + queries, or come up
>> with entirely new tests.
> 
> I see that work_mem is set like this in the script:
> 
> "for wm in '1MB' '8MB' '32MB' '128MB' '512MB' '1GB'; do"
> 
> I suggest that we forget about values over 32MB, since the question of
> how well quicksort does there was settled by your tests in 2016. I
> would also add '4MB' to the list of wm values that you'll actually
> test.

OK, so 1MB, 4MB, 8MB, 32MB?

> 
> Any case with input that is initially in random order or DESC sort 
> order is not interesting, either. I suggest you remove those, too.
> 

OK.

> I think we're only interested in benchmarks where replacement
> selection really does get its putative best case (no merge needed in
> the end). Any (almost) sorted cases (the only cases that you are
> interesting to test now) will always manage that, once you set
> replacement_sort_tuples high enough, and provided there isn't even a
> single tuple that is completely out of order. The "before" cases here
> should have a replacement_sort_tuples of 1 billion (so that we're sure
> to not have the limit prevent the use of replacement selection in the
> first place), versus the "after" cases, which should have a
> replacement_sort_tuples of 0 to represent my proposal (to represent
> performance in a world where replacement selection is totally
> removed).
> 

Ah, so you suggest doing all the tests on current master, by only
tweaking the replacement_sort_tuples value? I've been testing master vs.
your patch, but I guess setting replacement_sort_tuples=0 should have
the same effect.

I probably won't eliminate the random/DESC data sets, though. At least
not from the two smaller data sets - I want to do a bit of benchmarking
on Heikki's polyphase merge removal patch, and for that patch those data
sets are still relevant. Also, it's useful to have a subset of results
where we know we don't expect any change.

>> For the existing queries, I should have some initial results
>> tomorrow, at least for the data sets with 100k and 1M rows. The
>> tests with 10M rows will take much more time (it takes 1-2hours for
>> a single work_mem value, and we're testing 6 of them).
> 
> I myself don't see that much value in a 10M row test.
> 

Meh, more data is probably better. And with the reduced work_mem values
and skipping of random/DESC data sets it should complete much faster.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] The case for removing replacement selection sort

2017-09-10 Thread Tomas Vondra

On 09/11/2017 01:03 AM, Peter Geoghegan wrote:
> On Sat, Sep 9, 2017 at 8:28 AM, Greg Stark <st...@mit.edu> wrote:
>> This may be a bit "how long is a piece of string" but how do those two
>> compare with string sorting in an interesting encoding/locale -- say
>> /usr/share/dict/polish in pl_PL for example. It's certainly true that
>> people do sort text as well as numbers.
> 
> I haven't looked at text, because I presume that it's very rare for 
> tuples within a table to be more or less ordered by a text
> attribute. While the traditional big advantage of replacement
> selection has always been its ability to double initial run length on
> average, where text performance is quite interesting because
> localized clustering still happens, that doesn't really seem relevant
> here. The remaining use of replacement selection is expressly only
> about the "single run, no merge" best case, and in particular about
> avoiding regressions when upgrading from versions prior to 9.6 (9.6
> is the version where we began to generally prefer quicksort).
> 
>> Also, people often sort on keys of more than one column
> 
> Very true. I should test this.
> 

I'm currently re-running the benchmarks we did in 2016 for 9.6, but
those are all sorts with a single column (see the attached script). But
it'd be good to add a few queries testing sorts with multiple keys. We
can either tweak some of the existing data sets + queries, or come up
with entirely new tests.

The current tests construct data sets with different features (unique or
low/high-cardinality, random/sorted/almost-sorted). How should that work
for multi-key sorts? Same features for all columns, or some mix?

For the existing queries, I should have some initial results tomorrow,
at least for the data sets with 100k and 1M rows. The tests with 10M
rows will take much more time (it takes 1-2hours for a single work_mem
value, and we're testing 6 of them).

But perhaps we can reduce the number of tests somehow, only testing the
largest/smallest work_mem values? So that we could get some numbers now,
possibly running the complete test suite later.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


sort-bench.sh
Description: application/shellscript

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Separate log file for extension

2017-09-09 Thread Tomas Vondra


On 08/28/2017 11:23 AM, Antonin Houska wrote:
> Craig Ringer <cr...@2ndquadrant.com> wrote:
> 
>> On 25 August 2017 at 15:12, Antonin Houska <a...@cybertec.at> wrote:
>>
>> How will this play with syslog? csvlog? etc?
> 
> There's nothing special about csvlog: the LogStream structure has a
> "destination" field, so if particular extension wants this kind of output, it
> simply sets the LOG_DESTINATION_CSVLOG bit here.
> 

I assume Craig's point was more about CSVLOG requiring logging_collector=on.
So what will happen if PostgreSQL is started without the collector,
and an extension attempts to use LOG_DESTINATION_CSVLOG? Or will it
start a separate collector for each such extension?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] segment size depending *_wal_size defaults (was increasing the default WAL segment size)

2017-09-09 Thread Tomas Vondra


On 08/30/2017 03:16 AM, Andres Freund wrote:
> On 2017-08-30 10:14:22 +0900, Michael Paquier wrote:
>> On Wed, Aug 30, 2017 at 10:06 AM, Andres Freund <and...@anarazel.de> wrote:
>>> On 2017-08-30 09:49:14 +0900, Michael Paquier wrote:
>>>> Do you think that we should worry about wal segment sizes higher than
>>>> 2GB? Support for int64 GUCs is not here yet.
>>>
>>> 1GB will be the limit anyway.
>>
>> Yeah, but imagine that we'd want to raise that even more up.
> 
> I'm doubtfull of that. But even if, it'd not be hard to GUC support.
> 

It's not hard - it's just a lot of copy-pasting of infrastructure code.
Incidentally, I already have a patch doing that, as we had to add that
support to XL, and I can submit it to PostgreSQL. Might save some boring
coding.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql: new help related to variables are not too readable

2017-09-09 Thread Tomas Vondra
On 09/09/2017 01:24 AM, Tom Lane wrote:
> Tomas Vondra <tomas.von...@2ndquadrant.com> writes:
>> The translator has exactly the same context in both cases, and the
>> struggle to wrap it at 80 characters will be pretty much the same too.
> 
> Really?  With the old way, you had something under 60 characters to
> work in, now it's nearly 80.  I don't buy that that's not a significant
> difference.  It's also much less ugly if you decide you need one more
> line than the English version uses.
> 

That's not what I meant. I've never felt a strong urge to avoid wrapping
at 80 chars when translating strings - simply translate in the most
meaningful way, and if it gets longer than 80 chars and can't be easily
shortened, just wrap. Whether there are 60 or 80 characters available
does not change this much - 80 chars may allow more unwrapped strings,
of course, but it's a minor change for the translators. Or at least for
me; others may disagree, of course.

What I find way more annoying are strings where it's not clear where to
wrap, because they get prefixed by something, a value gets inserted while
formatting the error message, etc. But that's not the case here, as both

  _(" "
"  ")

and

  _(" ")

give you the whole string.
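
To make the comparison concrete with a made-up help line (not an actual
psql message, just an illustration of the two styles):

#include <stdio.h>

#define _(x) (x)            /* stand-in for gettext() */

int
main(void)
{
    /* description already wrapped onto a second line in the source */
    printf(_("  PSQL_EDITOR\n"
             "    editor used by the \\e and \\ef commands\n"));

    /* the same description as a single line */
    printf(_("  PSQL_EDITOR        editor used by the \\e and \\ef commands\n"));

    return 0;
}

In both cases the msgid contains the whole description, which is all the
context the translator needs.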


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql: new help related to variables are not too readable

2017-09-08 Thread Tomas Vondra
Hi,

On 09/08/2017 07:25 AM, Fabien COELHO wrote:
> 
> Hello,
> 
>>>   PSQL_HISTORY   alternative location for the command history file
>>>
>>> I would prefer to revert to that more compact 9.6-formatting.
>>
>> There was a problem with line width .. its hard to respect 80 chars
> 
> Yes.
> 
> Scrolling in two dimensions because it does not fit either way is not
> too practical, so the idea was that it should at least fit a reasonable
> terminal in the horizontal dimension, the vertical one having been
> unfittable anyway for a long time.
> 
> Once you want to do strict 80 columns, a lot of/most descriptions do not
> fit and need to be split somehow on two lines, one way or another. It
> seemed that
> 
>   XXX
>    xxx xx xxx xxx 
> 
> Is a good way to do that systematically and with giving more space and
> chance for a description to fit in its line. ISTM that it was already
> done like for environment variables, so it is also for homogeneity.
> 

FWIW I also find the new formatting hard to read, as it does not clearly
separate the items. The old formatting was not ideal either, though.

>
> It also simplifies work for translators, who can just follow the same
> formatting and know what to do if a one line English explanation does
> not fit once translated.
> 

As someone responsible for a significant part of the Czech translation,
I find this argument somewhat dubious. I don't see how this

  _(" "
"  ")

simplifies the work for translators, compared to this

  _(" ")

The translator has exactly the same context in both cases, and the
struggle to wrap it at 80 characters will be pretty much the same too.

The thing that would make a difference is automatic wrapping, i.e.
splitting on spaces and then re-building lines shorter than 80 characters ...
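
Something along these lines, I mean (a standalone sketch of the greedy
approach, not code from psql):

#include <stdio.h>
#include <string.h>

/*
 * Greedy wrapping: split the description on spaces and re-build lines
 * shorter than "width", indenting the continuation lines.
 */
static void
print_wrapped(const char *text, int indent, int width)
{
    char    buf[4096];
    char   *word;
    int     col;

    strncpy(buf, text, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    printf("%*s", indent, "");
    col = indent;

    for (word = strtok(buf, " "); word != NULL; word = strtok(NULL, " "))
    {
        int     len = (int) strlen(word);

        /* start a new indented line if the word does not fit */
        if (col > indent && col + len + 1 > width)
        {
            printf("\n%*s", indent, "");
            col = indent;
        }

        if (col > indent)
        {
            putchar(' ');
            col++;
        }

        fputs(word, stdout);
        col += len;
    }
    putchar('\n');
}

int
main(void)
{
    print_wrapped("alternative location for the command history file, "
                  "used when saving the interactive input", 4, 40);
    return 0;
}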

> Finally, as vertical scrolling is mandatory, I would be fine with
> skipping lines with entries for readability, but it is just a matter of
> taste and I expect there should be half a dozen different opinions on
> the matter of formatting.
> 

FWIW, +1 to extra lines from me - I find it way more readable, as it
clearly separates the items. You're right this is also a matter of taste
to some degree, so I understand Erik's point about vertical scrolling.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Pageinspect - add functions on GIN and GiST indexes from gevel

2017-09-07 Thread Tomas Vondra
Hi,

On 07/21/2017 03:40 PM, Alexander Korotkov wrote:
> Hi, Alexey!
> 
> On Fri, Jul 21, 2017 at 3:05 PM, Alexey Chernyshov
> <a.chernys...@postgrespro.ru> wrote:
> 
> the following patch transfers functionality from gevel module
> (http://www.sai.msu.su/~megera/wiki/Gevel) which provides functions for
> analyzing GIN and GiST indexes to pageinspect. Gevel was originally
> designed by Oleg Bartunov, and Teodor Sigaev for developers of GiST and
> GIN indexes.
> 
> 
> It's very good that you've picked up this work!  pageinspect is lacking
> of this functionality.  
> 
> Functions added:
>  - gist_stat(text) - shows statistics on GiST Tree
>  - gist_tree(text) - shows GiST tree
>  - gist_tree(text, int4) - shows GiST tree up to MAXLEVEL
>  - gist_print(text) - prints objects stored in GiST tree
>  - spgist_stat(text) - shows statistics on SP-GiST
>  - spgist_print(text) - prints objects stored in index
>  - gin_value_count() - originally gin_stat(text) - prints estimated
>    counts for index values
>  - gin_stats() - originally gin_statpage(text) - shows statistics
>  - gin_count_estimate(text, tsquery) - shows number of indexed rows
>    matched by the query
> 
> Tests are also transferred, and docs for the new functions are added. I
> ran pgindent over the code, but the result is different from what I
> expected, so I leave the pgindented one.
> The patch is applicable to the commit
> 866f4a7c210857aa342bf901558d170325094dde.
> 
> 
> As we can see, gevel contains functions which analyze the whole index.
>  pageinspect is written in another manner: it gives you functionality to
> analyze individual pages, tuples and so on.
> Thus, we probably shouldn't try to move gevel functions to pageinspect
> "as is".  They should be rewritten in more granular manner as well as
> other pageinspact functions are.  Any other opinions?
>  

I agree with Alexander that pageinspect is written in a very different
way - as the extension name suggests, it's used to inspect pages. The
proposed patch uses a very different approach, reading the whole index,
not individual pages. Why should it be part of pageinspect?

For example, we have the pgstattuple extension, which seems like a better
match. Or just create a new extension - if the code is valuable, surely
we can live with one more extension instead of smuggling it in through
pageinspect.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] The case for removing replacement selection sort

2017-09-07 Thread Tomas Vondra
Hi,

On 08/31/2017 02:56 AM, Peter Geoghegan wrote:
> On Wed, Aug 30, 2017 at 5:38 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> Wow.  Just to be clear, I am looking for the BEST case for replacement
>> selection, not the worst case.  But I would have expected that case to
>> be a win for replacement selection, and it clearly isn't.  I can
>> reproduce your results here.
> 
> But I *was* trying to get a best case. That's why it isn't even worse.
> That's what the docs say the best case is, after all.
> 
> This is the kind of thing that replacement selection actually did do
> better with on 9.6. I clearly remember Tomas Vondra doing lots of
> benchmarking, showing some benefit with RS with a work_mem of 8MB or
> less. As I said in my introduction on this thread, we weren't wrong to
> add replacement_sort_tuples to 9.6, given where things were with
> merging at the time. But, it does very much appear to create less than
> zero benefit these days. The picture changed.
> 

Do we need/want to repeat some of that benchmarking on these patches? I
don't recall how much this code changed since those benchmarks were done
in the 9.6 cycle.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GnuTLS support

2017-09-07 Thread Tomas Vondra
Hi,

On 09/04/2017 04:24 PM, Bruce Momjian wrote:
> On Fri, Sep  1, 2017 at 12:09:35PM -0400, Robert Haas wrote:
>> I think that what this shows is that the current set of GUCs is overly
>> OpenSSL-centric.  We created a set of GUCs that are actually specific
>> to one particular implementation but named them as if they were
>> generic.  My idea about this would be to actually rename the existing
>> GUCs to start with "openssl" rather than "ssl", and then add new GUCs
>> as needed for other SSL implementations.
>>
>> Depending on what we think is best, GUCs for an SSL implementation
>> other than the one against which we compiled can either not exist at
>> all, or can exist but be limited to a single value (e.g. "none", as we
>> currently do when the compile has no SSL support at all).  Also, we
>> could add a read-only GUC with a name like ssl_library that exposes
>> the name of the underlying SSL implementation - none, openssl, gnutls,
>> or whatever.
>>
>> I think if we go the route of insisting that every SSL implementation
>> has to use the existing GUCs, we're just trying to shove a square peg
>> into a round hole, and there's no real benefit for users.  If the
>> string that has to be stuffed into ssl_ciphers differs based on which
>> library was chosen at compile time, then you can't have a uniform
>> default configuration for all libraries anyway.  I think it'll be
>> easier to explain and document this if there's separate documentation
>> for openssl_ciphers, gnutls_ciphers, etc. rather than one giant
>> documentation section that tries to explain every implementation
>> separately.
> 
> I am worried about having 3x version of TLS controls in
> postgresql.conf, and only one set being active. Perhaps we need to
> break out the TLS config to separate files or something. Anyway, this
> needs more thought.
> 

Well, people won't be able to set the inactive options, just like you
can't set ssl=on when you build without OpenSSL support. But perhaps we
could simply not include the inactive options in the config file, no?

I don't see how breaking the TLS config into separate files improves the
situation - instead of dealing with GUCs in a single file you'll be
dealing with the same GUCs in multiple files. No?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Remove 1MB size limit in tsvector

2017-09-07 Thread Tomas Vondra
Hi,

On 08/17/2017 12:23 PM, Ildus Kurbangaliev wrote:
> In my benchmarks when database fits into buffers (so it's measurement of
> the time required for the tsvectors conversion) it gives me these
> results:
> 
> Without conversion:
> 
> $ ./tsbench2 -database test1 -bench_time 300
> 2017/08/17 12:04:44 Number of connections:  4
> 2017/08/17 12:04:44 Database:  test1
> 2017/08/17 12:09:44 Processed: 51419
> 
> With conversion:
> 
> $ ./tsbench2 -database test1 -bench_time 300
> 2017/08/17 12:14:31 Number of connections:  4
> 2017/08/17 12:14:31 Database:  test1
> 2017/08/17 12:19:31 Processed: 43607
> 
> I ran a bunch of these tests, and these results are stable on my
> machine. So in these specific tests performance regression about 15%.
> 
> Same time I think this could be the worst case, because usually data
> is on disk and conversion will not affect so much to performance.
> 

That seems like a fairly significant regression, TBH. I don't quite
agree we can simply assume in-memory workloads don't matter - plenty of
databases have a 99% cache hit ratio (particularly when considering not
just shared buffers, but also the page cache).

Can you share the benchmarks, so that others can retry running them?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hooks to track changed pages for backup purposes

2017-09-07 Thread Tomas Vondra
Hi,

On 09/01/2017 08:13 AM, Andrey Borodin wrote:
> Thank you for your reply, Michael! Your comments are valuable,
> especially in the world of backups.
>
>> On 31 Aug 2017, at 19:44, Michael Paquier <michael.paqu...@gmail.com>
>> wrote:
>> Such things are not Postgres-C like.
> Will be fixed.
>

A few more comments:

* The patch defines wal_switch_hook, but it's never called.

* I see there are conditions like this:

if(xlogreader->blocks[nblock].forknum == MAIN_FORKNUM)

Why is it enough to restrict the block-tracking code to the main fork?
Aren't we interested in all relation forks?

>> I don't understand what xlog_begin_insert_hook() is good for.
> memset control structures and array of blocknos and relfilenodes of
> the size XLR_MAX_BLOCK_ID.
>

Why can't that happen in the other hooks? I guess you'll have to explain
what the implementation of the hooks is supposed to do, and why these
locations for hook calls are the right ones. It's damn impossible to
validate the patch without that information.

Assuming you still plan to use the hook approach ...

>> There
>> are no arguments fed to this hook, so modules would not be able to
>> analyze things in this context, except shared memory and process
>> state?
>
>>
>> Those hooks are put in hot code paths, and could impact performance of
>> WAL insertion itself.
> I do not think sending few bytes to cached array is comparable to disk
> write of XLog record. Checking the func ptr is even cheaper with correct
> branch prediction.
>

That seems somewhat suspicious, for two reasons. Firstly, I believe we
only insert the XLOG records into the WAL buffers here, so why should
there be any disk write involved? Or do you mean the final commit?

But more importantly, doesn't this kind of information require some
durability guarantees? I mean, if it gets lost during server crashes or
restarts, doesn't that mean the incremental backups might miss some
buffers? I'd guess the hooks will have to do some sort of I/O, to
achieve that, no?

In any case, claims like this probably deserve some benchmarking.

>> So you basically move the cost of scanning WAL
>> segments for those blocks from any backup solution to the WAL
>> insertion itself. Really, wouldn't it be more simple to let for
>> example the archiver process to create this meta-data if you just want
>> to take faster backups with a set of segments? Even better, you could
>> do a scan after archiving N segments, and then use M jobs to do this
>> work more quickly. (A set of background workers could do this job as
>> well).
> I like the idea of doing this during archiving. It is a different
> trade-off between performance of OLTP and performance of backups.
> Essentially, it is parsing WAL some time before doing the backup. The
> best thing about it is usage of CPUs that are usually spinning in an
> idle loop on backup machines.
>
>> In the backup/restore world, backups can be allowed to be taken at a
>> slow pace, what matters is to be able to restore them quickly.
> Backups are taken much more often than restored.
>

I agree. The ability to take backups quickly (and often) is just as
important as fast recovery.

>> In short, anything moving performance from an external backup code path
>> to a critical backend code path looks like a bad design to begin with.
>> So I am dubious that what you are proposing here is a good idea.
> I will think about it more. This proposal takes a vanishingly small part
> of backend performance, but, indeed, a nonzero part.
>

But I think this is not necessarily just about code paths - moving it to
the archiver or a bunch of background workers may easily have just as
negative an effect (or an even worse one), as it's using the same CPUs
and has to actually re-read the WAL segments from disk (which may be a
lot of I/O, particularly when done from multiple processes).

From this POV, the idea to collect this information on the backup system
(WAL archive) by pre-processing the arriving WAL segments seems like the
most promising. It moves the work to another system, the backup system
can make it as durable as the WAL segments, etc.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2017-09-07 Thread Tomas Vondra
Hi,

Attached is an updated version of the patch, fixing the issues reported
by Adrien Nayrat, and also a bunch of issues pointed out by valgrind.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


0001-MCV-lists.patch.gz
Description: application/gzip


0002-multivariate-histograms.patch.gz
Description: application/gzip

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] error-handling in ReorderBufferCommit() seems somewhat broken

2017-08-29 Thread Tomas Vondra


On 08/30/2017 02:19 AM, Andres Freund wrote:
> On 2017-08-30 02:11:10 +0200, Tomas Vondra wrote:
>>
>> I'm not really following your reasoning. You may very well be right that
>> the BeginInternalSubTransaction() example is not quite valid on postgres
>> core, but I don't see how that implies there can be no other errors in
>> the PG_TRY block. It was merely an explanation how I noticed this issue.
>>
>> To make it absolutely clear, I claim that the PG_CATCH() block is
>> demonstrably broken as it calls AbortCurrentTransaction() and then
>> accesses already freed memory.
> 
> Yea, but that's not what happens normally. The txn memory isn't
> allocated in the context created by the started [sub]transaction. I
> think you're just seeing what you're seeing because an error thrown
> before the BeginInternalSubTransaction() succeeds, will roll back the
> *containing* transaction (the one that did the SQL SELECT * FROM
> pg_*logical*() call), and free all the entire reorderbuffer stuff.
> 
> 
>> The switch in the PG_TRY() blocks includes multiple elog(ERROR) calls in
>> the switch, and AFAICS hitting any of them will have exactly the same
>> effect as failure in BeginInternalSubTransaction().
> 
> No, absolutely not. Once the subtransaction starts, the
> AbortCurrentTransaction() will abort that subtransaction, rather than
> the containing one.
> 

Ah, I see. You're right that elog(ERROR) calls after the subtransaction start
are handled correctly and don't trigger a segfault. Sorry for the noise
and thanks for the explanation.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] error-handling in ReorderBufferCommit() seems somewhat broken

2017-08-29 Thread Tomas Vondra


On 08/30/2017 01:34 AM, Andres Freund wrote:
> Hi,
> 
> On 2017-08-30 01:27:34 +0200, Tomas Vondra wrote:
>> I've been investigating some failures in test_decoding regression tests,
>> and it seems to me the error-handling in ReorderBufferCommit() is
>> somewhat broken, leading to segfault crashes.
>>
>> The problematic part is this:
>>
>> PG_CATCH()
>> {
>> /*
>>  * Force cache invalidation to happen outside of a valid transaction
>>  * to prevent catalog access as we just caught an error.
>>  */
>> AbortCurrentTransaction();
>>
>> /* make sure there's no cache pollution */
>> ReorderBufferExecuteInvalidations(rb, txn);
>>
>> ...
>>
>> }
>>
>> Triggering it trivial - just add elog(ERROR,...) at the beginning of the
>> PG_TRY() block.
> 
> That's not really a valid thing to do, you should put it after the
> BeginInternalSubTransaction()/StartTransactionCommand(). It's basically
> assumed that those won't fail - arguably they should be outside of a
> PG_TRY then, but that's a different matter.  If you start decoding
> outside of SQL failing before those will lead to rolling back the parent
> tx...
> 
> 
>> I suppose this is not quite intentional, but rather an example that
>> error-handling code is an order of magnitude more complicated to write
>> and test. I've only noticed as I'm investigating some regression
>> failures on Postgres-XL 10, which does not support subtransactions and
>> so the BeginInternalSubTransaction() call in the try branch always
>> fails, triggering the issue.
> 
> So, IIUC, there's no live problem in postgres core, besides some ugly &
> undocumented assumptions?
> 

I'm not really following your reasoning. You may very well be right that
the BeginInternalSubTransaction() example is not quite valid on postgres
core, but I don't see how that implies there can be no other errors in
the PG_TRY block. It was merely an explanation how I noticed this issue.

To make it absolutely clear, I claim that the PG_CATCH() block is
demonstrably broken as it calls AbortCurrentTransaction() and then
accesses already freed memory.

The switch in the PG_TRY() blocks includes multiple elog(ERROR) calls in
the switch, and AFAICS hitting any of them will have exactly the same
effect as a failure in BeginInternalSubTransaction(). And I suppose there
are many other ways to trigger an error and get into the catch block. In
other words, why would we have the catch block at all?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] error-handling in ReorderBufferCommit() seems somewhat broken

2017-08-29 Thread Tomas Vondra
Hi,

I've been investigating some failures in test_decoding regression tests,
and it seems to me the error-handling in ReorderBufferCommit() is
somewhat broken, leading to segfault crashes.

The problematic part is this:

PG_CATCH()
{
/*
 * Force cache invalidation to happen outside of a valid transaction
 * to prevent catalog access as we just caught an error.
 */
AbortCurrentTransaction();

/* make sure there's no cache pollution */
ReorderBufferExecuteInvalidations(rb, txn);

...

}

Triggering it trivial - just add elog(ERROR,...) at the beginning of the
PG_TRY() block.

The problem is that AbortCurrentTransaction() apparently releases the
memory where txn is allocated, making it entirely bogus. So in assert
builds txn->invalidations becomes 0x7f7f7f7f7f7f7f7f, triggering a segfault.

Similar issues apply to subsequent calls in the catch block, that also
use txn in some way (e.g. through snapshot_now).

I suppose this is not quite intentional, but rather an example that
error-handling code is an order of magnitude more complicated to write
and test. I've only noticed it while investigating some regression
failures on Postgres-XL 10, which does not support subtransactions and
so the BeginInternalSubTransaction() call in the try branch always
fails, triggering the issue.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2017-08-26 Thread Tomas Vondra


On 08/17/2017 12:06 PM, Adrien Nayrat wrote:>
> Hello,
> 
> There is no check of "statistics type/kind" in
> pg_stats_ext_mcvlist_items and pg_histogram_buckets.
> 
> select stxname,stxkind from pg_statistic_ext ; stxname  | stxkind 
> ---+- stts3 | {h} stts2 | {m}
> 
> So you can call :
> 
> SELECT * FROM pg_mcv_list_items((SELECT oid FROM pg_statistic_ext
> WHERE stxname = 'stts3'));
> 
> SELECT * FROM pg_histogram_buckets((SELECT oid FROM pg_statistic_ext
> WHERE stxname = 'stts2'), 0);
> 
> Both crashes.
> 

Thanks for the report, this is clearly a bug. I don't think we need to
test the stxkind, but rather a missing check that the requested type is
actually built.

> Unfotunately, I don't have the knowledge to produce a patch :/
> 
> Small fix in documentation, patch attached.
> 

Thanks, will fix.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2017-08-15 Thread Tomas Vondra



On 08/15/2017 09:55 PM, Robert Haas wrote:
> On Tue, Aug 15, 2017 at 3:42 PM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> What I think we should not do is interpret the bitmasks (omitting some of
>> the information) assuming all the bits were set correctly.
>
> I'm still confused. HEAP_XMIN_COMMITTED|HEAP_XMIN_ABORTED ==
> HEAP_XMIN_FROZEN. Nobody is proposing to omit anything; to the
> contrary, what's being proposed is not to display the same thing
> twice (and in a misleading fashion, to boot).



I understand your point. Assume you're looking at this bit of code:

if (HeapTupleHeaderXminCommitted(enumval_tup->t_data))
return;

which is essentially

if (enumval_tup->t_data->t_infomask & HEAP_XMIN_COMMITTED)
return;

If the function only gives you HEAP_XMIN_FROZEN, how likely is it that
you'll miss that this actually evaluates as true?


You might say that people investigating issues in this area of code 
should be aware of how HEAP_XMIN_FROZEN is defined, and perhaps you're 
right ...
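
For the record, this is roughly how the bits and macros are defined in
htup_details.h, which is exactly why the "committed" test also succeeds
for frozen tuples:

/* infomask bits for xmin */
#define HEAP_XMIN_COMMITTED     0x0100  /* t_xmin committed */
#define HEAP_XMIN_INVALID       0x0200  /* t_xmin invalid/aborted */
#define HEAP_XMIN_FROZEN        (HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID)

/* true for committed tuples, but also for frozen ones (both bits set) */
#define HeapTupleHeaderXminCommitted(tup) \
    (((tup)->t_infomask & HEAP_XMIN_COMMITTED) != 0)

/* true only when both bits are set */
#define HeapTupleHeaderXminFrozen(tup) \
    (((tup)->t_infomask & HEAP_XMIN_FROZEN) == HEAP_XMIN_FROZEN)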


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2017-08-15 Thread Tomas Vondra



On 08/15/2017 07:54 PM, Robert Haas wrote:
> On Tue, Aug 15, 2017 at 9:59 AM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>>> I don't think so -- the "committed" and "invalid" meanings are
>>> effectively canceled when the "frozen" mask is present.
>>>
>>> I mean, "committed" and "invalid" contradict each other...
>>
>> FWIW I agree with Craig - the functions should output the masks raw, without
>> any filtering. The reason is that when you're investigating data corruption
>> or unexpected behavior, all this is very useful when reasoning about what
>> might (not) have happened.
>>
>> Or at least make the filtering optional.
>
> I don't think "filtering" is the right way to think about it.  It's
> just labeling each combination of bits with the meaning appropriate to
> that combination of bits.
>
> I mean, if you were displaying the contents of a CLOG entry, would you
> want the value 3 to be displayed as COMMITTED ABORTED SUBCOMMITTED
> because TRANSACTION_STATUS_COMMITTED|TRANSACTION_STATUS_ABORTED ==
> TRANSACTION_STATUS_SUB_COMMITTED?
>
> I realize that you may be used to thinking of the HEAP_XMIN_COMMITTED
> and HEAP_XMAX_COMMITTED bits as two separate bits, but that's not
> really true any more.  They're a 2-bit field that can have one of four
> values: committed, aborted, frozen, or none of the above.



All I'm saying is that having the complete information (knowing which 
bits are actually set in the bitmask) is valuable when reasoning about 
how you might have gotten to the current state. Which I think is what 
Craig is after.


What I think we should not do is interpret the bitmasks (omitting some 
of the information) assuming all the bits were set correctly.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2017-08-15 Thread Tomas Vondra



On 08/15/2017 03:24 PM, Robert Haas wrote:
> On Mon, Aug 14, 2017 at 9:59 PM, Craig Ringer <cr...@2ndquadrant.com> wrote:
>> The bits are set, those macros just test to exclude the special meaning of
>> both bits being set at once to mean "frozen".
>>
>> I was reluctant to filter out HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID
>> when we detect that it's frozen, because that could well be misleading when
>> debugging.
>
> I don't think so -- the "committed" and "invalid" meanings are
> effectively canceled when the "frozen" mask is present.
>
> I mean, "committed" and "invalid" contradict each other...



FWIW I agree with Craig - the functions should output the masks raw, 
without any filtering. The reason is that when you're investigating data 
corruption or unexpected behavior, all this is very useful when 
reasoning about what might (not) have happened.


Or at least make the filtering optional.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] PATCH : Generational memory allocator (was PATCH: two slab-like memory allocators)

2017-08-13 Thread Tomas Vondra

Hi,

Attached is a rebased version of the Generational context, originally 
submitted with SlabContext (which was already committed into Pg 10).


The main change is that I've abandoned the pattern of defining a Data 
structure and then a pointer typedef, i.e.


typedef struct GenerationContextData { ... } GenerationContextData;
typedef struct GenerationContextData *GenerationContext;

Now it's just

typedef struct GenerationContext { ... } GenerationContext;

mostly because SlabContext was committed like that, and because Andres 
was complaining about this code pattern ;-)


Otherwise the design is the same as repeatedly discussed before.
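
To recap the design in code form (a simplified sketch - the real block
header in generation.c tracks a bit more state, and the names here are
only illustrative):

#include <stdlib.h>

/*
 * Each block only remembers how many chunks were allocated from it and
 * how many of those were already freed. Freed space within a block is
 * never reused; once the two counters match, the whole block is thrown
 * away.
 */
typedef struct GenBlock
{
    char   *freeptr;        /* start of unused space within the block */
    char   *endptr;         /* end of the block */
    int     nchunks;        /* chunks allocated from this block */
    int     nfree;          /* chunks already freed again */
} GenBlock;

/* record that one more chunk from the block was freed */
static void
gen_free_chunk(GenBlock *block)
{
    block->nfree++;

    /* all chunks freed -> release the block itself */
    if (block->nfree == block->nchunks)
        free(block);        /* the header sits at the start of the block */
}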

To show that this is still a valuable change (even after SlabContext and
adding a doubly-linked list to AllocSet), I've repeated the test done by
Andres in [1] using the test case described in [2], that is


  -- generate data
  SELECT COUNT(*) FROM (SELECT test1()
  FROM generate_series(1, 5)) foo;

  -- benchmark (measure time and VmPeak)
  SELECT COUNT(*) FROM (SELECT *
  FROM pg_logical_slot_get_changes('test', NULL,
NULL, 'include-xids', '0')) foo;

with different values passed to the first step (instead of the 5). 
The VmPeak numbers look like this:


 N   masterpatched
--
10   1155220 kB  361604 kB
20   2020668 kB  434060 kB
30   2890236 kB  502452 kB
40   3751592 kB  570816 kB
50   4621124 kB  639168 kB

and the timing (on assert-enabled build):

 N   masterpatched
--
10  1103.182 ms 412.734 ms
20  2216.711 ms 820.438 ms
30  3320.095 ms1223.576 ms
40  4584.919 ms1621.261 ms
50  5590.444 ms2113.820 ms

So it seems it's still a significant improvement, both in terms of 
memory usage and timing. Admittedly, this is a single test, so ideas of 
other useful test cases are welcome.


regards


[1] 
https://www.postgresql.org/message-id/20170227111732.vrx5v72ighehwpkf%40alap3.anarazel.de


[2] 
https://www.postgresql.org/message-id/20160706185502.1426.28143%40wrigleys.postgresql.org


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 1c46d25ffa9bb104c415cba7c7b3a013958b6ab5 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@2ndquadrant.com>
Date: Mon, 14 Aug 2017 01:52:50 +0200
Subject: [PATCH] Generational memory allocator

This memory context is based on the assumption that the allocated chunks
have similar lifespan, i.e. that chunks allocated close from each other
(by time) will also be freed in close proximity, and mostly in the same
order. This is typical for various queue-like use cases, i.e. when
tuples are constructed, processed and then thrown away.

The memory context uses a very simple approach to free space management.
Instead of a complex global freelist, each block tracks a number
of allocated and freed chunks. The space released by freed chunks is not
reused, and once all chunks are freed (i.e. when nallocated == nfreed),
the whole block is thrown away. When the allocated chunks have similar
lifespan, this works very well and is extremely cheap.
---
 src/backend/replication/logical/reorderbuffer.c |  74 +--
 src/backend/utils/mmgr/Makefile |   2 +-
 src/backend/utils/mmgr/generation.c | 768 
 src/include/nodes/memnodes.h|   4 +-
 src/include/nodes/nodes.h   |   1 +
 src/include/replication/reorderbuffer.h |  15 +-
 src/include/utils/memutils.h|   5 +
 7 files changed, 790 insertions(+), 79 deletions(-)
 create mode 100644 src/backend/utils/mmgr/generation.c

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5567bee..5309170 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -150,15 +150,6 @@ typedef struct ReorderBufferDiskChange
  */
 static const Size max_changes_in_memory = 4096;
 
-/*
- * We use a very simple form of a slab allocator for frequently allocated
- * objects, simply keeping a fixed number in a linked list when unused,
- * instead pfree()ing them. Without that in many workloads aset.c becomes a
- * major bottleneck, especially when spilling to disk while decoding batch
- * workloads.
- */
-static const Size max_cached_tuplebufs = 4096 * 2;	/* ~8MB */
-
 /* ---
  * primary reorderbuffer support routines
  * ---
@@ -248,6 +239,10 @@ ReorderBufferAllocate(void)
 			SLAB_DEFAULT_BLOCK_SIZE,
 			sizeof(ReorderBufferTXN)

Re: [HACKERS] VACUUM and ANALYZE disagreeing on what reltuples means

2017-07-25 Thread Tomas Vondra




On 7/25/17 5:04 PM, Tom Lane wrote:

Tomas Vondra <tomas.von...@2ndquadrant.com> writes:

On 7/25/17 12:55 AM, Tom Lane wrote:
I think the planner basically assumes that reltuples is the live 
tuple count, so maybe we'd better change VACUUM to get in step.



Attached is a patch that (I think) does just that. The disagreement
was caused by VACUUM treating recently dead tuples as live, while
ANALYZE treats both of those as dead.



At first I was worried that this will negatively affect plans in
the long-running transaction, as it will get underestimates (due
to reltuples not including rows it can see). But that's a problem
we already have anyway, you just need to run ANALYZE in the other
session.


This definitely will have some impact on plans, at least in cases
where there's a significant number of unvacuumable dead tuples. So I
think it's a bit late for v10, and I wouldn't want to back-patch at
all. Please add to the next commitfest.



I dare to disagree here, for two reasons.

Firstly, the impact *is* already there - it only takes running ANALYZE, 
or VACUUM ANALYZE. In both those cases we already end up with 
reltuples=n_live_tup.


Secondly, I personally strongly prefer stable predictable behavior over 
intermittent oscillations between two values. That's a major PITA on 
production, both to investigate and fix.


So people already have this issue, although it only strikes randomly. 
And there's no way to fix it (well, except for fixing the cleanup, but that may 
not be possible).


It is true we tend to run VACUUM more often than ANALYZE, particularly 
in situations where the cleanup can't proceed - ANALYZE will do its 
work and VACUUM will be triggered over and over again, so it "wins" this 
way. But I'm not sure that's something we should rely on.



FWIW I personally see this as a fairly annoying bug, and would vote to 
backpatch it, although I understand people might object. But I don't 
quite see a reason not to fix this in v10.



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] VACUUM and ANALYZE disagreeing on what reltuples means

2017-07-25 Thread Tomas Vondra

On 7/25/17 12:55 AM, Tom Lane wrote:

Tomas Vondra <tomas.von...@2ndquadrant.com> writes:

It seems to me that VACUUM and ANALYZE somewhat disagree on what
exactly reltuples means. VACUUM seems to be thinking that reltuples
= live + dead while ANALYZE apparently believes that reltuples =
live



The question is - which of the reltuples definitions is the right
one? I've always assumed that "reltuples = live + dead" but perhaps
not?


I think the planner basically assumes that reltuples is the live
tuple count, so maybe we'd better change VACUUM to get in step.



Attached is a patch that (I think) does just that. The disagreement was 
caused by VACUUM treating recently dead tuples as live, while ANALYZE 
treats both of those as dead.


At first I was worried that this will negatively affect plans in the 
long-running transaction, as it will get underestimates (due to 
reltuples not including rows it can see). But that's a problem we 
already have anyway, you just need to run ANALYZE in the other session.
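
To make that concrete, here is the arithmetic with the numbers from the 
original report further down in this thread (a rough sketch, assuming 
VACUUM scanned the whole table so vac_estimate_reltuples does not need 
to extrapolate):

  num_tuples = 1000000   (tuples VACUUM counted)
  nkeep      =   99587   (recently dead, not yet removable)

  before the patch:  reltuples ~ num_tuples         = 1000000
  with the patch:    reltuples ~ num_tuples - nkeep =  900413

which matches the n_live_tup value that ANALYZE already reports.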


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index c5c194a..574ca70 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1261,7 +1261,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 	vacrelstats->new_rel_tuples = vac_estimate_reltuples(onerel, false,
 		 nblocks,
  vacrelstats->tupcount_pages,
-		 num_tuples);
+		 num_tuples - nkeep);
 
 	/*
 	 * Release any remaining pin on visibility map page.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] VACUUM and ANALYZE disagreeing on what reltuples means

2017-07-24 Thread Tomas Vondra

Hi,

It seems to me that VACUUM and ANALYZE somewhat disagree on what exactly 
reltuples means. VACUUM seems to be thinking that


reltuples = live + dead

while ANALYZE apparently believes that

reltuples = live

This causes somewhat bizarre changes in the value, depending on which of 
those commands was executed last.


To demonstrate the issue, let's create a simple table with 1M rows, 
delete 10% rows and then we'll do a bunch of VACUUM / ANALYZE and check 
reltuples, n_live_tup and n_dead_tup in the catalogs.


I've disabled autovacuum so that it won't interfere with this, and 
there's another transaction blocking VACUUM from actually cleaning any 
dead tuples.
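
(The blocking session is not shown here, but it can be as simple as a 
transaction that acquired an XID and is left open in another psql 
session, e.g. something like:

  BEGIN;
  SELECT txid_current();   -- forces an XID assignment
  -- ... leave the transaction open ...

which holds back the xmin horizon and prevents VACUUM from removing the 
deleted rows.)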



test=# create table t as
   select i from generate_series(1,1000000) s(i);

test=# select reltuples, n_live_tup, n_dead_tup
 from pg_stat_user_tables join pg_class using (relname)
where relname = 't';

 reltuples | n_live_tup | n_dead_tup
-----------+------------+------------
     1e+06 |    1000000 |          0

So, that's nice. Now let's delete 10% of rows, and run VACUUM and 
ANALYZE a few times.


test=# delete from t where random() < 0.1;

test=# vacuum t;

test=# select reltuples, n_live_tup, n_dead_tup
 from pg_stat_user_tables join pg_class using (relname)
where relname = 't';

 reltuples | n_live_tup | n_dead_tup
-----------+------------+------------
     1e+06 |     900413 |      99587


test=# analyze t;

 reltuples | n_live_tup | n_dead_tup
-----------+------------+------------
    900413 |     900413 |      99587

test=# vacuum t;

 reltuples | n_live_tup | n_dead_tup
-----------+------------+------------
     1e+06 |     900413 |      99587


So, analyze and vacuum disagree.

To further confuse the poor DBA, VACUUM always simply ignores the old 
values while ANALYZE combines the old and new values on large tables 
(and converges to the "correct" value after a few steps). This table is 
small (less than 30k pages), so ANALYZE does not do that.


This is quite annoying, because people tend to look at reltuples while 
investigating bloat (e.g. because the check_postgres query mentioned on 
our wiki [1] uses reltuples in the formula).


[1] https://wiki.postgresql.org/wiki/Show_database_bloat

And when the cleanup is blocked for some reason (as in the example 
above), VACUUM tends to be running much more often (because it can't 
clean up anything). So reltuples tends to be set to the higher value, 
which I'd argue is the wrong value for estimating bloat.


I haven't looked at the code yet, but I've confirmed this happens both 
on 9.6 and 10. I haven't checked older versions, but I guess those are 
affected too.


The question is - which of the reltuples definitions is the right one? 
I've always assumed that "reltuples = live + dead" but perhaps not?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pluggable storage

2017-06-26 Thread Tomas Vondra

Hi,

On 06/26/2017 05:18 PM, Alexander Korotkov wrote:

Hackers,

I see that design question for PostgreSQL pluggable storages is very 
hard.


IMHO it's mostly expected to be hard.

Firstly, PostgreSQL is a mature product with many advanced features, and 
reworking a low-level feature without breaking something on top of it is 
hard by definition.


Secondly, project policies and code quality requirements set the bar 
very high too, I think.


> BTW, I think it worth analyzing existing use-cases of pluggable

storages.  I think that the most famous DBMS with pluggable storage API
is MySQL. This why I decided to start with it. I've added
MySQL/MariaDB section on wiki page.
https://wiki.postgresql.org/wiki/Future_of_storage#MySQL.2FMariaDB
It appears that significant part of MySQL storage engines are misuses.  
MySQL lacks of various features like FDWs or writable views and so on.  
This is why developers created a lot of pluggable storages for that 
purposes.  We definitely don't want something like this in PostgreSQL 
now.  I created special resume column where I expressed whether it

would be nice to have something like this table engine in PostgreSQL.



I don't want to discourage you, but I'm not sure how valuable this is.

I agree it's valuable to have an overview of use cases for pluggable 
storage, but I don't think we'll get that from looking at MySQL. As you 
noticed, most of the storage engines are misuses, so it's difficult to 
learn anything valuable from them. You can argue that using FDWs to 
implement alternative storage engines is a misuse too, but at least that 
gives us a valid use case (columnar storage implemented using FDW).


If anything, the MySQL storage engines should serve as a cautionary tale 
how not to do things - there's also a plenty of references in the MySQL 
"Restrictions and Limitations" section of the manual:


  https://downloads.mysql.com/docs/mysql-reslimits-excerpt-5.7-en.pdf

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pluggable storage

2017-06-23 Thread Tomas Vondra

Hi,

On 6/23/17 10:38 AM, Teodor Sigaev wrote:

1. Table AM with a 6-byte TID.
2. Table AM with a custom locator format, which could be TID-like.
3. Table AM with no locators.


Currently TID has its own type in system catalog. Seems, it's possible 
that storage claims type of TID which it uses. Then, AM could claim it 
too, so the core based on that information could solve the question 
about AM-storage compatibility. Storage could also claim that it hasn't 
TID type at all so it couldn't be used with any access method, use case: 
compressed column oriented storage.




Isn't the fact that TID is an existing type defined in system catalogs 
a fairly insignificant detail? I mean, we could just as easily define 
a new 64-bit locator data type, and use that instead, for example.


The main issue here is that we assume things about the TID contents, 
i.e. that it contains page/offset etc. And Bitmap nodes rely on that, to 
some extent - e.g. when prefetching data.


>
As I remeber, only GIN depends on TID format, other indexes use it as 
opaque type. Except, at least, btree and GiST - they believe that 
internal pointers are the same as outer (to heap)




I think you're probably right - GIN does compress the posting lists by 
exploiting the TID redundancy (i.e. its page/offset structure), and I 
don't think there are other index AMs doing that.


But I'm not sure we can simply rely on that - it's possible people will 
try to improve other index types (e.g. by adding similar compression to 
other index types). Moreover we now support extensions defining custom 
index AMs, and we have no clue what people may do in those.


So this would clearly require some sort of flag for each index AM.

>

Another dubious part - BitmapScan.



It would be really great if you could explain why BitmapScans are 
dubious, instead of just labeling them as dubious. (I guess you mean 
Bitmap Heap Scans, right?)


I see no conceptual issue with bitmap scans on arbitrary locator types, 
as long as there's sufficient locality information encoded in the value. 
What I mean by that is that for any two locator values A and B:


   (1) if (A < B) then (A is stored before B)

   (2) if (A is close to B) then (A is stored close to B)

Without these features it's probably futile to try to do bitmap scans, 
because the bitmap would not result in mostly sequential access pattern 
and things like prefetch would not be very efficient, I think.



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pluggable storage

2017-06-22 Thread Tomas Vondra

Hi,

On 6/22/17 4:36 PM, Alexander Korotkov wrote:
On Thu, Jun 22, 2017 at 5:27 PM, Robert Haas <robertmh...@gmail.com> wrote:


On Thu, Jun 22, 2017 at 8:32 AM, Alexander Korotkov
<a.korot...@postgrespro.ru> wrote:
> On Thu, Jun 22, 2017 at 4:01 AM, Michael Paquier <michael.paqu...@gmail.com>
> wrote:
>> Putting that in a couple of words.
>> 1. Table AM with a 6-byte TID.
>> 2. Table AM with a custom locator format, which could be TID-like.
>> 3. Table AM with no locators.
>>
>> Getting into having #1 first to work out would already be really
>> useful for users.
>
> What exactly would be useful for *users*?  Any kind of API itself is
> completely useless for users, because they are users, not developers.
> Storage API could be useful for developers to implement storage AMs whose 
in
> turn could be useful for users.

What's your point?  I assume that is what Michael meant.


TBH, I don't understand what particular enchantments do we expect from 
having #1.


I'd say that's one of the things we're trying to figure out in this 
thread. Does it make sense to go with #1 in v1 of the patch, or do we 
have to implement #2 or #3 to get some real benefit for the users?


>
This is why it's hard for me to say if #1 is good idea.  It's also hard 
for me to say if patch upthread is right way of implementing #1.
But, I have gut feeling that if even #1 is good idea itself, it's 
definitely not what users expect from "pluggable storages".


The question is "why" do you think that. What features do you expect 
from pluggable storage API that would be impossible to implement with 
option #1?


I'm not trying to be annoying or anything like that - I don't know the 
answer and discussing those things is exactly why this thread exists.


I do agree #1 has limitations, and that it'd be great to get API that 
supports all kinds of pluggable storage implementations. But I guess 
that'll take some time.



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pluggable storage

2017-06-22 Thread Tomas Vondra

Hi,

On 6/21/17 9:47 PM, Robert Haas wrote:
...
>

like int8 or numeric, it won't work at all.  Even for other things,
it's going to cause problems because the bit patterns won't be what
the code is expecting; e.g. bitmap scans care about the structure of
the TID, not just how many bits it is.  (Due credit: Somebody, maybe
Alvaro, pointed out this problem before, at PGCon.) 


Can you elaborate a bit more about these TID bit pattern issues? I do 
remember that not all TIDs are valid due to safeguards on individual 
fields, like for example


Assert(iptr->ip_posid < (1 << MaxHeapTuplesPerPageBits))

But perhaps there are some other issues?

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Aggregation support for aggregate functions that use transitions not implemented for array_agg

2017-06-18 Thread Tomas Vondra

Hi,

On 6/7/17 5:52 AM, Regina Obe wrote:

On 6/6/17 13:52, Regina Obe wrote:

It seems CREATE  AGGREGATE was expanded in 9.6 to support
parallelization of aggregate functions using transitions, with the
addition of serialfunc and deserialfunc to the aggregate definitions.

https://www.postgresql.org/docs/10/static/sql-createaggregate.html

I was looking at the PostgreSQL 10 source code for some example usages
of this and was hoping that array_agg and string_agg would support the feature.



I'm not sure how you would parallelize these, since in most uses
you want to have a deterministic output order.



--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Good point.  If that's the reason it wasn't done, that's good just wasn't sure.

But if you didn't have an ORDER BY in your aggregate usage, and you
did have those transition functions, it shouldn't be any different from
any other use case right?
I imagine you are right that most folks who use array_agg and
string_agg usually combine it with array_agg(... ORDER BY ..)



I think what TL had in mind is something like

SELECT array_agg(x) FROM (
   SELECT x FROM bar ORDER BY y
) foo;

i.e. a subquery producing the data in predictable order.

>

My main reason for asking is that most of the PostGIS geometry and
raster aggregate functions use transitions and were patterned after
array agg.




In the case of PostGIS the sorting is done internally and really
only to expedite take advantage of things like cascaded union
algorithms.
That is always done though (so even if each worker does it on just it's
batch that's still better than having only one worker).
So I think it's still very beneficial to break into separate jobs
since in the end the gather, will have say 2 biggish geometries or 2
biggish rasters to union if you have 2 workers which is still better
than having a million smallish geometries/rasters to union

I'm not sure I got your point correctly, but if you can (for example) 
sort the per-worker results as part of the "serialize" function, and 
benefit from that while combining that in the gather, then sure, that 
should be a huge win.
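
For reference, this is roughly what a parallel-capable aggregate 
definition looks like since 9.6 - the function names below are just 
placeholders for the extension's own support functions, not anything 
that exists today:

  CREATE AGGREGATE my_geom_union(geometry) (
      sfunc        = my_transfn,       -- builds the per-worker state
      stype        = internal,
      combinefunc  = my_combinefn,     -- merges two partial states
      serialfunc   = my_serialfn,      -- internal -> bytea, to ship the state
      deserialfunc = my_deserialfn,    -- bytea -> internal, in the leader
      finalfunc    = my_finalfn,
      parallel     = safe
  );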


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Data at rest encryption

2017-06-18 Thread Tomas Vondra

Hi all,

I've noticed this thread got resurrected a few days ago, but I haven't 
managed to read all the messages until today. I do have a bunch of 
comments, but let me share them as a single consistent message instead 
of sending a thousand responses to individual messages.



1) Threat model
---

Firstly, I think the thread would seriously benefit from an explanation 
and discussion of the threat model - what types of attacks it's meant to 
address, and what attacks it can't defend against.


My understanding is that data-at-rest encryption generally addresses 
only the "someone stole the disk" case and pretty much nothing else.


Moreover, I claim that encryption implemented at the database-level is 
strictly weaker compared to LUKS or encrypted disks, because it simply 
reveals a lot more information even without decryption (file sizes, 
timestamps, etc.)


That is a serious issue in practice, and researchers have been proving 
that for a long time now. I do recommend this paper from Cornell Tech as 
a great starting point (it cites many papers relevant to this thread):


Why Your Encrypted Database Is Not Secure
Paul Grubbs, Thomas Ristenpart, Vitaly Shmatikov
http://eprint.iacr.org/2017/468.pdf

The paper explains how encryption schemes on general-purpose databases 
fail, due to exactly such side-channels. MVCC, logging and other side 
channels turn all attackers into "persistent passive attackers".


Now, this does not mean the feature is useless - nothing is perfect, and 
security is not a binary feature. It certainly makes attacks more 
difficult compared to a plaintext database. But it's untrue that it's 
basically LUKS, just implemented at the database level.


I'm not suggesting that we should not pursue this idea, but the threat 
model is a crucial piece of information missing in this thread.



2) How do other databases do it?


It was repeatedly mentioned that other databases support this type of 
encryption. So how do they deal with the hard parts? For example how do 
they get and protect the encryption key?


I agree with Stephen that we should not require a full key management 
from v1 of the patch, that's an incredibly difficult thing. And it 
largely depends on the hardware (e.g. it should be possible to move the 
key to TrustZone on ARM / SGX on Intel).



3) Why do users prefer this to FDE?
---

I suppose we're discussing this feature because we've been asked about 
it by users/customers who can't use FDE. Can we reach to them and ask 
them about the reasons? Why can't they use FDE?


It was mentioned in the thread that the FDE solutions are not portable 
between different systems, but honestly - is that an issue? You probably 
can't copy the datadir anyway due to locale differences. If you're 
running multiple operating systems, FDE is just one of many differences.



4) Other solutions?
---

Clearly, FDE (at the block device level) and DB-level encryption are not 
the only solutions. There are also filesystems-level solutions now, for 
example.


ext4 (since kernel 4.1) and f2fs (since kernel 4.2) allow encryption at 
directory level, are transparent to the user space, and include things 
like key management (well, that's provided by kernel). NTFS can do 
something quite similar using EFS.


https://lwn.net/Articles/639427/

https://blog.quarkslab.com/a-glimpse-of-ext4-filesystem-level-encryption.html

Of course, if you happen to use another filesystem (e.g. XFS), this 
won't work for you. But then there's eCryptfs, for example:


https://en.wikipedia.org/wiki/ECryptfs


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TPC-H Q20 from 1 hour to 19 hours!

2017-06-11 Thread Tomas Vondra

Hi,

On 6/11/17 7:54 PM, Peter Geoghegan wrote:

On Sun, Jun 11, 2017 at 10:36 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:

Do you mean teaching the optimizer to do something like this?:


Uh, no.  I don't think we want to add any run-time checks.  The point in
this example is that we'd get a better rowcount estimate if we noticed
that the FK constraint could be considered while estimating the size of
the partsupp-to-aggregated-subquery join.


Sorry for not considering the context of the thread more carefully.
Robert said something about selectivity estimation and TPC-H to me,
which I decide to research; I then rediscovered this thread.

Clearly Q20 is designed to reward systems that do better with moving
predicates into subqueries, as opposed to systems with better
selectivity estimation.



I do strongly recommend reading this paper analyzing choke points of 
individual TPC-H queries:


http://oai.cwi.nl/oai/asset/21424/21424B.pdf

While it's slightly orthogonal to the issue at hand (a poor estimate in 
Q20 causing the choice of an inefficient plan), it's a great paper to 
read. I thought I had already posted a link to this paper sometime in 
the past, but I don't see it in the archives.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TPC-H Q20 from 1 hour to 19 hours!

2017-05-25 Thread Tomas Vondra

Hi,

On 5/25/17 6:03 AM, Robert Haas wrote:

On Thu, Apr 6, 2017 at 4:37 PM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:

Which brings me to the slightly suspicious bit. On 9.5, there's no
difference between GROUP and GROUP+LIKE cases - the estimates are exactly
the same in both cases. This is true too, but only without the foreign key
between "partsupp" and "part", i.e. the two non-grouped relations in the
join. And what's more, the difference (1737 vs. 16) is pretty much exactly
100x, which is the estimate for the LIKE condition.


I don't follow this.  How does the foreign key between partsupp and
part change the selectivity of LIKE?


So it kinda seems 9.5 does not apply this condition for semi-joins, while

=9.6 does that.




Well, get_foreign_key_join_selectivity() does handle restrictions when 
calculating the joinrel size estimate in calc_joinrel_size_estimate(), so 
assuming there's some thinko there, it might easily cause this.


I haven't found any such thinko, but I don't dare to claim I fully 
understand what the current version of get_foreign_key_join_selectivity 
does :-/



If 9.5 and prior are ignoring some of the quals, that's bad, but I
don't think I understand from your analysis why
7fa93eec4e0c9c3e801e3c51aa4bae3a38aaa218 regressed anything.



It's been quite a bit of time since I looked into this, but I think my 
main point was that it's hard to say it's a regression when both the old 
and new estimates are so utterly wrong.


I mean, 9.5 estimates 160, 9.6 estimates 18. We might fix the post-9.6 
estimate to return the same value as 9.5, and it might fix this 
particular query with this particular scale. But in the end it's just 
noise considering that the actual value is 120k (so 3 or 4 orders of 
magnitude off).



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WITH clause in CREATE STATISTICS

2017-05-12 Thread Tomas Vondra



On 5/12/17 4:46 AM, Tom Lane wrote:

Alvaro Herrera <alvhe...@2ndquadrant.com> writes:

Hmm, yeah, makes sense.  Here's a patch for this approach.


...

Also, while reading the docs changes, I got rather ill from the
inconsistent and not very grammatical handling of "statistics" as a
noun, particularly the inability to decide from one sentence to the next
if that is singular or plural.  In the attached, I fixed this in the
ref/*_statistics.sgml files by always saying "statistics object" instead.
If people agree that that reads better, we should make an effort to
propagate that terminology elsewhere in the docs, and into error messages,
objectaddress.c output, etc.



I'm fine with the 'statistics object' wording. I've been struggling with 
this bit while working on the patch, and I agree it sounds better and 
getting it consistent is definitely worthwhile.


>

Although I've not done anything about it here, I'm not happy about the
handling of dependencies for stats objects.  I do not think that cloning
RemoveStatistics as RemoveStatisticsExt is sane at all.  The former is
meant to deal with cleaning up pg_statistic rows that we know will be
there, so there's no real need to expend pg_depend overhead to track them.
For objects that are only loosely connected, the right thing is to use
the dependency system; in particular, this design has no real chance of
working well with cross-table stats.  Also, it's really silly to have
*both* this hard-wired mechanism and a pg_depend entry; the latter is
surely redundant if you have the former.  IMO we should revert
RemoveStatisticsExt and instead handle things by making stats objects
auto-dependent on the individual column(s) they reference (not the whole
table).

I'm also of the opinion that having an AUTO dependency, rather than
a NORMAL dependency, on the stats object's schema is the wrong semantics.
There isn't any other case where you can drop a non-empty schema without
saying CASCADE, and I'm mystified why this case should act that way.



Yeah, it's a bit frankensteinian ...

>

Lastly, I tried the example given in the CREATE STATISTICS reference page,
and it doesn't seem to work.  Without the stats object, the two queries
are both estimated at one matching row, whereas they really produce 100
and 0 rows respectively.  With the stats object, they seem to both get
estimated at ~100 rows, which is a considerable improvement for one case
but a considerable disimprovement for the other.  If this is not broken,
I'd like to know why not.  What's the point of an expensive extended-
stats mechanism if it can't get this right?



I assume you're talking about the functional dependencies, and in that 
case that's expected behavior, because functional dependencies assume 
the conditions are "consistent" with the dependencies.


This is an inherent limitation of functional dependencies, and perhaps a 
price for the simplicity of this statistics type. If your application 
executes queries that are likely not consistent with this, don't use 
functional dependencies.
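
To illustrate the difference between "consistent" and "inconsistent" 
conditions - a sketch using the CREATE STATISTICS syntax as it ended up 
in PG10 (the exact estimates will of course vary):

  CREATE TABLE t (a int, b int);
  INSERT INTO t SELECT i % 100, (i % 100) / 10
    FROM generate_series(1, 100000) s(i);

  CREATE STATISTICS t_deps (dependencies) ON a, b FROM t;
  ANALYZE t;

  -- consistent with the dependency (a = 1 implies b = 0), so the
  -- estimate improves, roughly matching the actual 1000 rows
  EXPLAIN SELECT * FROM t WHERE a = 1 AND b = 0;

  -- inconsistent combination (no such rows exist), yet the dependency
  -- still applies and we over-estimate instead of getting ~0 rows
  EXPLAIN SELECT * FROM t WHERE a = 1 AND b = 5;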


The simplicity is why dependencies were implemented first, mostly to 
introduce all the infrastructure. The other statistics types (MCV, 
histograms) don't have this limitation, but those did not make it into pg10.



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CTE inlining

2017-05-04 Thread Tomas Vondra



On 5/4/17 8:03 PM, Joe Conway wrote:

On 05/04/2017 10:56 AM, Andrew Dunstan wrote:



On 05/04/2017 01:52 PM, Joe Conway wrote:

On 05/04/2017 10:33 AM, Alvaro Herrera wrote:

I'm not sure what your point is.  We know that for some cases the
optimization barrier semantics are useful, which is why the proposal is
to add a keyword to install one explicitely:

  with materialized r as
  (
 select json_populate_record(null::mytype, myjson) as x
 from mytable
  )
  select (x).*
  from r;

this would preserve the current semantics.

I haven't been able to follow this incredibly long thread, so please
excuse me if way off base, but are we talking about that a CTE would be
silently be rewritten as an inline expression potentially unless it is
decorated with some new syntax?

I would find that very disconcerting myself. For example, would this CTE
potentially get rewritten with multiple evaluation as follows?

DROP SEQUENCE IF EXISTS foo_seq;
CREATE SEQUENCE foo_seq;

WITH a(f1) AS (SELECT nextval('foo_seq'))
SELECT a.f1, a.f1 FROM a;
 f1 | ?column?
----+----------
  1 |        1
(1 row)

ALTER SEQUENCE foo_seq RESTART;
SELECT nextval('foo_seq'), nextval('foo_seq');
 nextval | ?column?
---------+----------
       1 |        2
(1 row)



I think that would be a change in semantics, which we should definitely
not be getting. Avoiding a change in semantics might be an interesting
exercise, but we have lots of clever coders ...


Well I think my point is that I always have understood CTEs to be
executed precisely once producing a temporary result set that is then
referenced elsewhere. I don't think that property of CTEs should change.
Somewhere else in the thread someone mentioned predicate push down --
that makes sense and maybe some clever coder can come up with a patch
that does that, but I would not be in favor of CTEs being inlined and
therefore evaluated multiple times.



I agree with this, but there's a difference between "executed exactly 
once" and "producing the same result as if executed exactly once".


I may be misunderstanding what other people proposed in this thread, but 
I think the plan was to only inline CTEs where we know it won't change 
the results, etc. So e.g. CTEs with volatile functions would not get 
inlined, which includes nextval() for example.
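
In other words (a sketch of the intended behavior, with made-up table 
and column names):

  -- no volatile functions, no writes - this could be inlined, i.e.
  -- planned as if it were written as a plain subquery
  WITH a AS (SELECT * FROM mytable WHERE x > 0)
  SELECT * FROM a WHERE y = 1;

  SELECT * FROM (SELECT * FROM mytable WHERE x > 0) a WHERE y = 1;

  -- while the nextval() example above would keep the current
  -- materialized, evaluate-once behavior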


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CTE inlining

2017-05-04 Thread Tomas Vondra

On 5/4/17 7:56 PM, Andrew Dunstan wrote:



On 05/04/2017 01:52 PM, Joe Conway wrote:

On 05/04/2017 10:33 AM, Alvaro Herrera wrote:

I'm not sure what your point is.  We know that for some cases the
optimization barrier semantics are useful, which is why the proposal is
to add a keyword to install one explicitely:

  with materialized r as
  (
 select json_populate_record(null::mytype, myjson) as x
 from mytable
  )
  select (x).*
  from r;

this would preserve the current semantics.

I haven't been able to follow this incredibly long thread, so please
excuse me if way off base, but are we talking about that a CTE would be
silently be rewritten as an inline expression potentially unless it is
decorated with some new syntax?

I would find that very disconcerting myself. For example, would this CTE
potentially get rewritten with multiple evaluation as follows?

DROP SEQUENCE IF EXISTS foo_seq;
CREATE SEQUENCE foo_seq;

WITH a(f1) AS (SELECT nextval('foo_seq'))
SELECT a.f1, a.f1 FROM a;
 f1 | ?column?
----+----------
  1 |        1
(1 row)

ALTER SEQUENCE foo_seq RESTART;
SELECT nextval('foo_seq'), nextval('foo_seq');
 nextval | ?column?
---------+----------
       1 |        2
(1 row)





I think that would be a change in semantics, which we should definitely
not be getting. Avoiding a change in semantics might be an interesting
exercise, but we have lots of clever coders ...



nextval is a volatile function, and in my understanding the plan was not 
to inline CTEs with volatile functions (or CTEs doing writes etc.). That 
is, we'd guarantee the same results as we get now.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WITH clause in CREATE STATISTICS

2017-05-03 Thread Tomas Vondra



On 5/3/17 11:36 PM, Alvaro Herrera wrote:

Andrew Dunstan wrote:


On 05/03/2017 04:42 PM, Tom Lane wrote:



One other point is that as long as we've got reserved keywords introducing
each clause, there isn't actually an implementation reason why we couldn't
accept the clauses in any order.  Not sure I want to document it that way,
but it might not be a bad thing if the grammar was forgiving about whether
you write the USING or ON part first ...


+1 for allowing arbitrary order of clauses. I would document it with the
USING clause at the end, and have that be what psql supports and pg_dump
produces. Since there are no WITH options now we should leave that out
until it's required.


Ok, sounds good to me.  Unless there are objections I'm going to have a
shot at implementing this.  Thanks for the discussion.



Works for me. Do you also plan to remove the parentheses for the USING 
clause?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CTE inlining

2017-05-03 Thread Tomas Vondra



On 5/3/17 9:54 PM, Andreas Karlsson wrote:

On 05/03/2017 07:33 PM, Alvaro Herrera wrote:

1) we switch unmarked CTEs as inlineable by default in pg11.  What seems
likely to happen for a user that upgrades to pg11 is that 5 out of 10
CTE-using queries are going to become faster than with pg10, and they
are going to be happy; 4 out of five are going to see no difference, but
they didn't have to do anything about it; and the remaining query is
going to become slower, either indistinguishably so (in which case they
don't care and they remain happy because of the other improvements) or
notably so, in which case they can easily figure where to add the
MATERIALIZED option and regain the original performance.


2) unmarked CTEs continue to be an optimization barrier, but we add
"WITH INLINED" so that they're inlineable.  Some users may wonder about
it and waste a lot of time trying to figure out which CTEs to add it to.
They see a benefit in half the queries, which makes them happy, but they
are angry that they had to waste all that time on the other queries.


3) We don't do anything, because we all agree that GUCs are not
suitable.  No progress.  No anger, but nobody is happy either.


+1 for option 1. And while I would not like if we had to combine it with 
a backwards compatibility GUC which enables the old behavior to get it 
merged I still personally would prefer that over option 2 and 3.



> Andreas
>

+1 to what Andreas says

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CTE inlining

2017-05-02 Thread Tomas Vondra



On 5/2/17 11:23 PM, Merlin Moncure wrote:

On Tue, May 2, 2017 at 12:05 PM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:

On 5/2/17 6:34 PM, David Fetter wrote:


On Tue, May 02, 2017 at 02:40:55PM +0200, Andreas Karlsson wrote:


On 05/02/2017 04:38 AM, Craig Ringer wrote:


On 1 May 2017 at 22:26, Andreas Karlsson <andr...@proxel.se> wrote:





...

I see some alternatives, none of them perfect.

1. Just remove the optimization fence and let people add OFFSET 0 to
their
queries if they want an optimization fence. This lets us keep pretending
that we do not have query hints (and therefore do not have to formalize
any
syntax for them) while still allowing people to add optimization fences.



+1

I get that people with gigantic PostgreSQL installations with
stringent performance requirements sometimes need to do odd things to
squeeze out the last few percentage points of performance.  As the
people (well, at least the people close to the ground) at these
organizations are fully aware, performance optimizations are extremely
volatile with respect to new versions of software, whether it's
PostgreSQL, Oracle, the Linux kernel, or what have you.  They expect
this, and they have processes in place to handle it.  If they don't,
it's pilot error.

We should not be penalizing all our other users to maintain the
fiction that people can treat performance optimizations as a "fire and
forget" matter.



Agreed.


2. Add a decorator for WITH (e.g. "WITH MATERIALIZED x (...") to add an
explicit optimization fence. This will for the first time add official
support for a query hint in the syntax which is a quite big precedent.



Yep.  It's one we should think very carefully before we introduce.



I think it's a mistake to see this as an introduction of query hints.

Firstly, it's a question whether it qualifies as a hint. I wouldn't call it
a hint, but let's assume there is a definition of query hints that includes
WITH MATERIALIZED.

More importantly, however, this is not introducing anything new. It's just a
different name for the current "WITH" semantics, and you can achieve the
same behavior by "OFFSET 0". And people are already using these as hints, so
I fail to see how this introduces anything new.

In fact, if you see the optimization fence as an implicit query hint, this
actually *removes* a hint (although most users are unaware of that behavior
and use it unintentionally).


+1 down the line.  More to the point, for several years now we've (or
at least I, but I'm not the only one) have been advocating for the
usage of CTE to avoid the undocumented and bizarre OFFSET 0 trick.
Jerking this out from users without giving a simple mechanic to get
the same behavior minus a major query rewrite is blatantly user
hostile.  I can't believe we're even contemplating it.   Also a GUC is
not a solution for pretty obvious reasons I think.



I'm not sure what you mean by "jerking this out from users". Isn't most 
of this thread about how to allow CTE inlining without hurting users 
unnecessarily?


I think we agree that:

* Just removing the optimization fence and telling users to use OFFSET 0 
instead is a no-go, just like removing the fencing and not providing any 
sensible replacement.


* GUC is not the solution.

Which leaves us with either WITH INLINE or WITH MATERIALIZE, or 
something along those lines.


If we go with WITH INLINE then we're likely not solving anything, 
because most people will simply use WITH just like now, and will be 
subject to the fencing without realizing it.


Or we will choose WITH MATERIALIZE, and then the users aware of the 
fencing (and using the CTEs for that purpose) will have to modify the 
queries. But does adding MATERIALIZE qualify as a major query rewrite?


Perhaps combining this with a GUC would be a solution. I mean, a GUC 
specifying the default behavior, and then INLINE / MATERIALIZE for 
individual CTEs in a query?


If you have an application intentionally using CTEs as a fence, just do

ALTER DATABASE x SET enable_guc_fencing = on

and you don't have to rewrite the queries.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CTE inlining

2017-05-02 Thread Tomas Vondra

On 5/2/17 6:34 PM, David Fetter wrote:

On Tue, May 02, 2017 at 02:40:55PM +0200, Andreas Karlsson wrote:

On 05/02/2017 04:38 AM, Craig Ringer wrote:

On 1 May 2017 at 22:26, Andreas Karlsson <andr...@proxel.se> wrote:

>>

...

I see some alternatives, none of them perfect.

1. Just remove the optimization fence and let people add OFFSET 0 to their
queries if they want an optimization fence. This lets us keep pretending
that we do not have query hints (and therefore do not have to formalize any
syntax for them) while still allowing people to add optimization fences.


+1

I get that people with gigantic PostgreSQL installations with
stringent performance requirements sometimes need to do odd things to
squeeze out the last few percentage points of performance.  As the
people (well, at least the people close to the ground) at these
organizations are fully aware, performance optimizations are extremely
volatile with respect to new versions of software, whether it's
PostgreSQL, Oracle, the Linux kernel, or what have you.  They expect
this, and they have processes in place to handle it.  If they don't,
it's pilot error.

We should not be penalizing all our other users to maintain the
fiction that people can treat performance optimizations as a "fire and
forget" matter.



Agreed.


2. Add a decorator for WITH (e.g. "WITH MATERIALIZED x (...") to add an
explicit optimization fence. This will for the first time add official
support for a query hint in the syntax which is a quite big precedent.


Yep.  It's one we should think very carefully before we introduce.



I think it's a mistake to see this as an introduction of query hints.

Firstly, it's a question whether it qualifies as a hint. I wouldn't call 
it a hint, but let's assume there is a definition of query hints that 
includes WITH MATERIALIZED.


More importantly, however, this is not introducing anything new. It's 
just a different name for the current "WITH" semantics, and you can 
achieve the same behavior by "OFFSET 0". And people are already using 
these as hints, so I fail to see how this introduces anything new.
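
That is, today both of these queries behave as an optimization fence, 
and the outer WHERE clause is not pushed down into the inner query 
(table and column names are made up):

  WITH w AS (SELECT * FROM big_table)
  SELECT * FROM w WHERE id = 123;

  SELECT *
    FROM (SELECT * FROM big_table OFFSET 0) w
   WHERE id = 123;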


In fact, if you see the optimization fence as an implicit query hint, 
this actually *removes* a hint (although most users are unaware of that 
behavior and use it unintentionally).




3. Add a new GUC which can enable and disable the optimization fence. This
is a very clumsy tool, but maybe good enough for some users and some people
here in this thread have complained about our similar GUCs.


Any GUC would be unable to distinguish one WITH clause from another.
The hammer would then be guaranteed to be too big for precisely the
cases where it's most needed.



If I could, I'd give -1 million to a GUC-based approach, as that would 
make it entirely unusable in practice, I think.


Actually, I can give -1 million, so I'm giving it.

>>

4. Add some new more generic query hinting facility. This is a lot
of work and something which would be very hard to get consensus for.


Just the design of the thing would be the work of months at a minimum,
assuming we got to some consensus at all.  Maybe it's worth doing.



While I came to the conclusion that query hints may be quite useful in some 
situations, I'm pretty sure this is not a battle you'd like to fight.



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CTE inlining

2017-05-02 Thread Tomas Vondra

On 5/2/17 4:44 PM, Andrew Dunstan wrote:



On 05/02/2017 10:13 AM, Merlin Moncure wrote:

On Sun, Apr 30, 2017 at 6:21 PM, Andres Freund <and...@anarazel.de> wrote:

On 2017-04-30 07:19:21 +0200, Pavel Stehule wrote:

why we cannot to introduce GUC option - enable_cteoptfence ?

Doesn't really solve the issue, and we've generally shied away from GUCs
that influence behaviour after a few bad experiences.  What if you want
one CTE inlined, but another one not?

Yeah.  Are we absolutely opposed to SQL syntax against WITH that
allows or disallows fencing?   for example,

WITH [MATERIALIZED]

Pushing people to OFFSET 0 is a giant step backwards IMO, and as in
implementation detail is also subject to change.




Agreed, it's an ugly as sin and completely non-obvious hack.



Isn't OFFSET 0 an implementation detail anyway? Who says the planner 
couldn't get smarter in the future and realize OFFSET 0 is a no-op?


In that case replacing the CTE optimization fence with "OFFSET 0" would be 
akin to painting yourself into a corner, waiting for the paint to dry, 
walking over to another corner and painting yourself into that one.


cheers

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CTE inlining

2017-05-01 Thread Tomas Vondra

On 05/01/2017 06:22 AM, Pavel Stehule wrote:



2017-05-01 1:21 GMT+02:00 Andres Freund <and...@anarazel.de
<mailto:and...@anarazel.de>>:

On 2017-04-30 07:19:21 +0200, Pavel Stehule wrote:
> why we cannot to introduce GUC option - enable_cteoptfence ?

Doesn't really solve the issue, and we've generally shied away from GUCs
that influence behaviour after a few bad experiences.  What if you want
one CTE inlined, but another one not?


It change behave in same sense like enable_nestloop, enable_hashjoin,
... with same limits.



Those (and also the other enable_*) GUCs are a great example of why we 
should not use GUCs for tweaking planner behavior, except perhaps for 
the purpose of investigation. It's an extremely blunt tool.


You typically want to affect just a single node in the query plan (say, 
one join), but those options don't allow you to do that. It's all or 
nothing thing.


Even if you're OK with affecting the whole query, it's a separate 
control channel - it's not embedded in the query, the user has to set it 
somehow. So you either set it for the whole session (affecting all the 
other queries that don't really need it), or you set it before each 
query. Which however sucks for a number of reasons, e.g. if you have a 
slow query in the log, how do you know with what GUC values it was 
executed? (You don't, and there's no way to find out.)


Exactly the same issues would affect this new GUC. It would be 
impossible to use multiple CTEs in the query with different fencing 
behavior, and it would be just as difficult to investigate.


So no more planner-affecting GUCs, please, particularly if we expect 
regular users to use them.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CTE inlining

2017-04-30 Thread Tomas Vondra

On 04/30/2017 09:46 AM, Andres Freund wrote:

Hi,

On 2017-04-30 13:58:14 +0800, Craig Ringer wrote:

We have OFFSET 0 for anyone really depending on it, and at least when you
read that you know to go "wtf" and look at the manual, wheras the CTE fence
behaviour is invisible and silent.


I don't think that's a good idea.  What if you need to prevent inlining
of something that actually needs an offset? What if the behaviour of
offset is ever supposed to change?  Relying more on that seems to just
be repeating the mistake around CTEs.



I agree with this. But OFFSET 0 would force people to modify the queries 
anyway, so why not just introduce a new keyword instead? Something like:


WITH FENCED a (SELECT ...)

But I think something like that was proposed not too long ago, and did 
not make it for some reason.


There's a lot of other CTE improvements that would be great. Say, being 
able to define indexes on them, but that's really a separate topic.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CTE inlining

2017-04-30 Thread Tomas Vondra

On 04/30/2017 06:28 AM, Tom Lane wrote:

Craig Ringer <craig.rin...@2ndquadrant.com> writes:

- as you noted, it is hard to decide when it's worth inlining vs
materializing for CTE terms referenced more than once.


[ raised eyebrow... ]  Please explain why the answer isn't trivially
"never".

There's already a pretty large hill to climb here in the way of
breaking peoples' expectations about CTEs being optimization
fences.  Breaking the documented semantics about CTEs being
single-evaluation seems to me to be an absolute non-starter.



I'm not sure that's a universal expectation, though. I know there are 
people who actually do rely on that intentionally, no doubt about that. 
And we'd need to make it work for them.


But I keep running into people who face serious performance issues 
exactly because they don't realize this and use CTEs as named subqueries. 
And when I tell them "optimization fence" they react "Whaaat?"


If I had to make up some numbers, I'd say the "What?" group is about 
10x the group of people who intentionally rely on CTEs being 
optimization fences.


FWIW I don't know how to do this. There were multiple attempts at this 
in the past, none of them succeeded. But perhaps we could at least 
propagate some of the CTE features, so that the outside query can 
benefit from that (e.g. when the CTE is sorted, we could set the 
sortkeys). That wouldn't break the fence thing, but it would allow other 
stuff.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] to-do item for explain analyze of hash aggregates?

2017-04-24 Thread Tomas Vondra

On 04/24/2017 10:55 PM, Jeff Janes wrote:

On Mon, Apr 24, 2017 at 12:13 PM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:

On 04/24/2017 08:52 PM, Andres Freund wrote:

...

I've wanted that too.  It's not impossible at all.


Why wouldn't that be possible? We probably can't use exactly the
same approach as Hash, because hashjoins use a custom hash table while
hashagg uses dynahash IIRC. But why couldn't we measure the amount of
memory by looking at the memory context, for example?


He said "not impossible", meaning it is possible.



Ah, the dreaded double negative ...


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] to-do item for explain analyze of hash aggregates?

2017-04-24 Thread Tomas Vondra

On 04/24/2017 08:52 PM, Andres Freund wrote:

On 2017-04-24 11:42:12 -0700, Jeff Janes wrote:

The explain analyze of the hash step of a hash join reports something like
this:

   ->  Hash  (cost=458287.68..458287.68 rows=24995368 width=37) (actual
rows=24995353 loops=1)
 Buckets: 33554432  Batches: 1  Memory Usage: 2019630kB


Should the HashAggregate node also report on Buckets and Memory Usage?  I
would have found that useful several times.  Is there some reason this is
not wanted, or not possible?


I've wanted that too.  It's not impossible at all.



Why wouldn't that be possible? We probably can't use exactly the same 
approach as Hash, because hashjoins use a custom hash table while hashagg 
uses dynahash IIRC. But why couldn't we measure the amount of memory by 
looking at the memory context, for example?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Do we need multi-column frequency/histogram stats? WAS Re: [HACKERS] Statistics "dependency"

2017-04-23 Thread Tomas Vondra

On 04/23/2017 04:16 PM, Bruce Momjian wrote:

On Sun, Apr 23, 2017 at 10:01:16AM -0400, Bruce Momjian wrote:

On Sun, Apr 23, 2017 at 11:44:12AM +0100, Simon Riggs wrote:

For us "functional dependency" would sound like something to do with
functions (e.g. CREATE FUNCTION), so just "dependency" appears to me
to be the best term for this.

There are multiple statistics for dependency stored, hence
"dependencies". I don't like it, but its the best term I can see at
present.


OK, thank you for the reply, and I am sorry I forgot the previous
discussion.  I just wanted to re-check we had research this.  Thanks.


(Email subject updated.)

Actually, I have a larger question that I was thinking about.  Because
we already have lots of per-column stats, and now the dependency score,
is it possible to mix the per-column stats and dependency score in a way
that multi-column frequency/histogram stats are not necessary?  That
might be a less costly approach I had not considered.


Certainly not. Functional dependencies are "global" statistics, and only 
a very specific type of them. A functional dependency only tells you that 
a particular column "implies" another column, i.e. knowledge of a value 
in A means there's only a single possible value in B. That has a number 
of implications:


* It only works for equality conditions, not inequalities and so on.

* It assumes the queries are "consistent" with the functional dependencies.

* There are dependencies/correlations that are not functional 
dependencies, while MCV/histograms would help.


Functional dependencies were the simplest type of extended statistics, 
and so they were the first one to be implemented (introducing all the 
infrastructure). But we still need the other types.


The other types of statistics actually track correlation between values 
in the columns, not just "column A implies column B".



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Statistics "dependency"

2017-04-23 Thread Tomas Vondra

On 04/23/2017 12:44 PM, Simon Riggs wrote:

On 23 April 2017 at 09:17, Dean Rasheed <dean.a.rash...@gmail.com> wrote:

On 23 April 2017 at 03:37, Bruce Momjian <br...@momjian.us> wrote:

In looking at the new multi-column statistics "dependency" option in
Postgres 10, I am quite confused by the term "dependency".  Wouldn't
"correlation" be clearer and less confusing as "column dependency"
already means something else.



I also asked that exactly that question...


Actually, the terms "dependency" and "correlation" are both quite
broad terms that cover a whole range of other different things, and
hence could be misleading. The precise term for this is "functional
dependency" [1], so if anything, the option name should be
"functional_dependencies" or some shortening of that, keeping a part
of each of those words.


...and got that answer also.

For us "functional dependency" would sound like something to do with
functions (e.g. CREATE FUNCTION), so just "dependency" appears to me
to be the best term for this.



Not really. Functional dependency is a well-defined term in relational 
algebra, particularly in the definition of normal forms. It has nothing 
to do with functions, and I'm sure it's not the only possible "ambiguity".


I actually considered functional_dependency when working on this, and I 
think it might be a better choice, but it seemed a bit longish.



There are multiple statistics for dependency stored, hence
"dependencies". I don't like it, but its the best term I can see at
present.


That is a good point, actually. It should probably be plural.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WITH clause in CREATE STATISTICS

2017-04-20 Thread Tomas Vondra

On 04/21/2017 12:13 AM, Tom Lane wrote:

Alvaro Herrera <alvhe...@2ndquadrant.com> writes:

Simon just pointed out that having the WITH clause appear in the middle
of the CREATE STATISTICS command looks odd; apparently somebody else
already complained on list about the same.  Other commands put the WITH
clause at the end, so perhaps we should do likewise in the new command.



Here's a patch to implement that.  I verified that if I change
qualified_name to qualified_name_list, bison does not complain about
conflicts, so this new syntax should support extension to multiple
relations without a problem.


Yeah, WITH is fully reserved, so as long as the clause looks like
WITH ( stuff... ) you're pretty much gonna be able to drop it
wherever you want.


Discuss.


+1 for WITH at the end; the existing syntax looks weird to me too.



-1 from me

I like the current syntax more, and WHERE ... WITH seems a bit weird to 
me. But more importantly, one thing Dean probably considered when 
proposing the current syntax is that we may later add support for partial 
statistics, pretty much like partial indexes. And we don't allow WITH at 
the end (after WHERE) for indexes:


test=# create index on t (a) where a < 100 with (fillfactor=10);
ERROR:  syntax error at or near "with"
LINE 1: create index on t (a) where a < 100 with (fillfactor=10);
^
test=# create index on t (a) with (fillfactor=10) where a < 100;


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_statistic_ext.staenabled might not be the best column name

2017-04-12 Thread Tomas Vondra



On 04/12/2017 03:36 PM, David Rowley wrote:


"stakind" seems like a better name. I'd have personally gone with
"statype" but pg_statistic already thinks stakind is better.


+1 to stakind

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_stats_ext view does not seem all that useful

2017-04-10 Thread Tomas Vondra

On 04/10/2017 12:12 PM, David Rowley wrote:

During my review and time spent working on the functional dependencies
part of extended statistics I wondered what was the use for the
pg_stats_ext view. I was unsure why the length of the serialised
dependencies was useful.

Perhaps we could improve the view, but I'm not all that sure what
value it would add. Maybe we need to discuss this, but in the
meantime, I've attached a patch which just removes the view completely



Yeah, let's get rid of the view. It was quite useful before introducing 
the custom data types (and implicit casts to text), because 
pg_statistic_ext would simply return the whole bytea value. But since 
then the view has kinda lost its purpose and no one really noticed.
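
With the text output for those types, querying the raw catalog directly 
is good enough, e.g. (column names as they ended up in the released 
PostgreSQL 10 catalog):

  SELECT stxname, stxkind, stxndistinct, stxdependencies
    FROM pg_statistic_ext;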


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] strange parallel query behavior after OOM crashes

2017-04-10 Thread Tomas Vondra

On 04/10/2017 01:39 PM, Kuntal Ghosh wrote:

On Thu, Apr 6, 2017 at 6:50 AM, Robert Haas <robertmh...@gmail.com> wrote:

On Wed, Apr 5, 2017 at 8:17 PM, Neha Khatri <nehakhat...@gmail.com> wrote:

The problem here seem to be the change in the max_parallel_workers value
while the parallel workers are still under execution. So this poses two
questions:

1. From usecase point of view, why could there be a need to tweak the
max_parallel_workers exactly at the time when the parallel workers are at
play.
2. Could there be a restriction on tweaking of max_parallel_workers while
the parallel workers are at play? At least do not allow setting the
max_parallel_workers less than the current # of active parallel workers.


Well, that would be letting the tail wag the dog.  The maximum value
of max_parallel_workers is only 1024, and what we're really worried
about here is seeing a value near PG_UINT32_MAX, which leaves a lot of
daylight.  How about just creating a #define that's used by guc.c as
the maximum for the GUC, and here we assert that we're <= that value?


I've added a GUC parameter named MAX_PARALLEL_WORKER_LIMIT and
asserted that the absolute difference between the two counts <= that
value, i.e.,
abs((int)(register_count - terminate_count)) <= MAX_PARALLEL_WORKER_LIMIT;

Because, it's possible that register count has overflowed but
terminate count has not yet overflowed. As a result, the difference in
unsigned integer will be near UINT32_MAX. Hence, we need the absolute
difference after typecasting the same to integer. This value should be
less than the max_parallel_workers upper limit.

I've attached both the patches for better accessibility. PFA.



At first I was like 'WTF? Why do we need a new GUC just because of an 
assert?' But you're actually not adding a new GUC parameter, you're 
adding a constant which is then used as the maximum value for the two 
existing parallel GUCs.


I think this is fine.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance issue with postgres9.6

2017-04-07 Thread Tomas Vondra

On 04/07/2017 06:31 PM, Merlin Moncure wrote:

On Fri, Apr 7, 2017 at 5:16 AM, Prakash Itnal <prakash...@gmail.com> wrote:

Hello,

We currently use psotgres 9.3 in our products. Recently we upgraded to
postgres 9.6. But with 9.6 we have seen a drastic reduction in throughput.
After analyzing carefully I found that "planner time" in 9.6 is very high.
Below are the details:

Scenario:
1 Create a table with 10 rows.
2 Execute simple query: select * from subscriber where s_id = 100;
3 No update/delete/insert; tried vacuum, full vacuum; by default we enable
auto-vacuum

9.3: Avg of "Total runtime" : 0.24ms [actual throughput: 650 TPS]
9.6: Avg of Total time: 0.56ms (Avg of "Planning time" : 0.38ms + Avg of
"Execution time" : 0.18ms) [actual throughput: 80 TPS]


I think your math is off.  Looking at your attachments, planning time
is 0.056ms, not 0.56ms.  This is in no way relevant to performance on
the order of your measured TPS.   How are you measuring TPS?



Not sure where you got the 0.056ms from. What I see in the 9.3 explain 
output is this:


 Total runtime: 0.246 ms

and this in those from 9.6:

 Planning time: 0.396 ms

 Execution time: 0.181 ms


That is roughly 0.25ms vs. 0.6ms (0.4+0.2), as reported by Prakash.

Obviously, this "just" 2x slowdown, so it does not match the drop from 
650 to 80 tps. Also, 0.25ms would be ~4000 tps, so I guess this was just 
an example of a query that slowed down.
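
One simple way to check how much of that is really planning overhead is 
to plan just once and execute repeatedly, e.g. (a sketch using the query 
from the report):

  PREPARE sub (int) AS SELECT * FROM subscriber WHERE s_id = $1;

  -- repeated executions reuse the prepared statement; after the first
  -- few executions the planner may switch to a cached generic plan, so
  -- the per-execution planning cost mostly disappears
  EXECUTE sub(100);
  EXECUTE sub(100);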


Prakash, are you using packages (which ones?), or have you compiled from 
sources? Can you provide pg_config output from both versions, and also 
'select * from pg_settings' (the full config)?


It might also be useful to collect profiles, i.e. (1) install debug 
symbols (2) run the query in a loop and (3) collect profiles from that 
one backend using 'perf'.


I assume you're using the same hardware / machine for the tests?

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] increasing the default WAL segment size

2017-04-06 Thread Tomas Vondra

On 04/06/2017 11:45 PM, David Steele wrote:

On 4/6/17 5:05 PM, Tomas Vondra wrote:

On 04/06/2017 08:33 PM, David Steele wrote:

On 4/5/17 7:29 AM, Simon Riggs wrote:


2. It's not clear to me the advantage of being able to pick varying
filesizes. I see great disadvantage in having too many options, which
greatly increases the chance of incompatibility, annoyance and
breakage. I favour a small number of values that have been shown by
testing to be sweet spots in performance and usability. (1GB has been
suggested)


I'm in favor of 16,64,256,1024.



I don't see a particular reason for this, TBH. The sweet spots will
likely depend on hardware / OS configuration etc. Assuming there
actually are sweet spots - no one has demonstrated that yet.


Fair enough, but my feeling is that this patch has never been about
server performance, per se.  Rather, it is about archive management and
trying to stem the tide of WAL as servers get bigger and busier.
Generally, archive commands have to make a remote connection to offload
WAL and that has a cost per segment.



Perhaps, although Robert also mentioned that the fsync at the end of 
each WAL segment is noticeable. But the thread is a bit difficult to 
follow, different people have different ideas about the motivation of 
the patch, etc.



Also, I don't see how supporting additional WAL sizes increases the
chance of incompatibility. We already allow that, so either the tools
(e.g. backup solutions) assume WAL segments are always 16MB (in which
case they are essentially broken), or they support all valid file sizes
(in which case they should have no issues with the new ones).


I don't see how a compile-time option counts as "supporting that" in
practice.  How many people in the field are running custom builds of
Postgres?  And of those, how many have changed the WAL segment size?
I've never encountered a non-standard segment size or talked to anyone
who has.  I'm not saying it has *never* happened but I would venture to
say it's rare.



I agree it's rare, but I don't think that means we can just consider the 
option 'unsupported'. We're even mentioning it in the docs as a valid 
way to customize the granularity of WAL archival.


I certainly know people who run custom builds, and some of them run with 
custom WAL segment size. Some of them are our customers, some are not. 
And yes, some of them actually patched the code to allow 256MB WAL segments.



If we're going to do this, I'm in favor of deciding some reasonable
upper limit (say, 1GB or 2GB sounds good), and allowing all 2^n values
up to that limit.


I'm OK with that.  I'm also OK with providing a few reasonable choices.
I guess that means I'll just go with the majority opinion.


3. New file allocation has been a problem raised with this patch for
some months now.


I've been playing around with this and I don't think short tests show
larger sizes off to advantage.  Larger segments will definitely perform
more poorly until Postgres starts recycling WAL.  Once that happens I
think performance differences should be negligible, though of course
this needs to be verified with longer-running tests.


I'm willing to do some extensive performance testing on the patch. I
don't see how that could happen in the next few days (before the feature
freeze), particularly considering we're interested in long tests.


Cool.  I've been thinking about how to do some meaningful tests for this
(mostly pgbench related).  I'd like to hear what you are thinking.



My plan was to do some pgbench tests with different workloads, scales 
(fits into shared buffers, fits into RAM, exceeds RAM), and different 
storage configurations (SSD vs. HDD, WAL/datadir on the same/different 
device/fs, possibly also ext4/xfs).



The question however is whether we need to do this testing when we don't
actually change the default (at least the patch submitted on 3/27 does
seem to keep the 16MB). I assume people specifying a custom value when
calling initdb are expected to know what they are doing (and I don't see
how we can prevent distros from choosing a bad value in their packages -
they could already do that with configure-time option).


Just because we don't change the default doesn't mean that others won't.
 I still think testing for sizes other than 16MB is severely lacking and
I don't believe caveat emptor is the way to go.



Aren't you mixing regression and performance testing? I agree we need to 
be sure all segment sizes are handled correctly, no argument here.



Do we actually have any infrastructure for that? Or do you plan to add
some new animals with different WAL segment sizes?


I don't have plans to add animals.  I think we'd need a way to tell
'make check' to use a different segment size for tests and then
hopefully reconfigure some of the existing animals.



OK. My point was that we don't have that capability now, and the latest 
patch is not adding it either.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] increasing the default WAL segment size

2017-04-06 Thread Tomas Vondra

On 04/06/2017 08:33 PM, David Steele wrote:

On 4/5/17 7:29 AM, Simon Riggs wrote:

On 5 April 2017 at 06:04, Beena Emerson <memissemer...@gmail.com> wrote:


The  WALfilename - LSN mapping disruption for higher values you mean? Is
there anything else I have missed?


I see various issues raised but not properly addressed

1. we would need to drop support for segment sizes < 16MB unless we
adopt a new incompatible filename format.
I think at 16MB the naming should be the same as now and that
WALfilename -> LSN is very important.
For this release, I think we should just allow >= 16MB and avoid the
issue thru lack of time.


+1.


2. It's not clear to me the advantage of being able to pick varying
filesizes. I see great disadvantage in having too many options, which
greatly increases the chance of incompatibility, annoyance and
breakage. I favour a small number of values that have been shown by
testing to be sweet spots in performance and usability. (1GB has been
suggested)


I'm in favor of 16,64,256,1024.



I don't see a particular reason for this, TBH. The sweet spots will 
likely depend on hardware / OS configuration etc. Assuming there 
actually are sweet spots - no one has demonstrated that yet.


Also, I don't see how supporting additional WAL sizes increases the 
chance of incompatibility. We already allow that, so either the tools 
(e.g. backup solutions) assume WAL segments are always 16MB (in which 
case they are essentially broken), or they support all valid file sizes 
(in which case they should have no issues with the new ones).
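
For example, the mapping between an LSN and the segment file name 
depends directly on the segment size, which is exactly the thing such 
tools tend to hard-code (function names as in PostgreSQL 10; on 9.x it's 
pg_current_xlog_location() / pg_xlogfile_name()):

  -- which segment file does the current insert position fall into?
  SELECT pg_current_wal_lsn() AS lsn,
         pg_walfile_name(pg_current_wal_lsn()) AS segment_file;

  -- the last 8 hex digits of the name are (LSN / segment_size) modulo
  -- (4GB / segment_size), so with 16MB segments they never go past
  -- 000000FF - a different segment size changes that pattern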


If we're going to do this, I'm in favor of deciding some reasonable 
upper limit (say, 1GB or 2GB sounds good), and allowing all 2^n values 
up to that limit.



3. New file allocation has been a problem raised with this patch for
some months now.


I've been playing around with this and I don't think short tests show
larger sizes off to advantage.  Larger segments will definitely perform
more poorly until Postgres starts recycling WAL.  Once that happens I
think performance differences should be negligible, though of course
this needs to be verified with longer-running tests.



I'm willing to do some extensive performance testing on the patch. I 
don't see how that could happen in the next few days (before the feature 
freeze), particularly considering we're interested in long tests.


The question however is whether we need to do this testing when we don't 
actually change the default (at least the patch submitted on 3/27 does 
seem to keep the 16MB). I assume people specifying a custom value when 
calling initdb are expected to know what they are doing (and I don't see 
how we can prevent distros from choosing a bad value in their packages - 
they could already do that with configure-time option).



If archive_timeout is set then there will also be additional time taken
to zero out potentially larger unused parts of the segment.  I don't see
this as an issue, however, because if archive_timeout is being triggered
then the system is very likely under lower than usual load.


Lack of clarity and/or movement on these issues is very, very close to
getting the patch rejected now. Let's not approach this with the
viewpoint that I or others want it to be rejected, lets work forwards
and get some solid changes that will improve the world without causing
problems.


I would definitely like to see this go in, though I agree with Peter
that we need a lot more testing.  I'd like to see some build farm
animals testing with different sizes at the very least, even if there's
no time to add new tests.



Do we actually have any infrastructure for that? Or do you plan to add 
some new animals with different WAL segment sizes?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

