Re: [HACKERS] what to revert

2016-05-10 Thread Noah Misch
On Tue, May 10, 2016 at 10:05:13AM -0500, Kevin Grittner wrote:
> On Tue, May 10, 2016 at 9:02 AM, Tom Lane  wrote:
> > Kevin Grittner  writes:
> >> There were 75 samples each of "disabled" and "reverted" in the
> >> spreadsheet.  Averaging them all, I see this:
> >
> >> reverted:  290,660 TPS
> >> disabled:  292,014 TPS
> >
> >> That's a 0.46% overall increase in performance with the patch,
> >> disabled, compared to reverting it.  I'm surprised that you
> >> consider that to be a "clearly measurable difference".  I mean, it
> >> was measured and it is a difference, but it seems to be well within
> >> the noise.  Even though it is based on 150 samples, I'm not sure we
> >> should consider it statistically significant.
> >
> > You don't have to guess about that --- compare it to the standard
> > deviation within each group.
> 
> My statistics skills are rusty, but I thought that just gives you
> an effect size, not any idea of whether the effect is statistically
> significant.

I discourage focusing on the statistical significance, because the hypothesis
in question ("Applying revert.patch to 4bbc1a7e decreases 'pgbench -S -M
prepared -j N -c N' tps by 0.46%.") is already an unreliable proxy for
anything we care about.  PostgreSQL performance variation due to incidental,
ephemeral binary layout motion is roughly +/-5%.  Assuming perfect confidence
that 4bbc1a7e+revert.patch is 0.46% slower than 4bbc1a7e, the long-term effect
of revert.patch could be anywhere from -5% to +4%.
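To put numbers on Tom's suggestion upthread, the per-run samples could be
compared directly in SQL (a sketch; bench_results(variant, tps) is a
hypothetical table holding the 75 per-run figures for each group):

```sql
-- Compare the difference in means against each group's standard error.
SELECT variant,
       count(*)                                  AS runs,
       round(avg(tps)::numeric, 0)               AS mean_tps,
       round(stddev_samp(tps)::numeric, 0)       AS stddev,
       round((stddev_samp(tps) / sqrt(count(*)))::numeric, 0) AS std_err
FROM bench_results            -- hypothetical table of per-run tps results
GROUP BY variant;
-- If the ~1,350 tps difference in means is within roughly twice the
-- combined standard error, it is indistinguishable from run-to-run noise;
-- and per the point above, binary layout alone can move results by +/-5%.
```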

If one wishes to make benchmark-driven decisions about single-digit
performance changes, one must control for binary layout effects:
http://www.postgresql.org/message-id/87vbitb2zp@news-spur.riddles.org.uk
http://www.postgresql.org/message-id/20160416204452.ga1910...@tornado.leadboat.com

nm


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_dump vs. TRANSFORMs

2016-05-10 Thread Jim Nasby

On 5/5/16 8:33 AM, Stephen Frost wrote:

I strongly disagree with the idea that this is only an issue with the
testing system.  What if we add functions in the future that are
created by initdb and *are* useful for transforms?  What about casts?
There are a lot of functions in pg_catalog that a user might wish to use
to define their own cast.  This also doesn't do anything about users
creating functions in pg_catalog.


+1 to all of that. I know that I've at least created casts using 
built-in functions during testing, and I think I might be doing it in 
some of my extensions.


As for transforms, I suspect we're going to end up with some of those in 
initdb in the future, because it's currently the only way you can 
improve existing type transformations without breaking existing PL code.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461




Re: [HACKERS] Does Type Have = Operator?

2016-05-10 Thread David E. Wheeler
On May 10, 2016, at 5:56 PM, Fabrízio de Royes Mello  
wrote:

> Searching for the operator in pg_operator catalog isn't enought?

Seems like overkill, but will do if there’s nothing else.

Best,

David





Re: [HACKERS] Re: new tests post-feature freeze (was pgsql: Add TAP tests for pg_dump)

2016-05-10 Thread Jim Nasby

On 5/8/16 11:29 AM, Stephen Frost wrote:

My suggestion is that, from this point forward, we add new tests to
> 9.6 only if they are closely related to a bug that is getting fixed or
> a feature that is new in 9.6.  I think that's a reasonable compromise,
> but what do others think?

I'm willing to accept that compromise, but I'm not thrilled with it due
to what it will mean for the process I'm currently going through.  The
approach I've been using has been to add tests to gain more code
coverage of the code in pg_dump.  That has turned up multiple
pre-existing bugs in pg_dump but the vast majority of the tests come
back clean.  This compromise would mean that I'd continue to work
through the code coverage tests, but would have to segregate out and
commit only those tests which actually reveal bugs, once those bugs have
been fixed (as to avoid turning the buildfarm red).  The rest of the
tests would still get written, but since they currently don't reveal
bugs, they would be shelved until development is opened for 9.7.


Having done extensive database unit testing in the past, I've 
experienced what Stephen's talking about and agree it's a real pain. 
With tap-style tests (or really anything that's not dependent on 
something as fragile as diffing output), there's pretty low risk to 
adding more tests that are passing.


As for tests distracting people from reviewing patches, robust tests 
significantly reduce the need for manual review. A methodical approach of 
writing tests to verify that a feature works is much better than just 
randomly banging on it.





Re: [HACKERS] Does Type Have = Operator?

2016-05-10 Thread David E. Wheeler
On May 10, 2016, at 6:14 PM, Tom Lane  wrote:

> Given that you're coercing both one input value and the result to text,
> I don't understand why you don't just compare the text representations.

Because sometimes the raw text is not equal even when the cast values are. Consider

'foo'::citext = 'FOO'::citext

> I'm also not very clear on what you mean by "comparing column defaults".
> A column default is an expression (in the general case anyway), not just
> a value of the type.

Yeah, the pgTAP column_default_is() function takes a string representation of 
an expression.

> Maybe if you'd shown us the is() function, as well as a typical usage
> of _def_is(), this would be less opaque.

Here’s is():

CREATE OR REPLACE FUNCTION is (anyelement, anyelement, text)
RETURNS TEXT AS $$
DECLARE
    result BOOLEAN;
    output TEXT;
BEGIN
    -- Would prefer $1 IS NOT DISTINCT FROM, but that's not supported by 8.1.
    result := NOT $1 IS DISTINCT FROM $2;
    output := ok( result, $3 );
    RETURN output || CASE result WHEN TRUE THEN '' ELSE E'\n' || diag(
           'have: ' || CASE WHEN $1 IS NULL THEN 'NULL' ELSE $1::text END ||
        E'\nwant: ' || CASE WHEN $2 IS NULL THEN 'NULL' ELSE $2::text END
    ) END;
END;
$$ LANGUAGE plpgsql;

_def_is() is called by another function, which effectively is:

CREATE OR REPLACE FUNCTION _cdi ( NAME, NAME, NAME, anyelement, TEXT )
RETURNS TEXT AS $$
BEGIN
    RETURN (
        SELECT _def_is(
                   pg_catalog.pg_get_expr(d.adbin, d.adrelid),
                   pg_catalog.format_type(a.atttypid, a.atttypmod),
                   $4, $5
               )
          FROM pg_catalog.pg_namespace n, pg_catalog.pg_class c,
               pg_catalog.pg_attribute a, pg_catalog.pg_attrdef d
         WHERE n.oid = c.relnamespace
           AND c.oid = a.attrelid
           AND a.atthasdef
           AND a.attrelid = d.adrelid
           AND a.attnum = d.adnum
           AND n.nspname = $1
           AND c.relname = $2
           AND a.attnum > 0
           AND NOT a.attisdropped
           AND a.attname = $3
    );
END;
$$ LANGUAGE plpgsql;

That function is called like this:

_cdi( :schema, :table, :column, :default, :description );

Best,

David







Re: [HACKERS] Reviewing freeze map code

2016-05-10 Thread Jim Nasby

On 5/10/16 11:42 PM, Jim Nasby wrote:

On 5/6/16 4:55 PM, Peter Geoghegan wrote:

On Fri, May 6, 2016 at 2:49 PM, Andres Freund  wrote:

Jeff Janes has done astounding work in these matters.  (I don't think
we credit him enough for that.)


+many.


Agreed. I'm a huge fan of what Jeff has been able to do in this area.
I often say so. It would be even better if Jeff's approach to testing
was followed as an example by other people, but I wouldn't bet on it
ever happening. It requires real persistence and deep understanding to
do well.


It takes deep understanding to *design* the tests, not to write them.
There's a lot of folks out there that will never understand enough to
design tests meant to expose data corruption but who could easily code
someone else's design, especially if we provided tools/ways to tweak a
cluster to make testing easier/faster (such as artificially advancing
XID/MXID).


Speaking of which, another email in the thread made me realize that 
there's a test condition no one has mentioned: verifying we don't lose 
tuples after wraparound.


To test this, you'd want a table that's mostly frozen. Ideally, dirty a 
single tuple on a bunch of frozen pages, with committed updates, 
deletes, and un-committed inserts. Advance XID far enough to get you 
close to wrap-around. Do a vacuum, SELECT count(*), advance XID past 
wraparound, SELECT count(*) again and you should get the same number.
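A rough SQL sketch of that procedure (hypothetical table name and sizes;
advancing the XID requires an external tool, as discussed elsewhere in this
thread):

```sql
CREATE TABLE freeze_test AS
    SELECT g AS id, 0 AS n FROM generate_series(1, 1000000) g;
VACUUM FREEZE freeze_test;                 -- mark most pages frozen

-- Dirty roughly one tuple per frozen page with committed updates/deletes.
UPDATE freeze_test SET n = 1 WHERE id % 100 = 0;
DELETE FROM freeze_test WHERE id % 101 = 0;
-- (also leave some uncommitted inserts open in another session)

-- <advance XID close to wraparound with an external tool>
VACUUM freeze_test;
SELECT count(*) FROM freeze_test;          -- remember this number
-- <advance XID past wraparound>
SELECT count(*) FROM freeze_test;          -- must match the earlier count
```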





Re: [HACKERS] Perf Benchmarking and regression.

2016-05-10 Thread Ashutosh Sharma
Hi Andres,

I am extremely sorry for the delayed response.  As suggested by you, I have
taken the performance readings at 128 client counts after making the
following two changes:

*1).* Removed AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, -1, NULL,
NULL); from pq_init(). Below is the git diff for the same.

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 8d6eb0b..399d54b 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -206,7 +206,9 @@ pq_init(void)
AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE,
MyProcPort->sock,
  NULL, NULL);
AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
+#if 0
AddWaitEventToSet(FeBeWaitSet, WL_POSTMASTER_DEATH, -1, NULL, NULL);
+#endif

*2).* Disabled the guc vars "bgwriter_flush_after",
"checkpointer_flush_after" and "backend_flush_after" by setting them to
zero.

After making the above two changes, below are the readings I got for 128
clients:

*CASE :* Read-Write Tests when data exceeds shared buffers.

Non Default settings and test
./postgres -c shared_buffers=8GB -N 200 -c min_wal_size=15GB -c
max_wal_size=20GB -c checkpoint_timeout=900 -c maintenance_work_mem=1GB -c
checkpoint_completion_target=0.9 &

./pgbench -i -s 1000 postgres

./pgbench -c 128 -j 128 -T 1800 -M prepared postgres

*Run1 :* tps = 9690.678225
*Run2 :* tps = 9904.320645
*Run3 :* tps = 9943.547176

Please let me know if I need to take readings with other client counts as
well.

*Note:* I have taken these readings on postgres master head at,

commit 91fd1df4aad2141859310564b498a3e28055ee28
Author: Tom Lane 
Date:   Sun May 8 16:53:55 2016 -0400

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Wed, May 11, 2016 at 3:53 AM, Andres Freund  wrote:

> Hi,
>
> On 2016-05-06 21:21:11 +0530, Mithun Cy wrote:
> > I will try to run the tests as you have suggested and will report the
> same.
>
> Any news on that front?
>
> Regards,
>
> Andres
>
>
>


Re: [HACKERS] Reviewing freeze map code

2016-05-10 Thread Jim Nasby

On 5/6/16 4:08 PM, Joshua D. Drake wrote:


VACUUM THEWHOLEDAMNTHING


+100

(hahahaha)


You know what? Why not? Seriously? We aren't a product. This is supposed
to be a bit of fun. Let's have some fun with it. It would be so easy to
turn that into a positive advocacy opportunity.


Honestly, for an option this obscure, I agree. I don't think we'd want 
any normally used stuff named so glibly, but I sure as heck could have 
used some easter-eggs like this when I was doing training.





Re: [HACKERS] Reviewing freeze map code

2016-05-10 Thread Jim Nasby

On 5/6/16 4:55 PM, Peter Geoghegan wrote:

On Fri, May 6, 2016 at 2:49 PM, Andres Freund  wrote:

Jeff Janes has done astounding work in these matters.  (I don't think
we credit him enough for that.)


+many.


Agreed. I'm a huge fan of what Jeff has been able to do in this area.
I often say so. It would be even better if Jeff's approach to testing
was followed as an example by other people, but I wouldn't bet on it
ever happening. It requires real persistence and deep understanding to
do well.


It takes deep understanding to *design* the tests, not to write them. 
There's a lot of folks out there that will never understand enough to 
design tests meant to expose data corruption but who could easily code 
someone else's design, especially if we provided tools/ways to tweak a 
cluster to make testing easier/faster (such as artificially advancing 
XID/MXID).





Re: [HACKERS] Reviewing freeze map code

2016-05-10 Thread Jim Nasby

On 5/6/16 4:20 PM, Andres Freund wrote:

On 2016-05-06 14:15:47 -0700, Josh berkus wrote:

For the serious testing, does anyone have a good technique for creating
loads which would stress-test vacuum freezing?  It's hard for me to come
up with anything which wouldn't be very time-and-resource intensive
(like running at 10,000 TPS for a week).


I've changed the limits for freezing options a while back, so you can
now set autovacuum_freeze_max_age as low as 10 (best set
vacuum_freeze_table_age accordingly).  You'll have to come up with a
workload that doesn't overwrite all data continuously (otherwise
there'll never be old rows), but otherwise it should now be fairly easy
to test that kind of scenario.


There's also been a tool for forcibly advancing XID floating around for 
quite some time. Using that could have the added benefit of verifying 
anti-wrap still works correctly. (Might be worth testing mxid wrap too...)





Re: [HACKERS] parallel.c is not marked as test covered

2016-05-10 Thread Tom Lane
Robert Haas  writes:
> On Mon, May 9, 2016 at 11:11 AM, Tom Lane  wrote:
>> regression=# set force_parallel_mode TO on;
>> SET
>> regression=# explain select count(*) from tenk1;
>> QUERY PLAN
>> ------------------------------------------------------------
>> Aggregate  (cost=483.00..483.01 rows=1 width=8)
>> ->  Seq Scan on tenk1  (cost=0.00..458.00 rows=1 width=0)
>> (2 rows)
>> 
>> Methinks force_parallel_mode is a bit broken.

> Hmm, that is strange.  I would have expected that to stuff a Gather on
> top of the Aggregate.  I wonder why it's not doing that.

The reason is that create_plain_partial_paths() contains a hard-wired
decision not to generate any partial paths for relations smaller than
1000 blocks, which means that no partial path will ever be generated
for *any* relation in the standard regression tests, force_parallel_mode
or no.

I would just go fix this, along the lines of

*** create_plain_partial_paths(PlannerInfo *
*** 702,708 
 * with all of its inheritance siblings it may well pay off.
 */
if (rel->pages < parallel_threshold &&
!   rel->reloptkind == RELOPT_BASEREL)
return;
  
/*
--- 703,710 
 * with all of its inheritance siblings it may well pay off.
 */
if (rel->pages < parallel_threshold &&
!   rel->reloptkind == RELOPT_BASEREL &&
!   force_parallel_mode == FORCE_PARALLEL_OFF)
return;
  
/*

except that doing so and running the regression tests with
force_parallel_mode = on results in core dumps.  Some debugging seems
indicated ... and we should at this point assume that there has been no
useful testing of parallel query in the buildfarm, not even on Noah's
machines.

regards, tom lane




Re: [HACKERS] Stopping logical replication protocol

2016-05-10 Thread Craig Ringer
On 11 May 2016 at 06:47, Vladimir Gordiychuk  wrote:

> Same thread, I just think these are two somewhat separate changes. One is
>> just in the walsender and allows return to command mode during waiting for
>> WAL. The other is more intrusive into the reorder buffer etc and allows
>> aborting decoding during commit processing. So two separate patches make
>> sense here IMO, one on top of the other.
>
>
> About the second part of the patch. What is the reason to decode and send a
> whole transaction? Why can't we process logical decoding via WalSndLoop LSN
> by LSN, as it works in physical replication? For example, if a transaction
> spans more than one LSN, first we decode and send "begin" and the data from
> the current LSN, then return to WalSndLoop; on the next iteration we send
> the next part of the data and "commit". I haven't researched this approach,
> because I think it would require bigger changes than a callback that stops
> sending.
>
>

There are two parts to that. First, why do we reorder at all, accumulating
a whole transaction in a reorder buffer until we see a commit then sending
it all at once? Second, when sending, why don't we return control to the
walsender between messages?

For the first: reordering xacts server-side lets the client not worry about
replay order. It just applies them as it receives them. It means the server
can omit uncommitted transactions from the stream entirely and clients can
be kept simple(r). IIRC there are also some issues around relcache
invalidation handling and time travel that make it desirable to wait until
commit before building a snapshot and decoding, but I haven't dug into the
details. Andres is the person who knows that area best.

As for why we don't return control to the walsender between change
callbacks when processing a reorder buffer at commit time, I'm not really
sure but suspect it's mostly down to easy API and development. If control
returned to the walsender between each change we'd need an async api for
the reorder buffer where you test to see if there's more unprocessed work
and call back into the reorder buffer again if there is. So the reorder
buffer has to keep state for the progress of replaying a commit in a
separate struct, handle repeated calls to process work, etc. Also, since
many individual changes are very small that could lead to a fair bit of
overhead; it'd probably want to process a batch of changes then return.
Which makes it even more complicated.

If it returned control to the caller between changes then each caller would
also need to have the logic to test for more work and call back into the
reorder buffer. Both the walsender and SQL interface would need it. The way
it is, the loop is just in one place.

It probably makes more sense to have a callback that can test state and
abort processing, like you've introduced. The callback could probably even
periodically check to see if there's client input to process and consume it.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Does Type Have = Operator?

2016-05-10 Thread Peter Eisentraut

On 5/10/16 9:16 PM, David G. Johnston wrote:

Brute force: you'd have to query pg_amop and note the absence of a row
with a btree (maybe hash too...) family strategy 3 (1 for hash)
[equality] where the left and right types are the same and match the
type in question.


While these are good thoughts, the implementation of DISTINCT actually 
looks for an operator that is literally named "=".
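A probe along the lines David describes, restricted to what DISTINCT
actually requires, might look like this (a sketch; 'mytype' is a
hypothetical stand-in for the type in question):

```sql
-- Does the type have a same-type operator literally named "="?
SELECT o.oid::regoperator
FROM pg_catalog.pg_operator o
WHERE o.oprname = '='
  AND o.oprleft = o.oprright
  AND o.oprleft = 'mytype'::regtype;   -- hypothetical type name
-- Zero rows here should mean DISTINCT (and IS [NOT] DISTINCT FROM)
-- will fail for that type.
```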


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Accurate list of Keywords / Datatypes?

2016-05-10 Thread Robins Tharakan
On 8 May 2016 at 07:38, Euler Taveira  wrote:

> src/include/parser/kwlist.h


​Thanks Everyone.

kwlist.h got me going as to where to look, but I had to improvise (quite) a
bit. Any and all advice about other authoritative lists (like kwlist.h)
would be a big help, since currently much of this is kludgy.

Created a GitHub Repo in case someone's interested (
https://github.com/robins/npp-lang-plpgsql).

​For PLPGSQL, (src/pl/plpgsql/src/pl_scanner.c)​

For DataTypes, (src/interfaces/ecpg/preproc/c_keywords.c
; src/backend/bootstrap/bootstrap.c) as well as parse all HTML pages under
http://www.postgresql.org/docs/devel/static/datatype.html.

For Config Parameters (recovery.sample.conf ; postgresql.sample.conf) as
well as parse all HTML pages under
http://www.postgresql.org/docs/devel/static/runtime-config.html ;
http://www.postgresql.org/docs/devel/static/libpq-connect.html

For Extensions (Parse HTML:
http://www.postgresql.org/docs/devel/static/contrib.html)

-
robins


Re: [HACKERS] alter table alter column ... (larger type) ... when there are dependent views

2016-05-10 Thread Tom Lane
Euler Taveira  writes:
> On 10-05-2016 20:59, Rob Bygrave wrote:
>> Having read all the previous discussions on the general topic of
>> altering tables with dependent views I realise this is a complex and
>> difficult issue in general but I'd like to see if there was some support
>> for looking at these 3 more specific changes. 
>> 1. making a varchar column larger  e.g. varchar(10) -> varchar(20)
>> 2. changing int to bigint
>> 3. changing varchar to text

> It seems a useful feature request.

FWIW, I think thinking of it like that will almost certainly fall foul
of Polya's Inventor's Paradox.  A solution, if practical at all, will
very likely be simpler as well as more useful if it's datatype
independent.

You should look at the code in ALTER TABLE that tries to rebuild index
definitions during ALTER COLUMN TYPE, and see if that can be adapted
to updating views.

regards, tom lane




Re: [HACKERS] Accurate list of Keywords / Datatypes?

2016-05-10 Thread Euler Taveira
On 10-05-2016 22:22, David G. Johnston wrote:
> I don't know how the docs, this function, and the source code relate to
> each other.
> 
The function is built around the ScanKeywords structure, which in turn is
built from parser/kwlist.h.


-- 
   Euler Taveira   Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento




Re: [HACKERS] Accurate list of Keywords / Datatypes?

2016-05-10 Thread Michael Paquier
On Wed, May 11, 2016 at 10:22 AM, David G. Johnston
 wrote:
> On Saturday, May 7, 2016, Euler Taveira  wrote:
>>
>> On 07-05-2016 22:53, Robins Tharakan wrote:
>> > Should I be looking somewhere else? Parse keywords from Git Source file
>> > (if so where)? Parse PG Documentation?
>> >
>> src/include/parser/kwlist.h
>>
>>
>
>  http://www.postgresql.org/docs/9.5/interactive/functions-info.html
>
> SELECT * FROM pg_get_keywords();
>
> I don't know how the docs, this function, and the source code relate to each
> other.

ScanKeywords is fed directly from kwlist.h; see, for example, the top of
src/common/keywords.c. NumScanKeywords is then calculated from the
generated list, and pg_get_keywords in turn makes use of those generated
structures.
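The runtime view of that kwlist.h-derived list is available from SQL, e.g.:

```sql
-- pg_get_keywords() exposes the ScanKeywords array built from kwlist.h.
SELECT word, catcode, catdesc
FROM pg_get_keywords()
WHERE catcode = 'R'          -- fully reserved keywords only
ORDER BY word
LIMIT 5;
```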
-- 
Michael




Re: [HACKERS] Reviewing freeze map code

2016-05-10 Thread Masahiko Sawada
On Tue, May 10, 2016 at 11:30 PM, Robert Haas  wrote:
> On Mon, May 9, 2016 at 7:40 PM, Ants Aasma  wrote:
>> On Mon, May 9, 2016 at 10:53 PM, Robert Haas  wrote:
>>> On Sun, May 8, 2016 at 10:42 PM, Masahiko Sawada  
>>> wrote:
 Attached draft patch adds SCANALL option to VACUUM in order to scan
 all pages forcibly while ignoring visibility map information.
 The option name is SCANALL for now but we could change it after got 
 consensus.
>>>
>>> If we're going to go that way, I'd say it should be scan_all rather
>>> than scanall.  Makes it clearer, at least IMHO.
>>
>> Just to add some diversity to opinions, maybe there should be a
>> separate command for performing integrity checks. Currently the best
>> ways to actually verify database correctness do so as a side effect.
>> The question that I get pretty much every time after I explain why we
>> have data checksums, is "how do I check that they are correct" and we
>> don't have a nice answer for that now. We could also use some ways to
>> sniff out corrupted rows that don't involve crashing the server in a
>> loop. Vacuuming pages that supposedly don't need vacuuming just to
>> verify integrity seems very much in the same vein.
>>
>> I know right now isn't exactly the best time to hastily slap on such a
>> feature, but I just wanted the thought to be out there for
>> consideration.
>
> I think that it's quite reasonable to have ways of performing an
> integrity check that are separate from VACUUM, but this is about
> having a way to force VACUUM to scan all-frozen pages

Or a second way I came up with is having a tool to remove a particular _vm
file safely, executed via SQL or a client tool like pg_resetxlog.

Please find attached the updated VACUUM SCAN_ALL patch.

Regards,

--
Masahiko Sawada
diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 19fd748..8f63fad 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -21,9 +21,9 @@ PostgreSQL documentation
 
  
 
-VACUUM [ ( { FULL | FREEZE | VERBOSE | ANALYZE } [, ...] ) ] [ table_name [ (column_name [, ...] ) ] ]
-VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ table_name ]
-VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] ANALYZE [ table_name [ (column_name [, ...] ) ] ]
+VACUUM [ ( { FULL | FREEZE | VERBOSE | ANALYZE | SCAN_ALL } [, ...] ) ] [ table_name [ (column_name [, ...] ) ] ]
+VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ SCAN_ALL ] [ table_name ]
+VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ SCAN_ALL ] ANALYZE [ table_name [ (column_name [, ...] ) ] ]
 
  
 
@@ -120,6 +120,17 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] ANALYZE [ 
     </listitem>
    </varlistentry>
    <varlistentry>
+    <term><literal>SCAN_ALL</literal></term>
+    <listitem>
+     <para>
+      Selects forcibly full page scanning vacuum while ignoring visibility map.
+      Forcibly full page scanning vacuum is always performed when the table is
+      rewritten so this option is redundant when FULL is specified.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
     <term><literal>ANALYZE</literal></term>
     <listitem>
      <para>
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 426e756..eee93c4 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -138,7 +138,7 @@ static BufferAccessStrategy vac_strategy;
 
 /* non-export function prototypes */
 static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
-			   Relation *Irel, int nindexes, bool aggressive);
+			   Relation *Irel, int nindexes, bool aggressive, bool scan_all);
 static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup);
 static void lazy_vacuum_index(Relation indrel,
@@ -185,6 +185,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
 	double		read_rate,
 write_rate;
 	bool		aggressive;		/* should we scan all unfrozen pages? */
+	bool		scan_all;		/* should we scan all pages forcibly? */
 	bool		scanned_all_unfrozen;	/* actually scanned all such pages? */
 	TransactionId xidFullScanLimit;
 	MultiXactId mxactFullScanLimit;
@@ -233,6 +234,9 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
 	aggressive |= MultiXactIdPrecedesOrEquals(onerel->rd_rel->relminmxid,
 			  mxactFullScanLimit);
 
+	/* If SCAN_ALL option is specified, we have to scan all pages forcibly */
+	scan_all = options & VACOPT_SCANALL;
+
 	vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));
 
 	vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
@@ -246,14 +250,14 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
 	vacrelstats->hasindex = (nindexes > 0);
 
 	/* Do the vacuuming */
-	lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, aggressive);
+	lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, aggressive, scan_all);
 
 	/* Done with indexes */
 	vac_close_indexes(nindexes, Irel, NoLock);
 
 	/*
-	 * Compute whether we actually scanned the whole relation. If we did, we
-	 

Re: [HACKERS] Does Type Have = Operator?

2016-05-10 Thread Euler Taveira
On 10-05-2016 22:28, David G. Johnston wrote:
> Technically "is not distinct from" would be more correct.
> 
Ooops. Fat fingered the statement. Also, forgot to consider null case.

euler=# \pset null 'NULL'
Null display is "NULL".
euler=# select x.a, y.b, x.a IS NOT DISTINCT FROM y.b AS "INDF", x.a =
y.b AS "=" FROM (VALUES (3), (6), (NULL)) AS x (a), (VALUES (3), (6),
(NULL)) AS y (b);
  a   |  b   | INDF |  =
------+------+------+------
    3 |    3 | t    | t
    3 |    6 | f    | f
    3 | NULL | f    | NULL
    6 |    3 | f    | f
    6 |    6 | t    | t
    6 | NULL | f    | NULL
 NULL |    3 | f    | NULL
 NULL |    6 | f    | NULL
 NULL | NULL | t    | NULL
(9 rows)






Re: [HACKERS] alter table alter column ... (larger type) ... when there are dependent views

2016-05-10 Thread Euler Taveira
On 10-05-2016 20:59, Rob Bygrave wrote:
> Having read all the previous discussions on the general topic of
> altering tables with dependent views I realise this is a complex and
> difficult issue in general but I'd like to see if there was some support
> for looking at these 3 more specific changes. 
> 1. making a varchar column larger  e.g. varchar(10) -> varchar(20)
> 2. changing int to bigint
> 3. changing varchar to text
> 
It seems a useful feature request. I've already crossed that bridge
while maintaining such dynamic database schemas. I don't see why we
couldn't implement this feature. I'll take a stab at it for the first
commitfest.
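For completeness, the workaround today is to drop and recreate the
dependent views around the ALTER inside one transaction (a sketch with
hypothetical object names):

```sql
BEGIN;
DROP VIEW v_orders;                           -- hypothetical dependent view
ALTER TABLE orders ALTER COLUMN note TYPE varchar(20);
CREATE VIEW v_orders AS
    SELECT id, note FROM orders;
COMMIT;
```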

> I have seen that there are people motivated enough to update
> pg_attribute directly (update pg_attribute a set a.atttypmod = 20 + 4 ...).
> 
Yuk, this definitely bypasses ATExecAlterColumnType(). It is a hack
and must be done with care.


-- 
   Euler Taveira   Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento




Re: [HACKERS] release management team statement on patch reverts

2016-05-10 Thread Bruce Momjian
On Wed, May  4, 2016 at 01:50:39PM -0700, Andres Freund wrote:
> I also want to reiterate that I didn't immediately call for a revert,
> initially - before recognizing the architectural issue - I offered to
> write code to address the regressions due to the spinlocks.

It was the same in my case --- I didn't call for the patch to be reverted,
but rather for it to be considered for reversion:

   "I also strongly question whether we should revert this feature ..."

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] Does Type Have = Operator?

2016-05-10 Thread David G. Johnston
On Tuesday, May 10, 2016, Euler Taveira  wrote:
>
> Also, IS DISTINCT FROM is an alias for = operator per standard IIRC.
>

Technically "is not distinct from" would be more correct.  "Alias" implies
exact equivalence, while in the presence of nulls the two behave differently.

"is distinct from" ~ "<>" which is the canonical form (alias) for "!="

http://www.postgresql.org/docs/9.5/interactive/functions-comparison.html
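The difference can be modeled outside SQL as well. Below is an illustrative
Python sketch (the function names are invented for this example) of the
three-valued logic involved: "=" yields UNKNOWN (here None) whenever an
operand is NULL, while IS NOT DISTINCT FROM always yields true or false:

```python
# Toy model of SQL three-valued equality; Python's None stands in for
# both SQL NULL (as an operand) and UNKNOWN (as a result).

def sql_eq(a, b):
    """SQL '=': UNKNOWN (None) if either operand is NULL."""
    if a is None or b is None:
        return None
    return a == b

def is_not_distinct_from(a, b):
    """SQL IS NOT DISTINCT FROM: two NULLs compare equal, never UNKNOWN."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

for a, b in [(3, 3), (3, None), (None, None)]:
    print(a, b, sql_eq(a, b), is_not_distinct_from(a, b))
```

This matches the psql truth table earlier in the thread: the "=" column goes
to NULL whenever an operand is NULL, but the INDF column never does.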

David J.


Re: [HACKERS] Accurate list of Keywords / Datatypes?

2016-05-10 Thread David G. Johnston
On Saturday, May 7, 2016, Euler Taveira  wrote:

> On 07-05-2016 22:53, Robins Tharakan wrote:
> > Should I be looking somewhere else? Parse keywords from Git Source file
> > (if so where)? Parse PG Documentation?
> >
> src/include/parser/kwlist.h


http://www.postgresql.org/docs/9.5/interactive/functions-info.html

SELECT * FROM pg_get_keywords();

I don't know how the docs, this function, and the source code relate to
each other.

David J.


Re: [HACKERS] Does Type Have = Operator?

2016-05-10 Thread Euler Taveira
On 10-05-2016 21:12, David E. Wheeler wrote:
> This makes sense, of course, and I could fix it by comparing text
> values instead of json values when the values are JSON. But of course
> the lack of a = operator is not limited to JSON. So I’m wondering if
> there’s an interface at the SQL level to tell me whether a type has
> an = operator? That way I could always use text values in those
> situations.
> 
There isn't an equality indicator in the catalogs. You could try "SELECT
oprname FROM pg_operator WHERE oprcode::text ~ 'eq'" but it is too
fragile. You could also try oprname, oprrest, or oprjoin, but the result
is worse than the former solution. You definitely need a hack.

Also, IS DISTINCT FROM is an alias for = operator per standard IIRC.


-- 
   Euler Taveira   Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento




Re: [HACKERS] Does Type Have = Operator?

2016-05-10 Thread David G. Johnston
On Tuesday, May 10, 2016, David E. Wheeler  wrote:

>
> This makes sense, of course, and I could fix it by comparing text values
> instead of json values when the values are JSON. But of course the lack of
> a = operator is not limited to JSON. So I’m wondering if there’s an
> interface at the SQL level to tell me whether a type has an = operator?
> That way I could always use text values in those situations.
>
>
http://www.postgresql.org/docs/9.5/interactive/catalog-pg-amop.html
http://www.postgresql.org/docs/9.5/interactive/xindex.html

Brute force: you'd have to query pg_amop and note the absence of a row with
a btree (maybe hash too...) family strategy 3 (1 for hash)
[equality] where the left and right types are the same and match the type
in question.

There is likely more to it - though absence is pretty much a given, I'd be
concerned about false negatives due to ignoring other factors like
"amoppurpose".

In theory you should be able to trade off convenience for correctness by
calling:

to_regoperator('=(type,type)')

But I've never tried it and it assumes that = is the equality operator and
that its presence is sufficient.  I'm also guessing on the text type name
syntax.

http://www.postgresql.org/docs/9.5/interactive/functions-info.html

This option is relatively new, from what I remember.

David J.


Re: [HACKERS] Does Type Have = Operator?

2016-05-10 Thread Tom Lane
"David E. Wheeler"  writes:
> pgTAP has a function that compares two values of a given type, which it uses 
> for comparing column defaults. It looks like this:

> CREATE OR REPLACE FUNCTION _def_is( TEXT, TEXT, anyelement, TEXT )
> RETURNS TEXT AS $$

Given that you're coercing both one input value and the result to text,
I don't understand why you don't just compare the text representations.

I'm also not very clear on what you mean by "comparing column defaults".
A column default is an expression (in the general case anyway), not just
a value of the type.

Maybe if you'd shown us the is() function, as well as a typical usage
of _def_is(), this would be less opaque.

regards, tom lane




Re: [HACKERS] HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

2016-05-10 Thread Tom Lane
Jeff Janes  writes:
> But, another perhaps stupid question, why do we care what the value of
> nextOid was at the start of the last successfully completed
> checkpoint?

The intended use of that field is to restore nextOid before replaying
WAL.  So it should correspond to the value at checkpoint start.  I think
what the code is doing now during replay is probably just brain fade :-(

> Wouldn't it make more sense to populate that part of the
> record at the end, not the start, of the checkpoint?

The idea is that we should initialize the OID counter to something
approximately right before we start replay, so that replay of OID-
counter-advance records behaves sensibly.

> But, to round this out, all of this is formally only a hint on where
> to start OIDs so as to avoid performance problems as you busy-loop
> looking for an unused OID.  The real correctness bug is in the
> hint-bit problem you discuss elsewhere that turns collisions into
> corruption, and this bug just makes that other one reachable?

Right, the theory is that being slightly off shouldn't matter.
If that's wrong, we would have race-condition problems with either
intended timing of sampling the OID counter.

regards, tom lane




Re: [HACKERS] ALTER TABLE lock downgrades have broken pg_upgrade

2016-05-10 Thread Bruce Momjian
On Fri, May  6, 2016 at 03:32:23PM -0400, Tom Lane wrote:
> > I think possibly the easiest fix for this is to have pg_upgrade,
> > instead of RESETting a nonexistent option, RESET something that's
> > still considered to require AccessExclusiveLock.  "user_catalog_table"
> > would work, looks like; though I'd want to annotate its entry in
> > reloptions.c to warn people away from downgrading its lock level.
> 
> I tried fixing it like that.  The alternate RESET target had behaved as
> expected when I'd tested by hand, but in pg_upgrade it still fails,
> only now with
> 
> Creating newly-required TOAST tablesSQL command failed
> ALTER TABLE "public"."i_need_a_toast_table" RESET (user_catalog_table);
> ERROR:  pg_type OID value not set when in binary upgrade mode

I think this means that the ALTER TABLE RESET is adding or potentially
adding a pg_type row, and no one called
binary_upgrade_set_next_pg_type_oid().

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

2016-05-10 Thread Andres Freund
On 2016-05-10 17:36:06 -0700, Jeff Janes wrote:
> OK, got it.  I don't know how I missed the bigger picture of that
> function in the first place.
> 
> But, another perhaps stupid question, why do we care what the value of
> nextOid was at the start of the last successfully completed
> checkpoint?  Wouldn't it make more sense to populate that part of the
> record at the end, not the start, of the checkpoint?

I don't think that really makes a large difference. Unless we'd hold
OidGenLock across the XLogInsert() - which we don't want to do for
performance / deadlock reasons - we'd have a similar race.   I think
it's simply important to recognize (and better document!) that the
values in the checkpoint record are from roughly the time of the REDO
pointer and not from the end of the checkpoint.


> >> I don't understand how it could be out-of-date at that point.  But
> >> obviously it is.
> >
> > A checkpoint logically "starts" at 1) in the above abbreviated
> > CreateCheckPoint(), that's where recovery starts when starting up from
> > that checkpoint. But inbetween 1) and 4) all other backends can continue
> > to insert WAL, and it'll be replayed *before* the checkpoint's record
> > itself.  That means that if some backend generates a NEXTOID record
> > between 2) and 4) (with larger checkpoints we're looking at minutes to
> > an hour of time), its effects will temporarily take effect (as in
> > ShmemVariableCache->nextOid is updated), but XLOG_CHECKPOINT_ONLINE's
> > replay will overwrite it unconditionally:
> > void
> > xlog_redo(XLogReaderState *record)
> > {
> > else if (info == XLOG_CHECKPOINT_ONLINE)
> > {
> > ...
> > /* ... but still treat OID counter as exact */
> > LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
> > ShmemVariableCache->nextOid = checkPoint.nextOid;
> > ShmemVariableCache->oidCount = 0;
> > LWLockRelease(OidGenLock);
> >
> > Makes sense?
> 
> So it seems like we should unconditionally **not** do that when
> replaying XLOG_CHECKPOINT_ONLINE, right?

I think we should actually just not do it during replay *at all*. It
seems entirely sufficient to use ->nextOid in StartupXLOG when it has
found the checkpoint to start up from:
/* initialize shared memory variables from the checkpoint record */
ShmemVariableCache->nextXid = checkPoint.nextXid;
ShmemVariableCache->nextOid = checkPoint.nextOid;
ShmemVariableCache->oidCount = 0;


FWIW, I've commented out the relevant sections from xlog_redo and since
then I've not been able to reproduce the issue.


> But, to round this out, all of this is formally only a hint on where
> to start OIDs so as to avoid performance problems as you busy-loop
> looking for an unused OID.  The real correctness bug is in the
> hint-bit problem you discuss elsewhere that turns collisions into
> corruption, and this bug just makes that other one reachable?

That's my opinion at least.

Greetings,

Andres Freund




Re: [HACKERS] Does Type Have = Operator?

2016-05-10 Thread Fabrízio de Royes Mello
On Tuesday, May 10, 2016, David E. Wheeler wrote:

> Hackers,
>
> pgTAP has a function that compares two values of a given type, which it
> uses for comparing column defaults. It looks like this:
>
> CREATE OR REPLACE FUNCTION _def_is( TEXT, TEXT, anyelement, TEXT )
> RETURNS TEXT AS $$
> DECLARE
> thing text;
> BEGIN
> IF $1 ~ '^[^'']+[(]' THEN
> -- It's a functional default.
> RETURN is( $1, $3, $4 );
> END IF;
>
> EXECUTE 'SELECT is('
>  || COALESCE($1, 'NULL' || '::' || $2) || '::' || $2 || ', '
>  || COALESCE(quote_literal($3), 'NULL') || '::' || $2 || ', '
>  || COALESCE(quote_literal($4), 'NULL')
> || ')' INTO thing;
> RETURN thing;
> END;
> $$ LANGUAGE plpgsql;
>
> The is() function does an IS DISTINCT FROM to compare the two values
> passed to it. This has been working pretty well for years, but one place it
> doesn’t work is with JSON values. I get:
>
> LINE 1: SELECT NOT $1 IS DISTINCT FROM $2
>   ^
> HINT:  No operator matches the given name and argument type(s). You
> might need to add explicit type casts.
> QUERY:  SELECT NOT $1 IS DISTINCT FROM $2
>
> This makes sense, of course, and I could fix it by comparing text values
> instead of json values when the values are JSON. But of course the lack of
> a = operator is not limited to JSON. So I’m wondering if there’s an
> interface at the SQL level to tell me whether a type has an = operator?
> That way I could always use text values in those situations.
>
>
Isn't searching for the operator in the pg_operator catalog enough?

Regards,



-- 
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello


Re: [HACKERS] ALTER TABLE lock downgrades have broken pg_upgrade

2016-05-10 Thread Bruce Momjian
On Tue, May  3, 2016 at 12:07:51PM -0400, Tom Lane wrote:
> I think possibly the easiest fix for this is to have pg_upgrade,
> instead of RESETting a nonexistent option, RESET something that's
> still considered to require AccessExclusiveLock.  "user_catalog_table"
> would work, looks like; though I'd want to annotate its entry in
> reloptions.c to warn people away from downgrading its lock level.
> 
> More generally, though, I wonder how we can have some test coverage
> on such cases going forward.  Is the patch below too ugly to commit
> permanently, and if so, what other idea can you suggest?

FYI, I only test _supported_ version combinations for pg_upgrade, i.e. I
don't test pg_upgrade _from_ unsupported versions, though I can see why
maybe I should.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Andres Freund
On 2016-05-11 03:20:12 +0300, Ants Aasma wrote:
> On Tue, May 10, 2016 at 7:56 PM, Robert Haas  wrote:
> > On Mon, May 9, 2016 at 8:34 PM, David Rowley
> >  wrote:
> > I don't have any at the moment, but I'm not keen on hundreds of new
> > vector functions that can all have bugs or behavior differences versus
> > the unvectorized versions of the same code.  That's a substantial tax
> > on future development.  I think it's important to understand what
> > sorts of queries we are targeting here.  KaiGai's GPU-acceleration
> > stuff does great on queries with complex WHERE clauses, but most
> > people don't care not only because it's out-of-core but because who
> > actually looks for the records where (a + b) % c > (d + e) * f / g?
> > This seems like it has the same issue.  If we can speed up common
> > queries people are actually likely to run, OK, that's interesting.
> 
> I have seen pretty complex expressions in projection and
> aggregation. A couple dozen SUM(CASE WHEN a THEN b*c ELSE MIN(d,e)*f
> END) type expressions. In critical places I had to replace them with
> a C-coded function that processed a row at a time to avoid the
> executor dispatch overhead.

I've seen that as well, but was it the actual fmgr indirection causing
the overhead, or was it ExecQual/ExecMakeFunctionResultNoSets et al?

Greetings,

Andres Freund




Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Andres Freund
On 2016-05-10 12:56:17 -0400, Robert Haas wrote:
> I suspect the number of queries that are being hurt by fmgr overhead
> is really large, and I think it would be nice to attack that problem
> more directly.  It's a bit hard to discuss what's worthwhile in the
> abstract, without performance numbers, but when you vectorize, how
> much is the benefit from using SIMD instructions and how much is the
> benefit from just not going through the fmgr every time?

I think fmgr overhead is an issue, but in most profiles of execution
heavy loads I've seen the bottlenecks are elsewhere. They often seem to
roughly look like
+   15.47%  postgres  postgres   [.] slot_deform_tuple
+   12.99%  postgres  postgres   [.] slot_getattr
+   10.36%  postgres  postgres   [.] ExecMakeFunctionResultNoSets
+    9.76%  postgres  postgres   [.] heap_getnext
+    6.34%  postgres  postgres   [.] HeapTupleSatisfiesMVCC
+    5.09%  postgres  postgres   [.] heapgetpage
+    4.59%  postgres  postgres   [.] hash_search_with_hash_value
+    4.36%  postgres  postgres   [.] ExecQual
+    3.30%  postgres  postgres   [.] ExecStoreTuple
+    3.29%  postgres  postgres   [.] ExecScan

or

-   33.67%  postgres  postgres   [.] ExecMakeFunctionResultNoSets
   - ExecMakeFunctionResultNoSets
  + 99.11% ExecEvalOr
  + 0.89% ExecQual
+   14.32%  postgres  postgres   [.] slot_getattr
+    5.66%  postgres  postgres   [.] ExecEvalOr
+    5.06%  postgres  postgres   [.] check_stack_depth
+    5.02%  postgres  postgres   [.] slot_deform_tuple
+    4.05%  postgres  postgres   [.] pgstat_end_function_usage
+    3.69%  postgres  postgres   [.] heap_getnext
+    3.41%  postgres  postgres   [.] ExecEvalScalarVarFast
+    3.36%  postgres  postgres   [.] ExecEvalConst


with a healthy dose of _bt_compare, heap_hot_search_buffer in more index
heavy workloads.

(yes, I just pulled these example profiles from somewhere, but I've more
often seen them look like this than very fmgr-heavy).


That seems to suggest that we need to restructure how we get to calling
fmgr functions, before worrying about the actual fmgr call.


Tomas, Mark, IIRC you'd both generated perf profiles for TPC-H (IIRC?)
queries at some point. Any chance the results are online somewhere?

Greetings,

Andres Freund




Re: [HACKERS] HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

2016-05-10 Thread Jeff Janes
On Tue, May 10, 2016 at 4:02 PM, Andres Freund  wrote:
> On 2016-05-10 15:53:38 -0700, Jeff Janes wrote:
>>
>> But isn't CreateCheckPoint called at the end of the checkpoint, not
>> the start of it?
>
> No, CreateCheckPoint() does it all.
>
>
> CreateCheckPoint(int flags)
> {
> ...
> /* 1) determine redo pointer */
> WALInsertLockAcquireExclusive();
> curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
> prevPtr = XLogBytePosToRecPtr(Insert->PrevBytePos);
> WALInsertLockRelease();
> ...
> /* 2) determine oid */
> LWLockAcquire(OidGenLock, LW_SHARED);
> checkPoint.nextOid = ShmemVariableCache->nextOid;
> if (!shutdown)
> checkPoint.nextOid += ShmemVariableCache->oidCount;
> LWLockRelease(OidGenLock);
> ...
> /* 3) actually checkpoint shared_buffers et al. */
> CheckPointGuts(checkPoint.redo, flags);
> ...
> /* 4) log the checkpoint */
> recptr = XLogInsert(RM_XLOG_ID,
> shutdown ? XLOG_CHECKPOINT_SHUTDOWN :
> XLOG_CHECKPOINT_ONLINE);
> ...
> }

OK, got it.  I don't know how I missed the bigger picture of that
function in the first place.

But, another perhaps stupid question, why do we care what the value of
nextOid was at the start of the last successfully completed
checkpoint?  Wouldn't it make more sense to populate that part of the
record at the end, not the start, of the checkpoint?


>
>
>> I don't understand how it could be out-of-date at that point.  But
>> obviously it is.
>
> A checkpoint logically "starts" at 1) in the above abbreviated
> CreateCheckPoint(), that's where recovery starts when starting up from
> that checkpoint. But inbetween 1) and 4) all other backends can continue
> to insert WAL, and it'll be replayed *before* the checkpoint's record
> itself.  That means that if some backend generates a NEXTOID record
> between 2) and 4) (with larger checkpoints we're looking at minutes to
> an hour of time), its effects will temporarily take effect (as in
> ShmemVariableCache->nextOid is updated), but XLOG_CHECKPOINT_ONLINE's
> replay will overwrite it unconditionally:
> void
> xlog_redo(XLogReaderState *record)
> {
> else if (info == XLOG_CHECKPOINT_ONLINE)
> {
> ...
> /* ... but still treat OID counter as exact */
> LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
> ShmemVariableCache->nextOid = checkPoint.nextOid;
> ShmemVariableCache->oidCount = 0;
> LWLockRelease(OidGenLock);
>
> Makes sense?

So it seems like we should unconditionally **not** do that when
replaying XLOG_CHECKPOINT_ONLINE, right?

If this is the checkpoint record at which we started the system then
we would have done that "replay" into ShmemVariableCache->nextOid
during StartupXLOG(), there is no reason to do it again when redo
passes through that record the 2nd time.

But, to round this out, all of this is formally only a hint on where
to start OIDs so as to avoid performance problems as you busy-loop
looking for an unused OID.  The real correctness bug is in the
hint-bit problem you discuss elsewhere that turns collisions into
corruption, and this bug just makes that other one reachable?

Cheers,

Jeff




Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Andres Freund
On 2016-05-10 12:34:19 +1200, David Rowley wrote:
> a. Modify ScanAPI to allow batch tuple fetching in predefined batch sizes.
> b. Modify TupleTableSlot to allow > 1 tuple to be stored. Add flag to
> indicate if the struct contains a single or a multiple tuples.
> Multiple tuples may need to be deformed in a non-lazy fashion in order
> to prevent too many buffers from having to be pinned at once. Tuples
> will be deformed into arrays of each column rather than arrays for
> each tuple (this part is important to support the next sub-project)

FWIW, I don't think that's necessarily required, and it has the
potential to slow down some operations (like target list
processing/projections) considerably. By the time vectored execution for
postgres is ready, gather instructions should be common and fast enough
(IIRC they started to be ok with broadwells, and are better in skylake;
other archs had them for longer).


> c. Modify some nodes (perhaps start with nodeAgg.c) to allow them to
> process a batch TupleTableSlot. This will require some tight loop to
> aggregate the entire TupleTableSlot at once before returning.
> d. Add function in execAmi.c which returns true or false depending on
> if the node supports batch TupleTableSlots or not.
> e. At executor startup determine if the entire plan tree supports
> batch TupleTableSlots, if so enable batch scan mode.

It doesn't really need to be the entire tree. Even if you have a subtree
(say a parametrized index nested loop join) which doesn't support batch
mode, you'll likely still see performance benefits by building a batch
one layer above the non-batch-supporting node.


Greetings,

Andres Freund




Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Ants Aasma
On Tue, May 10, 2016 at 7:56 PM, Robert Haas  wrote:
> On Mon, May 9, 2016 at 8:34 PM, David Rowley
>  wrote:
> I don't have any at the moment, but I'm not keen on hundreds of new
> vector functions that can all have bugs or behavior differences versus
> the unvectorized versions of the same code.  That's a substantial tax
> on future development.  I think it's important to understand what
> sorts of queries we are targeting here.  KaiGai's GPU-acceleration
> stuff does great on queries with complex WHERE clauses, but most
> people don't care not only because it's out-of-core but because who
> actually looks for the records where (a + b) % c > (d + e) * f / g?
> This seems like it has the same issue.  If we can speed up common
> queries people are actually likely to run, OK, that's interesting.

I have seen pretty complex expressions in projection and
aggregation. A couple dozen SUM(CASE WHEN a THEN b*c ELSE MIN(d,e)*f
END) type expressions. In critical places I had to replace them with
a C-coded function that processed a row at a time to avoid the
executor dispatch overhead.

> By the way, I think KaiGai's GPU-acceleration stuff points to another
> pitfall here.  There's other stuff somebody might legitimately want to
> do that requires another copy of each function. For example, run-time
> code generation likely needs that (a function to tell the code
> generator what to generate for each of our functions), and
> GPU-acceleration probably does, too.  If fixing a bug in numeric_lt
> requires changing not only the regular version and the vectorized
> version but also the GPU-accelerated version and the codegen version,
> Tom and Dean are going to kill us.  And justifiably so!  Granted,
> nobody is proposing those other features in core right now, but
> they're totally reasonable things to want to do.

My thoughts in this area have been circling around getting LLVM to do
the heavy lifting. LLVM/clang could compile existing C functions to IR
and bundle those with the DB. At query planning time or maybe even
during execution the functions can be inlined into the compiled query
plan, LLVM can then be coaxed to copy propagate, constant fold and
dead code eliminate the bejeezus out of the expression tree. This way
duplication of the specialized code can be kept to a minimum while at
least the common cases can completely avoid the fmgr overhead.
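A loose analogy of that inlining step, in Python rather than LLVM IR (all
names here are invented): an interpreted expression tree pays a dispatch per
node per row, while "compiling" the tree once into nested closures removes
that dispatch from the per-row path:

```python
# Tree-walking interpreter: dispatches on the node tag for every node,
# for every row evaluated.
def interpret(node, row):
    op = node[0]
    if op == "const":
        return node[1]
    if op == "var":
        return row[node[1]]
    if op == "add":
        return interpret(node[1], row) + interpret(node[2], row)
    if op == "mul":
        return interpret(node[1], row) * interpret(node[2], row)
    raise ValueError(op)

# One-time "compilation": each node is visited once, producing nested
# closures; per-row evaluation is then just direct calls.
def compile_expr(node):
    op = node[0]
    if op == "const":
        v = node[1]
        return lambda row: v
    if op == "var":
        i = node[1]
        return lambda row: row[i]
    lhs, rhs = compile_expr(node[1]), compile_expr(node[2])
    if op == "add":
        return lambda row: lhs(row) + rhs(row)
    if op == "mul":
        return lambda row: lhs(row) * rhs(row)
    raise ValueError(op)

expr = ("add", ("mul", ("var", 0), ("const", 2)), ("var", 1))  # a*2 + b
fn = compile_expr(expr)
print(fn((3, 4)))  # 10
```

A real LLVM-based version would go further: inline the operator bodies
themselves, constant fold, and eliminate dead branches, which closures alone
cannot do.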

This approach would also mesh together fine with batching. Given
suitably regular data structures and simple functions, LLVM will be
able to vectorize the code. If not it will still end up with a nice
tight loop that is an order of magnitude or two faster than the
current executor.

The first cut could take care of ExecQual, ExecTargetList and friends.
Later improvements could let execution nodes provide basic blocks that
would then be threaded together into the main execution loop. If some
node does not implement the basic block interface a default
implementation is used that calls the current interface. It gets a bit
handwavy at this point, but the main idea would be to enable data
marshaling so that values can be routed directly to the code that
needs them without being written to intermediate storage.

> I suspect the number of queries that are being hurt by fmgr overhead
> is really large, and I think it would be nice to attack that problem
> more directly.  It's a bit hard to discuss what's worthwhile in the
> abstract, without performance numbers, but when you vectorize, how
> much is the benefit from using SIMD instructions and how much is the
> benefit from just not going through the fmgr every time?

My feeling is the same: fmgr overhead, data marshaling, and dynamic
dispatch through the executor are the big issues. This is corroborated
by what I have seen in other VM implementations. Once you get
the data into a uniform format where vectorized execution can be
used, the CPU execution resources are no longer the bottleneck. Memory
bandwidth gets in the way, unless each input value is used in multiple
calculations. And even then, we are looking at a 4x speedup at best.

Regards,
Ants Aasma




[HACKERS] Does Type Have = Operator?

2016-05-10 Thread David E. Wheeler
Hackers,

pgTAP has a function that compares two values of a given type, which it uses 
for comparing column defaults. It looks like this:

CREATE OR REPLACE FUNCTION _def_is( TEXT, TEXT, anyelement, TEXT )
RETURNS TEXT AS $$
DECLARE
thing text;
BEGIN
IF $1 ~ '^[^'']+[(]' THEN
-- It's a functional default.
RETURN is( $1, $3, $4 );
END IF;

EXECUTE 'SELECT is('
 || COALESCE($1, 'NULL' || '::' || $2) || '::' || $2 || ', '
 || COALESCE(quote_literal($3), 'NULL') || '::' || $2 || ', '
 || COALESCE(quote_literal($4), 'NULL')
|| ')' INTO thing;
RETURN thing;
END;
$$ LANGUAGE plpgsql;

The is() function does an IS DISTINCT FROM to compare the two values passed to 
it. This has been working pretty well for years, but one place it doesn’t work 
is with JSON values. I get:

LINE 1: SELECT NOT $1 IS DISTINCT FROM $2
  ^
HINT:  No operator matches the given name and argument type(s). You might 
need to add explicit type casts.
QUERY:  SELECT NOT $1 IS DISTINCT FROM $2

This makes sense, of course, and I could fix it by comparing text values 
instead of json values when the values are JSON. But of course the lack of a = 
operator is not limited to JSON. So I’m wondering if there’s an interface at 
the SQL level to tell me whether a type has an = operator? That way I could 
always use text values in those situations.

Thanks,

David






[HACKERS] alter table alter column ... (larger type) ... when there are dependent views

2016-05-10 Thread Rob Bygrave
I have read the prior discussions linked from
https://wiki.postgresql.org/wiki/Todo#Views_and_Rules

What I would like to do is put the specific case for handling 3 common
'alter column' changes when that column is referenced in a view.


Take the case of:
create table base_table ( id bigserial, acol varchar(10), bcol int, ccol
varchar(1000));

create or replace view base_view as select id, acol, bcol, ccol from
base_table;
create or replace view dep_view as select id, length(acol), bcol, ccol from
base_view;


I would like to review the 3 specific cases:

alter table base_table alter column acol type varchar(20);  -- was varchar(10)
alter table base_table alter column bcol type bigint;       -- was int
alter table base_table alter column ccol type text;         -- was varchar(1000)


At the moment these 3 statements all require that the views be dropped
first and then re-created last. The first case looks like:

-- for the varchar(10) to varchar(20) change
drop view dep_view;
drop view base_view;
alter table base_table alter column acol type varchar(20);
create or replace view base_view as select id, acol, bcol, ccol from base_table;
create or replace view dep_view as select id, length(acol), bcol, ccol from base_view;

In practical terms, this need to drop and re-create the views gets
harder with more view dependencies. With large, complex schemas
this becomes a rather large pain.

Having read all the previous discussions on the general topic of altering
tables with dependent views I realise this is a complex and difficult issue
in general but I'd like to see if there was some support for looking at
these 3 more specific changes.
1. making a varchar column larger  e.g. varchar(10) -> varchar(20)
2. changing int to bigint
3. changing varchar to text

I have seen that there are people motivated enough to update pg_attribute
directly (update pg_attribute a set a.atttypmod = 20 + 4 ...).

What are the thoughts on supporting these 3 specific cases?


Thanks, Rob.


Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Andres Freund
Hi,

On 2016-05-09 13:33:55 -0400, Robert Haas wrote:
> I think there are several different areas
> where we should consider major upgrades to our executor.  It's too
> slow and it doesn't do everything we want it to do.  The main things
> on my mind are:


3) We use a lot of very cache-inefficient datastructures.

Especially the pervasive use of linked lists in the executor is pretty
bad for performance. Every element is likely to incur cache misses,
every list element pretty much has its own cacheline (thereby reducing
the overall cache hit ratio), and they have a horrible allocation overhead
(both in space and in palloc runtime).


> 1. asynchronous execution, by which I mean the ability of a node to
> somehow say that it will generate a tuple eventually, but is not yet
> ready, so that the executor can go run some other part of the plan
> tree while it waits.  [...].  It is also a problem
> for parallel query: in a parallel sequential scan, the next worker can
> begin reading the next block even if the current block hasn't yet been
> received from the OS.  Whether or not this will be efficient is a
> research question, but it can be done.  However, imagine a parallel
> scan of a btree index: we don't know what page to scan next until we
> read the previous page and examine the next-pointer.  In the meantime,
> any worker that arrives at that scan node has no choice but to block.
> It would be better if the scan node could instead say "hey, thanks for
> coming but I'm really not ready to be on-CPU just at the moment" and
> potentially allow the worker to go work in some other part of the
> query tree.  For that worker to actually find useful work to do
> elsewhere, we'll probably need it to be the case either that the table
> is partitioned or the original query will need to involve UNION ALL,
> but those are not silly cases to worry about, particularly if we get
> native partitioning in 9.7.

I have to admit I'm not that convinced about the speedups in the !fdw
case. There seem to be a lot of easier avenues for performance
improvements.


> 2. vectorized execution, by which I mean the ability of a node to
> return tuples in batches rather than one by one.  Andres has opined
> more than once that repeated trips through ExecProcNode defeat the
> ability of the CPU to do branch prediction correctly, slowing the
> whole system down, and that they also result in poor CPU cache
> behavior, since we jump all over the place executing a little bit of
> code from each node before moving onto the next rather than running
> one bit of code first, and then another later.

FWIW, I've even hacked something up for a bunch of simple queries, and
the performance improvements were significant.  Besides it only being a
weekend hack project, the big thing I got stuck on was how exactly to
determine when to batch and when not to.


I'd personally say that the CPU-pipeline-defeating aspect is worse than
the effect of the cache/branch misses themselves. Today's CPUs are
heavily superscalar, and our instructions-per-cycle throughput shows
pretty clearly that we're not good at exploiting (micro-)instruction
parallelism. We're quite frequently well below one instruction/cycle.





> My proposal for how to do this is to make ExecProcNode function as a
> backward-compatibility wrapper.  For asynchronous execution, a node
> might return a not-ready-yet indication, but if that node is called
> via ExecProcNode, it means the caller isn't prepared to receive such
> an indication, so ExecProcNode will just wait for the node to become
> ready and then return the tuple.  Similarly, for vectorized execution,
> a node might return a bunch of tuples all at once.  ExecProcNode will
> extract the first one and return it to the caller, and subsequent
> calls to ExecProcNode will iterate through the rest of the batch, only
> calling the underlying node-specific function when the batch is
> exhausted.  In this way, code that doesn't know about the new stuff
> can continue to work pretty much as it does today.

I agree that that generally is a reasonable way forward.
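As a rough sketch of the compatibility wrapper described above -- in Python
rather than the executor's C, with invented class names, so purely
illustrative of the control flow, not the actual ExecProcNode interface:

```python
# Illustrative only: a batch-producing node behind a one-tuple-at-a-time
# compatibility wrapper, mirroring the proposal that ExecProcNode drain a
# returned batch across successive calls.
class BatchingNode:
    """Underlying node that produces tuples in batches."""
    def __init__(self, tuples, batch_size):
        self.tuples = list(tuples)
        self.batch_size = batch_size

    def next_batch(self):
        batch = self.tuples[:self.batch_size]
        self.tuples = self.tuples[self.batch_size:]
        return batch            # [] means the node is exhausted

class OneTupleWrapper:
    """Callers that predate batching see exactly one tuple per call."""
    def __init__(self, node):
        self.node = node
        self.pending = []       # unconsumed remainder of the last batch

    def exec_proc_node(self):
        if not self.pending:
            self.pending = self.node.next_batch()
        return self.pending.pop(0) if self.pending else None  # None == done

wrapper = OneTupleWrapper(BatchingNode([(1,), (2,), (3,)], batch_size=2))
out = []
tup = wrapper.exec_proc_node()
while tup is not None:
    out.append(tup)
    tup = wrapper.exec_proc_node()
print(out)  # [(1,), (2,), (3,)] -- batching is invisible to the caller
```

The underlying node only runs when the cached batch is exhausted, so legacy
callers keep working unchanged while batch-aware callers could consume
`next_batch()` directly.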


> Also, and I think
> this is important, nodes don't need the permission of their parent
> node to use these new capabilities.  They can use them whenever they
> wish, without worrying about whether the upper node is prepared to
> deal with it.  If not, ExecProcNode will paper over the problem.  This
> seems to me to be a good way to keep the code simple.

Maybe not permission, but for some cases it seems important to be able to
hint to *not* prefetch a lot of rows; anti-joins come to mind. Just
forcing batching everywhere seems likely to regress some queries quite
badly (e.g. an expensive join inside an EXISTS() which returns many
tuples).


> For asynchronous execution, I have gone so far as to mock up a bit of
> what this might look like.  This shouldn't be taken very seriously at
> this point, but I'm attaching a few very-much-WIP patches to show the
> direction of my line of 

Re: [HACKERS] HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

2016-05-10 Thread Andres Freund
On 2016-05-10 15:53:38 -0700, Jeff Janes wrote:
> On Tue, May 10, 2016 at 2:00 PM, Andres Freund  wrote:
> > I think that's to blame here.  Looking at the relevant WAL record shows:
> >
> > pg_xlogdump -p /data/freund/jj -s 2/12004018 -e 2/1327EA28 | grep -E 'CHECKPOINT|NEXTOID'
> > rmgr: XLOG  len (rec/tot):  4/30, tx:  0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID 4302693
> > rmgr: XLOG  len (rec/tot): 80/   106, tx:  0, lsn: 2/12023C38, prev 2/12023C00, desc: CHECKPOINT_ONLINE redo 2/12000120; /* ... */ oid 4294501; /* ... */ online
>
> By my understanding, this is the point at which the crash occurred.

Right.

> > rmgr: XLOG  len (rec/tot): 80/   106, tx:  0, lsn: 2/1327A798, prev 2/1327A768, desc: CHECKPOINT_SHUTDOWN redo 2/1327A798; /* ... */ oid 4294501; /* ... */ shutdown
> > rmgr: XLOG  len (rec/tot):  4/30, tx:  0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID 4302693
> >
> > (note that end-of-recovery checkpoints are logged as shutdown
> > checkpoints, pretty annoying imo)
> >
> > So I think the issue is that the 2/12023C38 checkpoint *starts* before
> > the first NEXTOID, and thus gets an earlier nextoid.
>
>
> But isn't CreateCheckPoint called at the end of the checkpoint, not
> the start of it?

No, CreateCheckPoint() does it all.


CreateCheckPoint(int flags)
{
...
/* 1) determine redo pointer */
WALInsertLockAcquireExclusive();
curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
prevPtr = XLogBytePosToRecPtr(Insert->PrevBytePos);
WALInsertLockRelease();
...
/* 2) determine oid */
LWLockAcquire(OidGenLock, LW_SHARED);
checkPoint.nextOid = ShmemVariableCache->nextOid;
if (!shutdown)
checkPoint.nextOid += ShmemVariableCache->oidCount;
LWLockRelease(OidGenLock);
...
/* 3) actually checkpoint shared_buffers et al. */
CheckPointGuts(checkPoint.redo, flags);
...
/* 4) log the checkpoint */
recptr = XLogInsert(RM_XLOG_ID,
shutdown ? XLOG_CHECKPOINT_SHUTDOWN :
XLOG_CHECKPOINT_ONLINE);
...
}


> I don't understand how it could be out-of-date at that point.  But
> obviously it is.

A checkpoint logically "starts" at 1) in the above abbreviated
CreateCheckPoint(); that's where recovery starts when starting up from
that checkpoint. But in between 1) and 4) all other backends can continue
to insert WAL, and it'll be replayed *before* the checkpoint's record
itself.  That means that if some backend generates a NEXTOID record
between 2) and 4) (with larger checkpoints we're looking at minutes to
an hour of time), its effects will temporarily take effect (as in
ShmemVariableCache->nextOid is updated), but XLOG_CHECKPOINT_ONLINE's
replay will overwrite it unconditionally:
void
xlog_redo(XLogReaderState *record)
{
else if (info == XLOG_CHECKPOINT_ONLINE)
{
...
/* ... but still treat OID counter as exact */
LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
ShmemVariableCache->nextOid = checkPoint.nextOid;
ShmemVariableCache->oidCount = 0;
LWLockRelease(OidGenLock);
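A toy replay of this sequence (using the OID values from the WAL dump above
and PostgreSQL's OID prefetch quantum of 8192; everything else here is
simulation, not real WAL machinery) shows the counter going backwards:

```python
# Toy model: WAL records are replayed in LSN order, and the checkpoint
# record carries the nextOid captured at step 2 -- *before* a concurrent
# NEXTOID record that nonetheless has an earlier LSN than the checkpoint
# record written at step 4.  OID values are from the dump above.
VAR_OID_PREFETCH = 8192          # PostgreSQL's OID allocation quantum

oid_at_step2 = 4294501           # nextOid captured under OidGenLock

wal = [
    # A backend exhausts its prefetched OIDs between steps 2 and 4:
    ("NEXTOID", oid_at_step2 + VAR_OID_PREFETCH),
    # Step 4 logs the checkpoint with the stale value from step 2:
    ("CHECKPOINT_ONLINE", oid_at_step2),
]

next_oid = None
for rec_type, rec_oid in wal:    # crash recovery: strict LSN order
    next_oid = rec_oid           # both replays overwrite unconditionally

print(next_oid)  # 4294501 -- the advance to 4302693 was silently undone
```

After recovery, nextOid is back at 4294501, so the next XLogPutNextOid()
logs NEXTOID 4302693 a second time -- matching the duplicate records in the
dump.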

Makes sense?

Regards,

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

2016-05-10 Thread Tom Lane
Andres Freund  writes:
> On 2016-05-10 18:29:59 -0400, Tom Lane wrote:
>> Having said that, I still say that changing HeapTupleSatisfiesToast
>> is the wrong thing.  It can't go deciding not to return toast values
>> because they're committed dead --- the parent tuple could easily be
>> committed dead as well, and yet still be visible to our query's
>> snapshot.

> Hm. Shouldn't an MVCC snapshot be able to differentiate between those
> cases?

HeapTupleSatisfiesToast doesn't have one.  And changing things so that
toast tuples are checked using MVCC rules is the wrong thing anyway,
because it would require adding hint-bit update traffic for toast
tables.

> When are we looking up a toast tuple that's *not* visible to the
> current snapshot?

Once again, it's the parent tuple where we should be doing the
visibility check; noplace else.

regards, tom lane




Re: [HACKERS] HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

2016-05-10 Thread Jeff Janes
On Tue, May 10, 2016 at 2:00 PM, Andres Freund  wrote:
> On 2016-05-10 09:19:16 -0700, Andres Freund wrote:
>> On 2016-05-10 08:09:02 -0400, Robert Haas wrote:
>> > On Tue, May 10, 2016 at 3:05 AM, Andres Freund  wrote:
>> > > The easy way to trigger this problem would be to have an oid wraparound
>> > > - but the WAL shows that that's not the case here.  I've not figured
>> > > that one out entirely (and won't tonight). But I do see WAL records
>> > > like:
>> > > rmgr: XLOG  len (rec/tot):  4/30, tx:  0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID 4302693
>> > > rmgr: XLOG  len (rec/tot):  4/30, tx:  0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID 4302693
>> > > i.e. two NEXTOID records allocating the same range, which obviously
>> > > doesn't seem right.  There's also every now and then close by ranges:
>> > > rmgr: XLOG  len (rec/tot):  4/30, tx:  0, lsn: 1/9A404DB8, prev 1/9A404270, desc: NEXTOID 3311455
>> > > rmgr: XLOG  len (rec/tot):  4/30, tx:7814505, lsn: 1/9A4EC888, prev 1/9A4EB9D0, desc: NEXTOID 3311461
>
>> > It seems to me that the real question
>> > here is how you're getting two calls to XLogPutNextOid() with the same
>> > value of ShmemVariableCache->nextOid, and the answer, as it seems to
>> > me, must be that LWLocks are broken.
>>
>> There likely were a bunch of crashes in between; Jeff's test suite
>> triggers them at a high rate. That seems a lot more likely than an
>> lwlock bug that only materializes in the OID counter.  Investigating.
>
> void
> CreateCheckPoint(int flags)
> {
> ...
> /*
>  * An end-of-recovery checkpoint is really a shutdown checkpoint, just
>  * issued at a different time.
>  */
> if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
> shutdown = true;
> else
> shutdown = false;
> ...
>
> LWLockAcquire(OidGenLock, LW_SHARED);
> checkPoint.nextOid = ShmemVariableCache->nextOid;
> if (!shutdown)
> checkPoint.nextOid += ShmemVariableCache->oidCount;
> LWLockRelease(OidGenLock);
> ...
> recptr = XLogInsert(RM_XLOG_ID,
> shutdown ? XLOG_CHECKPOINT_SHUTDOWN :
> XLOG_CHECKPOINT_ONLINE);
> ...
> }
>
> I think that's to blame here.  Looking at the relevant WAL record shows:
>
> pg_xlogdump -p /data/freund/jj -s 2/12004018 -e 2/1327EA28 | grep -E 'CHECKPOINT|NEXTOID'
> rmgr: XLOG  len (rec/tot):  4/30, tx:  0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID 4302693
> rmgr: XLOG  len (rec/tot): 80/   106, tx:  0, lsn: 2/12023C38, prev 2/12023C00, desc: CHECKPOINT_ONLINE redo 2/12000120; /* ... */ oid 4294501; /* ... */ online

By my understanding, this is the point at which the crash occurred.

> rmgr: XLOG  len (rec/tot): 80/   106, tx:  0, lsn: 2/1327A798, prev 2/1327A768, desc: CHECKPOINT_SHUTDOWN redo 2/1327A798; /* ... */ oid 4294501; /* ... */ shutdown
> rmgr: XLOG  len (rec/tot):  4/30, tx:  0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID 4302693
>
> (note that end-of-recovery checkpoints are logged as shutdown
> checkpoints, pretty annoying imo)
>
> So I think the issue is that the 2/12023C38 checkpoint *starts* before
> the first NEXTOID, and thus gets an earlier nextoid.


But isn't CreateCheckPoint called at the end of the checkpoint, not
the start of it?

I don't understand how it could be out-of-date at that point.  But
obviously it is.


Cheers,

Jeff




Re: [HACKERS] Stopping logical replication protocol

2016-05-10 Thread Vladimir Gordiychuk
>
> Same thread, I just think these are two somewhat separate changes. One is
> just in the walsender and allows return to command mode during waiting for
> WAL. The other is more intrusive into the reorder buffer etc and allows
> aborting decoding during commit processing. So two separate patches make
> sense here IMO, one on top of the other.


About the second part of the patch: what is the reason to decode and send the
whole transaction? Why can't we process logical decoding via WalSndLoop LSN by
LSN, as it works in physical replication? For example, if a transaction spans
more than one LSN, first we decode and send "begin" and the part of the data
up to the current LSN, then return to WalSndLoop; on the next iteration we
send the remaining data and "commit". I have not researched this direction,
because I think it would be a much bigger change than the callback that stops
sending.

2016-05-10 20:22 GMT+03:00 Craig Ringer :

> On 11 May 2016 at 01:15, Vladimir Gordiychuk  wrote:
>
>> in which release can be include first part?
>>
>
> Since it's not a bug fix, I don't think it can go in before 9.7.
>
>
> --
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


Re: [HACKERS] HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

2016-05-10 Thread Andres Freund
On 2016-05-10 18:29:59 -0400, Tom Lane wrote:
> Now having said that, you don't actually need *rapid* advancement
> of the OID counter to have a potential problem.  Imagine that some
> transaction inserts a TOAST value and later fails, so that you have
> a dead-but-unhinted toast tuple sitting there.  If no VACUUM comes
> along to clean it up (quite possible if there's low activity on
> the parent table)

Or if there's primarily inserts...


> Having said that, I still say that changing HeapTupleSatisfiesToast
> is the wrong thing.  It can't go deciding not to return toast values
> because they're committed dead --- the parent tuple could easily be
> committed dead as well, and yet still be visible to our query's
> snapshot.

Hm. Shouldn't an MVCC snapshot be able to differentiate between those
cases? When are we looking up a toast tuple that's *not* visible to the
current snapshot?   I guess there could be some ugliness around
RETURNING for DELETE.   I guess we'd have to use logic akin to
SatisfiesVacuum, treating HEAPTUPLE_*_IN_PROGRESS/RECENTLY_DEAD as
alive.


> Nor is it attractive to introduce more hint-bit-setting
> requirements for toast tuples; that just increases the amount of
> hint-bit-setting disk traffic that must occur subsequently to
> any update involving toast values.

Yea, that's the big argument against going this way.


> Maybe the right fix here is for toast OID assignment to use
> GetNewOidWithIndex with SnapshotToast, or some special-purpose
> snapshot that does the right thing, so that it would see such a
> dead toast row as being a conflict.  (If you got enough such
> dead rows, this might make it hard to find a free OID; but hopefully
> vacuum would come along and fix things before it got to that.)

Seems like we'd have to either emit cleanup records, or treat even
tuples that are hinted as dead as a conflict (lest they get lost in a
crash or failover).

Greetings,

Andres Freund




Re: [HACKERS] HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

2016-05-10 Thread Tom Lane
Andres Freund  writes:
> On 2016-05-10 16:14:50 -0400, Tom Lane wrote:
>> I think the problem is pretty hypothetical until you get to consuming a
>> substantial part of the OID space within any one toast table, at which
>> point you're going to need 8-byte toast OIDs.  Improving that situation
>> seems like something we can define as a future feature.

> Doesn't really have to be within a single toast table. As the oid
> counter is global, it just needs to advance quickly enough globally.

I don't think that alters the conclusion.  To get rapid advancement
through the whole 4G OID space, you'd need densely packed existing
OIDs in all of the toast tables being inserted into.  And the more
toast tables in play, the less likely any one of them is densely
packed, because it's likely some OIDs were assigned into other
toast tables.

Now having said that, you don't actually need *rapid* advancement
of the OID counter to have a potential problem.  Imagine that some
transaction inserts a TOAST value and later fails, so that you have
a dead-but-unhinted toast tuple sitting there.  If no VACUUM comes
along to clean it up (quite possible if there's low activity on
the parent table), then after 4G uses of OIDs, you could be unlucky
enough that the next attempted insertion into that TOAST table chooses
the same OID.  Now, that choosing would be done by GetNewOidWithIndex
which would have to have seen the dead toast tuple; but if its hinting of
the tuple got lost somehow then you could have two tuples with the same
OID and neither hinted as dead.  Then subsequent TOAST fetches have
a problem that they might take the wrong one.
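That sequence can be caricatured in a few lines (a pure simulation with
invented helper names; `oid_looks_free` only loosely stands in for
GetNewOidWithIndex's duplicate check, and the final fetch for
HeapTupleSatisfiesToast's trust-everything behavior):

```python
# Toy model of the duplicate-OID hazard: a committed-dead but unvacuumed
# toast tuple does not block reuse of its OID after the counter wraps,
# and a fetch that trusts every tuple it sees is then ambiguous.
toast = []  # each entry: {"oid", "payload", "dead", "hinted_dead"}

def insert_toast(oid, payload):
    toast.append({"oid": oid, "payload": payload,
                  "dead": False, "hinted_dead": False})

def oid_looks_free(oid):
    # Stand-in for the duplicate check: a tuple known to be dead
    # does not hold the OID.
    return all(t["dead"] for t in toast if t["oid"] == oid)

insert_toast(1000, "old value")   # inserting transaction later rolls back
toast[0]["dead"] = True           # committed dead, but never vacuumed away

# ...4G OID assignments later, the counter wraps around to 1000 again:
assert oid_looks_free(1000)       # the dead tuple doesn't prevent reuse
insert_toast(1000, "new value")

# A toast fetch that treats all same-OID tuples as valid is now ambiguous:
matches = [t["payload"] for t in toast if t["oid"] == 1000]
print(matches)  # ['old value', 'new value'] -- it may pick the wrong one
```

The point of the toy is only the last line: with two same-OID tuples and no
reliable death marker, whichever tuple the scan finds first wins.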

Having said that, I still say that changing HeapTupleSatisfiesToast
is the wrong thing.  It can't go deciding not to return toast values
because they're committed dead --- the parent tuple could easily be
committed dead as well, and yet still be visible to our query's
snapshot.  Nor is it attractive to introduce more hint-bit-setting
requirements for toast tuples; that just increases the amount of
hint-bit-setting disk traffic that must occur subsequently to
any update involving toast values.

Maybe the right fix here is for toast OID assignment to use
GetNewOidWithIndex with SnapshotToast, or some special-purpose
snapshot that does the right thing, so that it would see such a
dead toast row as being a conflict.  (If you got enough such
dead rows, this might make it hard to find a free OID; but hopefully
vacuum would come along and fix things before it got to that.)

regards, tom lane




Re: [HACKERS] Perf Benchmarking and regression.

2016-05-10 Thread Andres Freund
Hi,

On 2016-05-06 21:21:11 +0530, Mithun Cy wrote:
> I will try to run the tests as you have suggested and will report the same.

Any news on that front?

Regards,

Andres




Re: [HACKERS] HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

2016-05-10 Thread Andres Freund
On 2016-05-10 13:17:52 -0700, Jeff Janes wrote:
> On Tue, May 10, 2016 at 9:19 AM, Andres Freund  wrote:
> > On 2016-05-10 08:09:02 -0400, Robert Haas wrote:
> >> On Tue, May 10, 2016 at 3:05 AM, Andres Freund  wrote:
> >> > The easy way to trigger this problem would be to have an oid wraparound
> >> > - but the WAL shows that that's not the case here.  I've not figured
> >> > that one out entirely (and won't tonight). But I do see WAL records
> >> > like:
> >> > rmgr: XLOG  len (rec/tot):  4/30, tx:  0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID 4302693
> >> > rmgr: XLOG  len (rec/tot):  4/30, tx:  0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID 4302693
> 
> Were there any CHECKPOINT_SHUTDOWN records, or any other NEXTOID
> records, between those two records you show?

Yes, check 
http://www.postgresql.org/message-id/20160510210013.2akn4iee7gl4y...@alap3.anarazel.de

I think the explanation about how the bug is occurring there makes sense.


> My current test harness updates the scalar count field on every
> iteration, but changes the (probably toasted) text_array field with a
> probability of only 1% each time.  Perhaps making that more likely (by
> changing line 186 of count.pl) would make it easier to trigger the
> bug.  I'll try that in my next iteration of tests.

So my current theory about why the whole thing is kinda hard to
reproduce is that "luck" determines how aggressively the toast table is
vacuumed, and how often that vacuuming actually succeeds. You also
need a good bit of bad luck for the hint bits set by GetNewOidWithIndex()
to not survive, given that shared_buffers is pretty small *and* checksums
are enabled.

I guess testing with a bigger shared memory and without checksums will
make it easier to hit the bug.

Regards,

Andres




Re: [HACKERS] HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

2016-05-10 Thread Andres Freund
On 2016-05-10 16:14:50 -0400, Tom Lane wrote:
> I think the problem is pretty hypothetical until you get to consuming a
> substantial part of the OID space within any one toast table, at which
> point you're going to need 8-byte toast OIDs.  Improving that situation
> seems like something we can define as a future feature.

Doesn't really have to be within a single toast table. As the oid
counter is global, it just needs to advance quickly enough globally.


> (Note that I do not believe for a moment that Andres has actually
> exhibited such a problem in his test case.  He'd have needed several
> TB worth of TOAST space to get to that point.)

That's good, because I explicitly said that there was no wraparound
involved in the test case.


> >> In principle, you could support existing TOAST tables and pointers
> >> containing 4-byte IDs in parallel with the new ones.  Not sure how
> >> pg_upgrade would handle it exactly though.
> 
> > I suppose the real problem is that there's no way to have a mix of 4-
> > and 8-byte identifiers in the same toast table.  I suppose we could have
> > two toast tables for each regular table, but that sounds pretty painful.
> 
> Hmm.  It's not impossible by any means, since you could tell by looking
> at a toast OID which table it was in.  And that way might provide a less
> painful upgrade process for an existing table that grew to the point of
> needing bigger OIDs.

The catalog representation (as in pg_class.reltoastrelid) isn't entirely
clear to me.  One way would be to invert pg_class.reltoastrelid's
meaning and have the toast table point to the table it stores values
for. That'd also open the potential of having one toast table per column
and such.

Greetings,

Andres Freund




Re: [HACKERS] HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

2016-05-10 Thread Andres Freund
On 2016-05-10 15:20:39 -0300, Alvaro Herrera wrote:
> Tom Lane wrote:
> > Robert Haas  writes:
> > > On Tue, May 10, 2016 at 12:19 PM, Andres Freund  
> > > wrote:
> > >> It's not super likely, yea. But you don't really need to "use" 4 billion
> > >> oids to get a wraparound. Once you have a significant number of values
> > >> in various toast tables, the oid counter progresses really rather fast,
> > >> without many writes. That's because the oid counter is global, but each
> > >> individual toast write (and other things), perform checks via
> > >> GetNewOidWithIndex().
> > 
> > > Understood.
> > 
> > Sooner or later we are going to need to go to 8-byte TOAST object
> > identifiers.  Maybe we should think about doing that sooner not later
> > rather than trying to invent some anti-wraparound solution here.
> 
> Umm, it seems to me like we need this fixed in supported branches, not
> just 9.7, so I don't think 8-byte toast IDs are a reasonable solution at
> this point.

Agreed. To me it seems we need to make HeapTupleSatisfiesToast() safe,
and other improvements are tangential.

Andres




Re: [HACKERS] what to revert

2016-05-10 Thread Andres Freund
On 2016-05-10 13:36:32 -0400, Robert Haas wrote:
> On Tue, May 10, 2016 at 12:31 PM, Tomas Vondra
>  wrote:
> > The following table shows the differences between the disabled and reverted
> > cases like this:
> >
> > sum('reverted' results with N clients)
> > - 1.0
> > sum('disabled' results with N clients)
> >
> > for each scale/client count combination. So for example 4.83% means with a
> > single client on the smallest data set, the sum of the 5 runs for reverted
> > was about 1.0483x than for disabled.
> >
> > scale        1       16       32       64      128
> > 100      4.83%    2.84%    1.21%    1.16%    3.85%
> > 3000     1.97%    0.83%    1.78%    0.09%    7.70%
> > 10000   -6.94%   -5.24%  -12.98%   -3.02%   -8.78%
>
> /me scratches head.
>
> That doesn't seem like noise, but I don't understand the
> scale-factor-10000 results either.

Hm. Could you change max_connections by 1 and 2 and run the 10k tests
again for each value? I wonder whether we're seeing the effect of changed
shared-memory alignment.
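For reference, the percentages in the quoted table are computed as
sum(reverted) / sum(disabled) - 1 for each scale/client cell. A small sketch
with made-up per-run numbers (the real raw tps values live in Tomas's
spreadsheet, not here):

```python
# The "reverted vs. disabled" delta quoted above:
#     sum(reverted tps) / sum(disabled tps) - 1.0, shown as a percentage.
def delta_pct(reverted_runs, disabled_runs):
    return (sum(reverted_runs) / sum(disabled_runs) - 1.0) * 100.0

# Illustrative numbers only (5 runs per cell, as in the quoted benchmark):
reverted = [10483.0] * 5
disabled = [10000.0] * 5
print(f"{delta_pct(reverted, disabled):+.2f}%")  # +4.83%
```

A 1.0483x sum ratio is exactly the "4.83%" in the 1-client/scale-100 cell.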

Greetings,

Andres Freund




Re: [HACKERS] what to revert

2016-05-10 Thread Kevin Grittner
On Tue, May 10, 2016 at 2:41 PM, Kevin Grittner  wrote:

> http://www.postgresql.org/message-id/flat/1402267501.4.yahoomail...@web122304.mail.ne1.yahoo.com

Re-reading that thread I was reminded that I had more NUMA problems
when data all landed in one memory node, as can happen with pgbench
-i.  Note that at scale 100 and 3000 all data would fit in one NUMA
node, and very likely was all going through one CPU socket.  The 2
hour warm-up might have rebalanced that to some degree or other,
but as an alternative to the cpuset and NUMA patch, you could stop
PostgreSQL after the initialize, discard OS cache, and start fresh
before your warm-up.  That should accomplish about the same thing
-- to better balance the use of memory across the memory nodes and
CPU sockets.

On most NUMA systems you can use this command to see how much
memory is in use on which nodes:

numactl --hardware

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Bert
Hmm, the morsels paper looks really interesting at first sight.
Let's see if we can get a PoC working in PostgreSQL? :-)

On Tue, May 10, 2016 at 9:42 PM, Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

> On 05/10/2016 08:26 PM, Robert Haas wrote:
>
>> On Tue, May 10, 2016 at 3:00 AM, konstantin knizhnik
>>  wrote:
>>
>>> What's wrong with a worker being blocked? You can just have more
>>> workers (more than CPU cores) to let the others continue doing useful
>>> work.
>>>
>> Not really.  The workers are all running the same plan, so they'll all
>> make the same decision about which node needs to be executed next.  If
>> that node can't accommodate multiple processes trying to execute it at
>> the same time, it will have to block all of them but the first one.
>> Adding more processes just increases the number of processes sitting
>> around doing nothing.
>>
>
> Doesn't this actually mean that we need a normal job scheduler which
> is given a queue of jobs and, having some pool of threads, is able to
> organize efficient execution of queries? The optimizer can build a pipeline
> (graph) of tasks corresponding to execution plan nodes, i.e. SeqScan,
> Sort, ... Each task is split into several jobs which can be concurrently
> scheduled by a task dispatcher.  Then you will not have blocked workers
> waiting for something, and all system resources will be utilized. Such an
> approach with a dispatcher allows implementing quotas, priorities, ... The
> dispatcher can also take care of NUMA and cache optimizations, which is
> especially critical on modern architectures. One more reference:
> http://db.in.tum.de/~leis/papers/morsels.pdf
>
> Sorry, maybe I am wrong, but I still think that async ops are
> "multitasking for the poor" :)
> Yes, maintaining threads and especially separate processes adds
> significant overhead. It leads to extra resource consumption, and context
> switches are quite expensive. And I know from my own experience that
> replacing several concurrent processes performing some IO (for example with
> sockets) with just one process using multiplexing can increase
> performance. But still, async ops are just a way to make the programmer
> responsible for managing a state machine instead of relying on the OS to
> make context switches. A manual transmission is still more efficient than
> an automatic one, but most drivers prefer the latter ;)
>
> Seriously, I carefully read your response to Kochei, but am still not
> convinced that async ops are what we need.  Or maybe we just understand
> different things by this notion.
>
>
>
>
>>> But there is some research, for example:
>>>
>>> http://www.vldb.org/pvldb/vol4/p539-neumann.pdf
>>>
>>> showing that the same or even better effect can be achieved by generating
>>> native code for the query execution plan (which is not so difficult now,
>>> thanks to LLVM).
>>> It eliminates interpretation overhead and increases cache locality.
>>> I got similar results in my own experiments with accelerating SparkSQL.
>>> Instead of native code generation I used conversion of query plans to C
>>> code, and experimented with different data representations. A "horizontal
>>> model" with loading columns on demand showed better performance than a
>>> columnar store.
>>>
>> Yes, I think this approach should also be considered.
>>
>>> At this moment (February) they have implemented translation of only a few
>>> PostgreSQL operators used by ExecQual, and do not support aggregates.
>>> They get about a 2x speedup on synthetic queries and a 25% improvement
>>> on TPC-H Q1 (for Q1 the most critical part is generation of native code
>>> for aggregates, because ExecQual itself takes only 6% of the time for
>>> this query).
>>> Actually these 25% for Q1 were achieved not by using dynamic code
>>> generation, but by switching from a PULL to a PUSH model in the executor.
>>> It seems to be yet another interesting PostgreSQL executor
>>> transformation.
>>> As far as I know, they are going to publish the results of their work as
>>> open source...
>>>
>> Interesting.  You may notice that in "asynchronous mode" my prototype
>> works using a push model of sorts.  Maybe that should be taken
>> further.
>>
>>
>
> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>



-- 
Bert Desmet
0477/305361


Re: [HACKERS] HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

2016-05-10 Thread Andres Freund
On 2016-05-10 09:19:16 -0700, Andres Freund wrote:
> On 2016-05-10 08:09:02 -0400, Robert Haas wrote:
> > On Tue, May 10, 2016 at 3:05 AM, Andres Freund  wrote:
> > > The easy way to trigger this problem would be to have an oid wraparound
> > > - but the WAL shows that that's not the case here.  I've not figured
> > > that one out entirely (and won't tonight). But I do see WAL records
> > > like:
> > > rmgr: XLOG        len (rec/tot):  4/30, tx:       0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID 4302693
> > > rmgr: XLOG        len (rec/tot):  4/30, tx:       0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID 4302693
> > > i.e. two NEXTOID records allocating the same range, which obviously
> > > doesn't seem right.  There's also every now and then close-by ranges:
> > > rmgr: XLOG        len (rec/tot):  4/30, tx:       0, lsn: 1/9A404DB8, prev 1/9A404270, desc: NEXTOID 3311455
> > > rmgr: XLOG        len (rec/tot):  4/30, tx: 7814505, lsn: 1/9A4EC888, prev 1/9A4EB9D0, desc: NEXTOID 3311461

> > It seems to me that the real question
> > here is how you're getting two calls to XLogPutNextOid() with the same
> > value of ShmemVariableCache->nextOid, and the answer, as it seems to
> > me, must be that LWLocks are broken.
> 
> There likely were a bunch of crashes in between, Jeff's test suite
> triggers them at a high rate. It seems a lot more likely than that an
> lwlock bug only materializes in the oid counter.  Investigating.

void
CreateCheckPoint(int flags)
{
    ...
    /*
     * An end-of-recovery checkpoint is really a shutdown checkpoint, just
     * issued at a different time.
     */
    if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
        shutdown = true;
    else
        shutdown = false;
    ...

    LWLockAcquire(OidGenLock, LW_SHARED);
    checkPoint.nextOid = ShmemVariableCache->nextOid;
    if (!shutdown)
        checkPoint.nextOid += ShmemVariableCache->oidCount;
    LWLockRelease(OidGenLock);
    ...
    recptr = XLogInsert(RM_XLOG_ID,
                        shutdown ? XLOG_CHECKPOINT_SHUTDOWN :
                        XLOG_CHECKPOINT_ONLINE);
    ...
}

I think that's to blame here.  Looking at the relevant WAL records shows:

pg_xlogdump -p /data/freund/jj -s 2/12004018 -e 2/1327EA28 | grep -E 'CHECKPOINT|NEXTOID'
rmgr: XLOG        len (rec/tot):  4/30, tx: 0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID 4302693
rmgr: XLOG        len (rec/tot): 80/106, tx: 0, lsn: 2/12023C38, prev 2/12023C00, desc: CHECKPOINT_ONLINE redo 2/12000120; /* ... */ oid 4294501; /* ... */ online
rmgr: XLOG        len (rec/tot): 80/106, tx: 0, lsn: 2/1327A798, prev 2/1327A768, desc: CHECKPOINT_SHUTDOWN redo 2/1327A798; /* ... */ oid 4294501; /* ... */ shutdown
rmgr: XLOG        len (rec/tot):  4/30, tx: 0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID 4302693

(note that end-of-recovery checkpoints are logged as shutdown
checkpoints, which is pretty annoying imo)

So I think the issue is that the 2/12023C38 checkpoint *starts* before
the first NEXTOID record, and thus records an earlier nextOid.  The second -
shutdown/end-of-recovery - checkpoint then takes the shutdown branch above
(the !shutdown test fails) and doesn't add oidCount.  Thus after the
crash we continue with a repeated NEXTOID.
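That accounting can be boiled down to a toy model (all names here are invented for illustration; the real logic lives in CreateCheckPoint() and XLogPutNextOid()). An online checkpoint adds oidCount so replay skips past OIDs that may already be handed out, while the shutdown/end-of-recovery path records the bare counter, which is stale if allocations were in flight. Note that 4294501 + 8192 = 4302693, the repeated NEXTOID value in the dump above, assuming the usual 8192-OID prefetch granularity:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the interleaving described above; not the real
 * PostgreSQL structures. */
typedef struct
{
    uint32_t next_oid;   /* ShmemVariableCache->nextOid analogue */
    uint32_t oid_count;  /* OIDs already handed out from the logged range */
} OidState;

enum { OID_PREFETCH = 8192 };   /* VAR_OID_PREFETCH analogue */

/* An online checkpoint records nextOid + oidCount, so that restarting
 * from it skips any OIDs that may already have been consumed.  The
 * shutdown path records the bare counter. */
static uint32_t
checkpoint_oid(const OidState *s, int is_shutdown)
{
    return is_shutdown ? s->next_oid : s->next_oid + s->oid_count;
}

/* A NEXTOID record logs the next range boundary. */
static uint32_t
nextoid_record(OidState *s)
{
    s->next_oid += OID_PREFETCH;
    s->oid_count = 0;
    return s->next_oid;
}
```

The end-of-recovery checkpoint takes the is_shutdown path, so after the crash we restart from the stale value and eventually log the same NEXTOID range a second time.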

There's this remark in xlog_redo():

    /*
     * We used to try to take the maximum of ShmemVariableCache->nextOid
     * and the recorded nextOid, but that fails if the OID counter wraps
     * around.  Since no OID allocation should be happening during replay
     * anyway, better to just believe the record exactly.  We still take
     * OidGenLock while setting the variable, just in case.
     */

I think that was perhaps not the best fix :(


I guess what we should do is to only use checkPoint.nextOid when
starting up from a checkpoint, and entirely rely on NEXTOID otherwise?

Regards,

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Jim Nasby

On 5/10/16 12:47 AM, Kouhei Kaigai wrote:

> On 10 May 2016 at 13:38, Kouhei Kaigai  wrote:

> > My concern about ExecProcNode is that it is constructed with a large switch
> > ... case statement. It involves tons of comparison operations at run-time.
> > If we replaced this switch ... case with function pointers, it would
> > probably improve performance, especially for OLAP workloads that process
> > large amounts of rows.

>
> I imagined that any decent compiler would have built the code to use
> jump tables for this. I have to say that I've never checked to make
> sure though.
>

Ah, indeed, you are right. Please forget the above part.


Even so, I would think that the simplification in the executor would be 
worth it. If you need to add a new node there are dozens of places where 
you might have to mess with these giant case statements.


In Python (for example), types (equivalent to nodes in this case) have 
data structures that define function pointers for a core set of 
operations (such as doing garbage collection, or generating a string 
representation). That means that to add a new type at the C level you 
only need to define a C structure that has the expected members, and an 
initializer function that will properly set everything up when you 
create a new instance. There's no messing around with the rest of the 
guts of Python.


*Even more important, everything you need to know about the type is 
contained in one place, not spread throughout the code.*
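In C that Python-style dispatch looks roughly like the sketch below. This is only an illustration of the technique; the struct and function names are invented here, not taken from the executor:

```c
#include <assert.h>

/* Each node carries a function pointer set once at initialization,
 * replacing the big run-time switch on node type. */
typedef struct Node Node;
typedef int (*ExecProc) (Node *node);

struct Node
{
    ExecProc exec;   /* per-node-type implementation */
    int      value;  /* toy payload */
};

/* Two toy "node types" supplying their own implementations... */
static int
ExecToyScan(Node *node)
{
    return node->value;
}

static int
ExecToyDouble(Node *node)
{
    return node->value * 2;
}

/* ...and the central entry point shrinks to an indirect call.  Adding a
 * new node type means defining a function and setting the pointer --
 * nothing central needs to change. */
static int
ExecNode(Node *node)
{
    return node->exec(node);
}
```

Everything you need to know about a node type then lives next to its implementation, exactly the property the paragraph above describes.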

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_dump dump catalog ACLs

2016-05-10 Thread Stephen Frost
Alvaro,

* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
> The subquery comparing the OID of pg_class using only a condition on
> relname seems wrong; wouldn't it fail or produce wrong results if
> somebody creates a table named pg_class in another schema?  I think you
> should write the comparison like this instead:
>   classoid = 'pg_catalog.pg_class'::regclass

Thanks, patch attached which does that (qualifying to the same level as
the surrounding query for each).

I've run it through my tests and will plan to push it tomorrow.

Thanks!

Stephen
From e294008daa8e909059c94441643157fddf9af9b6 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Tue, 10 May 2016 12:55:11 -0400
Subject: [PATCH] Qualify table usage in dumpTable() and use regclass

All of the other tables used in the query in dumpTable(), which is
collecting column-level ACLs, are qualified, so we should be qualifying
the pg_init_privs and related sub-select against pg_class too.

Also, use ::regclass (or ::pg_catalog.regclass, where appropriate)
instead of using a poorly constructed query to get the OID for various
catalog tables.

Issues identified by Noah and Alvaro, patch by me.
---
 src/bin/pg_dump/pg_dump.c | 59 ---
 1 file changed, 30 insertions(+), 29 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 1267afb..1d985e4 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -2825,8 +2825,8 @@ getBlobs(Archive *fout)
 		  "%s AS initrlomacl "
 		  "FROM pg_largeobject_metadata l "
 		  "LEFT JOIN pg_init_privs pip ON "
-		  "(l.oid = pip.objoid AND pip.classoid = "
-"(SELECT oid FROM pg_class WHERE relname = 'pg_largeobject')"
+		  "(l.oid = pip.objoid "
+		  "AND pip.classoid = 'pg_largeobject'::regclass "
 		  "AND pip.objsubid = 0) ",
 		  username_subquery,
 		  acl_subquery->data,
@@ -3569,8 +3569,8 @@ getNamespaces(Archive *fout, int *numNamespaces)
 		  "%s as initrnspacl "
 		  "FROM pg_namespace n "
 		  "LEFT JOIN pg_init_privs pip "
-		  "ON (n.oid = pip.objoid AND pip.classoid = "
- "(SELECT oid FROM pg_class WHERE relname = 'pg_namespace') "
+		  "ON (n.oid = pip.objoid "
+		  "AND pip.classoid = 'pg_namespace'::regclass "
 		  "AND pip.objsubid = 0) ",
 		  username_subquery,
 		  acl_subquery->data,
@@ -3851,9 +3851,9 @@ getTypes(Archive *fout, int *numTypes)
 		  "t.typname[0] = '_' AND t.typelem != 0 AND "
 		  "(SELECT typarray FROM pg_type te WHERE oid = t.typelem) = t.oid AS isarray "
 		  "FROM pg_type t "
-		  "LEFT JOIN pg_init_privs pip "
-		  "ON (t.oid = pip.objoid AND pip.classoid = "
-	  "(SELECT oid FROM pg_class WHERE relname = 'pg_type') "
+		  "LEFT JOIN pg_init_privs pip ON "
+		  "(t.oid = pip.objoid "
+		  "AND pip.classoid = 'pg_type'::regclass "
 		  "AND pip.objsubid = 0) ",
 		  acl_subquery->data,
 		  racl_subquery->data,
@@ -4713,8 +4713,8 @@ getAggregates(Archive *fout, int *numAggs)
 		  "%s AS initraggacl "
 		  "FROM pg_proc p "
 		  "LEFT JOIN pg_init_privs pip ON "
-		  "(p.oid = pip.objoid AND pip.classoid = "
-	  "(SELECT oid FROM pg_class WHERE relname = 'pg_proc') "
+		  "(p.oid = pip.objoid "
+		  "AND pip.classoid = 'pg_proc'::regclass "
 		  "AND pip.objsubid = 0) "
 		  "WHERE p.proisagg AND ("
 		  "p.pronamespace != "
@@ -4956,8 +4956,8 @@ getFuncs(Archive *fout, int *numFuncs)
 		  "(%s p.proowner) AS rolname "
 		  "FROM pg_proc p "
 		  "LEFT JOIN pg_init_privs pip ON "
-		  "(p.oid = pip.objoid AND pip.classoid = "
-	  "(SELECT oid FROM pg_class WHERE relname = 'pg_proc') "
+		  "(p.oid = pip.objoid "
+		  "AND pip.classoid = 'pg_proc'::regclass "
 		  "AND pip.objsubid = 0) "
 		  "WHERE NOT proisagg "
 		  "AND NOT EXISTS (SELECT 1 FROM pg_depend "
@@ -5246,9 +5246,10 @@ getTables(Archive *fout, int *numTables)
 		  "CASE WHEN 'check_option=local' = ANY (c.reloptions) THEN 'LOCAL'::text "
 		  "WHEN 'check_option=cascaded' = ANY (c.reloptions) THEN 'CASCADED'::text ELSE NULL END AS checkoption, "
 		  "tc.reloptions AS toast_reloptions, "
-		  "EXISTS (SELECT 1 FROM pg_attribute at LEFT JOIN pg_init_privs pip ON"
-		  "(c.oid = pip.objoid AND pip.classoid = "
-		  "(SELECT oid FROM pg_class WHERE relname = 'pg_class') AND pip.objsubid = at.attnum)"
+		  "EXISTS (SELECT 1 FROM pg_attribute at LEFT JOIN pg_init_privs pip ON "
+		  "(c.oid = pip.objoid "
+		  "AND pip.classoid = 'pg_class'::regclass "
+		  "AND pip.objsubid = at.attnum)"
 		  "WHERE at.attrelid = c.oid AND ("
 		  "%s IS NOT NULL "
 		  "OR %s IS NOT NULL "
@@ -5264,9 +5265,9 @@ getTables(Archive *fout, int *numTables)
 		  "d.refclassid = c.tableoid AND d.deptype = 'a') "
 	   "LEFT JOIN pg_class tc ON (c.reltoastrelid = 

Re: [HACKERS] HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

2016-05-10 Thread Jeff Janes
On Tue, May 10, 2016 at 9:19 AM, Andres Freund  wrote:
> On 2016-05-10 08:09:02 -0400, Robert Haas wrote:
>> On Tue, May 10, 2016 at 3:05 AM, Andres Freund  wrote:
>> > The easy way to trigger this problem would be to have an oid wraparound
>> > - but the WAL shows that that's not the case here.  I've not figured
>> > that one out entirely (and won't tonight). But I do see WAL records
>> > like:
>> > rmgr: XLOGlen (rec/tot):  4/30, tx:  0, lsn: 
>> > 2/12004018, prev 2/12003288, desc: NEXTOID 4302693
>> > rmgr: XLOGlen (rec/tot):  4/30, tx:  0, lsn: 
>> > 2/1327EA08, prev 2/1327DC60, desc: NEXTOID 4302693

Were there any CHECKPOINT_SHUTDOWN records, or any other NEXTOID
records, between those two records you show?


My current test harness updates the scalar count field on every
iteration, but changes the (probably toasted) text_array field with a
probability of only 1% each time.  Perhaps making that more likely (by
changing line 186 of count.pl) would make it easier to trigger the
bug.  I'll try that in my next iteration of tests.

Cheers,

Jeff


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

2016-05-10 Thread Tom Lane
Alvaro Herrera  writes:
> Tom Lane wrote:
>> Sooner or later we are going to need to go to 8-byte TOAST object
>> identifiers.  Maybe we should think about doing that sooner not later
>> rather than trying to invent some anti-wraparound solution here.

> Umm, it seems to me like we need this fixed in supported branches, not
> just 9.7, so I don't think 8-byte toast IDs are a reasonable solution at
> this point.

I think the problem is pretty hypothetical until you get to consuming a
substantial part of the OID space within any one toast table, at which
point you're going to need 8-byte toast OIDs.  Improving that situation
seems like something we can define as a future feature.

(Note that I do not believe for a moment that Andres has actually
exhibited such a problem in his test case.  He'd have needed several
TB worth of TOAST space to get to that point.)

>> In principle, you could support existing TOAST tables and pointers
>> containing 4-byte IDs in parallel with the new ones.  Not sure how
>> pg_upgrade would handle it exactly though.

> I suppose the real problem is that there's no way to have a mix of 4-
> and 8-byte identifiers in the same toast table.  I suppose we could have
> two toast tables for each regular table, but that sounds pretty painful.

Hmm.  It's not impossible by any means, since you could tell by looking
at a toast OID which table it was in.  And that way might provide a less
painful upgrade process for an existing table that grew to the point of
needing bigger OIDs.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Konstantin Knizhnik

On 05/10/2016 08:26 PM, Robert Haas wrote:

On Tue, May 10, 2016 at 3:00 AM, konstantin knizhnik
 wrote:

What's wrong with it that worker is blocked? You can just have more workers
(more than CPU cores) to let other of them continue to do useful work.

Not really.  The workers are all running the same plan, so they'll all
make the same decision about which node needs to be executed next.  If
that node can't accommodate multiple processes trying to execute it at
the same time, it will have to block all of them but the first one.
Adding more processes just increases the number of processes sitting
around doing nothing.


Doesn't this actually mean that we need a normal job scheduler, which is given
a queue of jobs and, with some pool of threads, is able to organize efficient
execution of queries? The optimizer can build a pipeline (graph) of tasks
corresponding to the execution plan nodes, i.e. SeqScan, Sort, ... Each task is
split into several jobs which can be concurrently scheduled by a task
dispatcher.  So you will not have a blocked worker waiting for something, and
all system resources will be utilized. Such an approach with a dispatcher also
allows quotas and priorities to be implemented, ... And the dispatcher can take
care of NUMA and cache optimizations, which is especially critical on modern
architectures. One more reference: http://db.in.tum.de/~leis/papers/morsels.pdf


Sorry, maybe I am wrong, but I still think that async ops are "multitasking for
the poor" :)
Yes, maintaining threads and especially separate processes adds significant
overhead. It leads to extra resource consumption, and context switches are
quite expensive. And I know from my own experience that replacing several
concurrent processes performing some IO (for example with sockets) with just
one process using multiplexing allows performance to be increased. But still,
async ops are just a way to make the programmer responsible for managing the
state machine instead of relying on the OS to make context switches.
A manual transmission is still more efficient than an automatic transmission,
but most drivers still prefer the latter ;)


Seriously, I carefully read your response to Kochei, but I am still not
convinced that async ops are what we need.  Or maybe we just understand
different things by this notion.





But there is some research, for example:

http://www.vldb.org/pvldb/vol4/p539-neumann.pdf

showing that the same or an even better effect can be achieved by generating
native code for the query execution plan (which is not so difficult now,
thanks to LLVM).
It eliminates interpretation overhead and increases cache locality.
I got similar results in my own experiments accelerating SparkSQL.
Instead of native code generation I used conversion of query plans to C code
and experimented with different data representations. A "horizontal model"
with loading of columns on demand shows better performance than a columnar
store.

Yes, I think this approach should also be considered.


At this moment (February) they have implemented translation of only a few
PostgreSQL operators used by ExecQual and do not support aggregates.
They get about a 2x increase in speed on synthetic queries and a 25%
increase on TPC-H Q1 (for Q1 the most critical part is generation of native
code for aggregates, because ExecQual itself takes only 6% of the time for
this query).
Actually these 25% for Q1 were achieved not by using dynamic code
generation, but by switching from a PULL to a PUSH model in the executor.
It seems to be yet another interesting PostgreSQL executor transformation.
As far as I know, they are going to publish the result of their work as
open source...

Interesting.  You may notice that in "asynchronous mode" my prototype
works using a push model of sorts.  Maybe that should be taken
further.
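The pull-versus-push distinction discussed above can be shown in a few lines of toy C (not the PostgreSQL executor; the "plan" is just: scan 1..n, keep the even values, sum them):

```c
#include <assert.h>

/* Pull ("Volcano") style: the consumer drives, calling next() on its
 * child and interpreting one tuple at a time. */
typedef struct
{
    int cur;
    int max;
} Scan;

static int
scan_next(Scan *s)              /* returns 0 when exhausted */
{
    return s->cur < s->max ? ++s->cur : 0;
}

static long
sum_evens_pull(int n)
{
    Scan s = { 0, n };
    long sum = 0;
    int  v;

    while ((v = scan_next(&s)) != 0)
        if (v % 2 == 0)
            sum += v;
    return sum;
}

/* Push style: the producer drives the loop and pushes each tuple into
 * its consumers; filter and aggregate fuse into one tight loop, which
 * is also the shape a native-code-generating executor emits. */
static long
sum_evens_push(int n)
{
    long sum = 0;

    for (int v = 1; v <= n; v++)    /* producer */
        if (v % 2 == 0)             /* consumer: filter + aggregate */
            sum += v;
    return sum;
}
```

Both compute the same answer; the push form simply has no per-tuple function-call boundary between operators.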




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] what to revert

2016-05-10 Thread Kevin Grittner
On Tue, May 10, 2016 at 11:13 AM, Tomas Vondra
 wrote:
> On 05/10/2016 10:29 AM, Kevin Grittner wrote:
>> On Mon, May 9, 2016 at 9:01 PM, Tomas Vondra  
>> wrote:

>>> * It's also seems to me the feature greatly amplifies the
>>> variability of the results, somehow. It's not uncommon to see
>>> results like this:
>>>
>>>  master-10-new-2  235516 331976 133316 155563 133396
>>>
>>> where after the first runs (already fairly variable) the
>>> performance tanks to ~50%. This happens particularly with higher
>>> client counts, otherwise the max-min is within ~5% of the max.
>>> There are a few cases where this happens without the feature
>>> (i.e. old master, reverted or disabled), but it's usually much
>>> smaller than with it enabled (immediate, 10 or 60). See the
>>> 'summary' sheet in the ODS spreadsheet.

Just to quantify that with standard deviations:

standard deviation - revert
scale1   16   32  64  128
100386 1874 3661810026587
3000   609 2236 4570897441004
1  257 4356 1350 89112909

standard deviation - disabled
scale1   16   32  64  128
100641 1924 2983   12575 9411
3000   206 2321 5477238045779
1 22361037611439965310436

>>> I don't know what's the problem here - at first I thought that
>>> maybe something else was running on the machine, or that
>>> anti-wraparound autovacuum kicked in, but that seems not to be
>>> the case. There's nothing like that in the postgres log (also
>>> included in the .tgz).
>>
>> I'm inclined to suspect NUMA effects.  It would be interesting to
>> try with the NUMA patch and cpuset I submitted a while back or with
>> fixes in place for the Linux scheduler bugs which were reported
>> last month.  Which kernel version was this?
>
> I can try that, sure. Can you point me to the last versions of the
> patches, possibly rebased to current master if needed?

The initial thread (for explanation and discussion context) for my
attempt to do something about some NUMA problems I ran into is at:

http://www.postgresql.org/message-id/flat/1402267501.4.yahoomail...@web122304.mail.ne1.yahoo.com

Note that in my tests at the time, the cpuset configuration made a
bigger difference than the patch, and both together typically only
made about a 2% difference in the NUMA test environment I was
using.  I would sometimes see a difference as big as 20%, but had
no idea how to repeat that.

> The kernel is 3.19.0-031900-generic

So that kernel is recent enough to have acquired the worst of the
scheduling bugs, known to slow down one NASA high-concurrency
benchmark by 138x.  To quote from the recent paper by Lozi, et
al[1]:

| The Missing Scheduling Domains bug causes all threads of the
| applications to run on a single node instead of eight. In some
| cases, the performance impact is greater than the 8x slowdown
| that one would expect, given that the threads are getting 8x less
| CPU time than they would without the bug (they run on one node
| instead of eight). lu, for example, runs 138x faster!
| Super-linear slowdowns occur in cases where threads
| frequently synchronize using locks or barriers: if threads spin
| on a lock held by a descheduled thread, they will waste even more
| CPU time, causing cascading effects on the entire application’s
| performance. Some applications do not scale ideally to 64 cores
| and are thus a bit less impacted by the bug. The minimum slowdown
| is 4x.

The bug is only encountered if cores are disabled and re-enabled,
though, and I have no idea whether that might have happened on your
machine.  Since you're on a vulnerable kernel version, you might
want to be aware of the issue and take care not to trigger the
problem.

You are only vulnerable to the Group Imbalance bug if you use
autogroups.  You are only vulnerable to the Scheduling Group
Construction bug if you have more than one hop from any core to any
memory segment (which seems quite unlikely with 4 sockets and 4
memory nodes).

If you are vulnerable to any of the above, it might explain some of
the odd variations.  Let me know and I'll see if I can find more on
workarounds or OS patches.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] Jean-Pierre Lozi, Baptiste Lepers, Justin Funston, Fabien Gaud,
Vivien Quéma, Alexandra Fedorova. The Linux Scheduler: a Decade
of Wasted Cores.  In Proceedings of the 11th European
Conference on Computer Systems, EuroSys’16. April, 2016,
London, UK.
http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

2016-05-10 Thread Simon Riggs
On 10 May 2016 at 17:20, Andres Freund  wrote:

> On 2016-05-10 12:28:57 +0200, Simon Riggs wrote:
> > On 10 May 2016 at 09:05, Andres Freund  wrote:
> >
> >
> > > Is anybody ready with a good defense for SatisfiesToast not doing any
> > > actual liveliness checks?
> > >
> >
> > I provided a patch earlier that rechecks the OID fetched from a toast
> chunk
> > matches the OID requested.
>
> They match in this case, so that's not likely to help with the issue at
> hand?
>

I can see that, but I was answering your more general question above with what
I had available.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] what to revert

2016-05-10 Thread Tomas Vondra

Hi,

On 05/10/2016 07:36 PM, Robert Haas wrote:

On Tue, May 10, 2016 at 12:31 PM, Tomas Vondra
 wrote:

The following table shows the differences between the disabled and reverted
cases like this:

   sum('reverted' results with N clients)
   --------------------------------------  -  1.0
   sum('disabled' results with N clients)

for each scale/client count combination. So for example 4.83% means that with a
single client on the smallest data set, the sum of the 5 runs for reverted
was about 1.0483x that for disabled.

scale        1       16       32       64      128
100       4.83%   2.84%    1.21%    1.16%    3.85%
3000      1.97%   0.83%    1.78%    0.09%    7.70%
10000    -6.94%  -5.24%  -12.98%   -3.02%   -8.78%


/me scratches head.

That doesn't seem like noise, but I don't understand the
scale-factor-10000 results either.  Reverting the patch makes the code
smaller and removes instructions from critical paths, so it should
speed things up at least nominally.  The question is whether it makes
enough difference that anyone cares.  However, removing unused code
shouldn't make the system *slower*, but that's what's happening here
at the higher scale factor.


/me scratches head too



I've seen cases where adding dummy instructions to critical paths
slows things down at 1 client and speeds them up with many clients.
That happens because the percentage of time active processes fighting
over the critical locks goes down, which reduces contention more than
enough to compensate for the cost of executing the dummy
instructions. If your results showed performance lower at 1 client
and slightly higher at many clients, I'd suspect an effect of that
sort. But I can't see why it should depend on the scale factor. That
suggests that, perhaps, it's having some effect on the impact of
buffer eviction, maybe due to a difference in shared memory layout.
But I thought we weren't supposed to have such artifacts any more
now that we start every allocation on a cache line boundary...



I think we should look for issues in the testing procedure first, 
perhaps try to reproduce it on a different system. Another possibility 
is that the revert is not perfectly correct - the code compiles and does 
not crash, but maybe there's a subtle issue somewhere.


I'll try to collect some additional info (detailed info from sar, 
aggregated transaction log, ...) for further analysis. And also increase 
the number of runs, so that we can better compare all the separate 
combinations.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

2016-05-10 Thread Alvaro Herrera
Tom Lane wrote:
> Robert Haas  writes:
> > On Tue, May 10, 2016 at 12:19 PM, Andres Freund  wrote:
> >> It's not super likely, yea. But you don't really need to "use" 4 billion
> >> oids to get a wraparound. Once you have a significant number of values
> >> in various toast tables, the oid counter progresses really rather fast,
> >> without many writes. That's because the oid counter is global, but each
> >> individual toast write (and other things), perform checks via
> >> GetNewOidWithIndex().
> 
> > Understood.
> 
> Sooner or later we are going to need to go to 8-byte TOAST object
> identifiers.  Maybe we should think about doing that sooner not later
> rather than trying to invent some anti-wraparound solution here.

Umm, it seems to me like we need this fixed in supported branches, not
just 9.7, so I don't think 8-byte toast IDs are a reasonable solution at
this point.

> In principle, you could support existing TOAST tables and pointers
> containing 4-byte IDs in parallel with the new ones.  Not sure how
> pg_upgrade would handle it exactly though.

I suppose the real problem is that there's no way to have a mix of 4-
and 8-byte identifiers in the same toast table.  I suppose we could have
two toast tables for each regular table, but that sounds pretty painful.

-- 
Álvaro Herrera    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

2016-05-10 Thread Tom Lane
Robert Haas  writes:
> On Tue, May 10, 2016 at 12:19 PM, Andres Freund  wrote:
>> It's not super likely, yea. But you don't really need to "use" 4 billion
>> oids to get a wraparound. Once you have a significant number of values
>> in various toast tables, the oid counter progresses really rather fast,
>> without many writes. That's because the oid counter is global, but each
>> individual toast write (and other things), perform checks via
>> GetNewOidWithIndex().

> Understood.

Sooner or later we are going to need to go to 8-byte TOAST object
identifiers.  Maybe we should think about doing that sooner not later
rather than trying to invent some anti-wraparound solution here.

In principle, you could support existing TOAST tables and pointers
containing 4-byte IDs in parallel with the new ones.  Not sure how
pg_upgrade would handle it exactly though.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_dump dump catalog ACLs

2016-05-10 Thread Stephen Frost
* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
> Stephen Frost wrote:
> 
> > diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
> > index 1267afb..4a9b1bf 100644
> > --- a/src/bin/pg_dump/pg_dump.c
> > +++ b/src/bin/pg_dump/pg_dump.c
> > @@ -14992,9 +14992,10 @@ dumpTable(Archive *fout, TableInfo *tbinfo)
> >   "%s AS initrattacl "
> >   "FROM 
> > pg_catalog.pg_attribute at "
> >"JOIN pg_catalog.pg_class c ON 
> > (at.attrelid = c.oid) "
> > - "LEFT JOIN 
> > pg_init_privs pip ON "
> > + "LEFT JOIN 
> > pg_catalog.pg_init_privs pip ON "
> >   "(pip.classoid = "
> > -"(SELECT oid FROM pg_class WHERE relname = 
> > 'pg_class') AND "
> > + "(SELECT oid FROM 
> > pg_catalog.pg_class "
> > + "WHERE relname = 
> > 'pg_class') AND "
> >" at.attrelid = pip.objoid AND at.attnum = 
> > pip.objsubid) "
> >   "WHERE at.attrelid = 
> > '%u' AND "
> >   "NOT at.attisdropped "
> 
> The subquery comparing the OID of pg_class using only a condition on
> relname seems wrong; wouldn't it fail or produce wrong results if
> somebody creates a table named pg_class in another schema?  I think you
> should write the comparison like this instead:
>   classoid = 'pg_catalog.pg_class'::regclass

Errr, I could have sworn I was doing that, but clearly I'm not.

Will fix.

Thanks!

Stephen




Re: [HACKERS] pg_dump dump catalog ACLs

2016-05-10 Thread Alvaro Herrera
Stephen Frost wrote:

> diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
> index 1267afb..4a9b1bf 100644
> --- a/src/bin/pg_dump/pg_dump.c
> +++ b/src/bin/pg_dump/pg_dump.c
> @@ -14992,9 +14992,10 @@ dumpTable(Archive *fout, TableInfo *tbinfo)
> "%s AS initrattacl "
> "FROM 
> pg_catalog.pg_attribute at "
>  "JOIN pg_catalog.pg_class c ON 
> (at.attrelid = c.oid) "
> -   "LEFT JOIN 
> pg_init_privs pip ON "
> +   "LEFT JOIN 
> pg_catalog.pg_init_privs pip ON "
> "(pip.classoid = "
> -  "(SELECT oid FROM pg_class WHERE relname = 
> 'pg_class') AND "
> +   "(SELECT oid FROM 
> pg_catalog.pg_class "
> +   "WHERE relname = 
> 'pg_class') AND "
>  " at.attrelid = pip.objoid AND at.attnum = 
> pip.objsubid) "
> "WHERE at.attrelid = 
> '%u' AND "
> "NOT at.attisdropped "

The subquery comparing the OID of pg_class using only a condition on
relname seems wrong; wouldn't it fail or produce wrong results if
somebody creates a table named pg_class in another schema?  I think you
should write the comparison like this instead:
  classoid = 'pg_catalog.pg_class'::regclass

-- 
Álvaro Herrera    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

2016-05-10 Thread Robert Haas
On Tue, May 10, 2016 at 12:19 PM, Andres Freund  wrote:
>> I assume that this was installed as a performance optimization, and I
>> don't really see why it shouldn't be or be able to be made safe.  I
>> assume that the wraparound case was deemed safe because at that time
>> the idea of 4 billion OIDs getting used with old transactions still
>> active seemed inconceivable.
>
> It's not super likely, yea. But you don't really need to "use" 4 billion
> oids to get a wraparound. Once you have a significant number of values
> in various toast tables, the oid counter progresses really rather fast,
> without many writes. That's because the oid counter is global, but each
> individual toast write (and other things), perform checks via
> GetNewOidWithIndex().

Understood.
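As a rough illustration of the quoted point, here is a toy 32-bit counter (plain Python; OidCounter and its wraparound handling are invented simplifications of the real OID allocator, omitting FirstNormalObjectId handling and the per-relation duplicate checks):

```python
OID_MASK = 0xFFFFFFFF  # OIDs are 32-bit, shared by a single global counter

class OidCounter:
    """Toy stand-in for the global OID counter; wraps past zero."""
    def __init__(self, start=1):
        self.next_oid = start

    def get_new_oid(self):
        oid = self.next_oid
        self.next_oid = (self.next_oid + 1) & OID_MASK
        if self.next_oid == 0:   # skip InvalidOid; real handling is richer
            self.next_oid = 1
        return oid

ctr = OidCounter(start=1)
first = ctr.get_new_oid()        # chunk_id handed to the first toast write
ctr.next_oid = OID_MASK          # simulate ~4 billion intervening allocations
ctr.get_new_oid()                # consumes 0xFFFFFFFF, counter wraps
# After the wrap, the very same chunk_id is handed out again:
assert ctr.get_new_oid() == first
```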

> I'm not sure why you think it's safe? Consider the following scenario:
>
> BEGIN;
> -- nextoid: 1
> INSERT toastval = chunk_id = 1;
> ROLLBACK;
> ...
> -- oid counter wraps around
> -- nextoid: 1
> INSERT toastval = chunk_id = 1;

Uh oh.  That's really bad.  As long as we vacuum the table every 4
billion OID counter uses, and no long-running transactions prevent us
from doing cleanup, there will be no issue of this sort, but there's
no guarantee of those things being true on modern hardware.  And I
doubt we want to have a relmindeadoid to go with relfrozenxid and
relminmxid.  Our last round of convincing VACUUM to take into account
one more kind of wraparound prevention wasn't loads of fun.   Not to
mention the fact that this could wrap FAR faster.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] what to revert

2016-05-10 Thread Robert Haas
On Tue, May 10, 2016 at 12:31 PM, Tomas Vondra wrote:
> The following table shows the differences between the disabled and reverted
> cases like this:
>
> sum('reverted' results with N clients) / sum('disabled' results with N clients) - 1.0
>
> for each scale/client count combination. So for example 4.83% means with a
> single client on the smallest data set, the sum of the 5 runs for reverted
> was about 1.0483x that for disabled.
>
> scale        1       16       32      64      128
> 100       4.83%    2.84%    1.21%   1.16%    3.85%
> 3000      1.97%    0.83%    1.78%   0.09%    7.70%
> 10000    -6.94%   -5.24%  -12.98%  -3.02%   -8.78%

/me scratches head.

That doesn't seem like noise, but I don't understand the
scale-factor-1 results either.  Reverting the patch makes the code
smaller and removes instructions from critical paths, so it should
speed things up at least nominally.  The question is whether it makes
enough difference that anyone cares.  However, removing unused code
shouldn't make the system *slower*, but that's what's happening here
at the higher scale factor.

I've seen cases where adding dummy instructions to critical paths
slows things down at 1 client and speeds them up with many clients.
That happens because the percentage of time active processes spend fighting
over the critical locks goes down, which reduces contention more than
enough to compensate for the cost of executing the dummy instructions.
If your results showed performance lower at 1 client and slightly
higher at many clients, I'd suspect an effect of that sort.  But I
can't see why it should depend on the scale factor.  That suggests
that, perhaps, it's having some effect on the impact of buffer
eviction, maybe due to a difference in shared memory layout.  But I
thought we weren't supposed to have such artifacts any more now that
we start every allocation on a cache line boundary...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Robert Haas
On Tue, May 10, 2016 at 3:00 AM, konstantin knizhnik wrote:
> What's wrong with a worker being blocked? You can just have more workers
> (more than CPU cores) so that the others can continue to do useful work.

Not really.  The workers are all running the same plan, so they'll all
make the same decision about which node needs to be executed next.  If
that node can't accommodate multiple processes trying to execute it at
the same time, it will have to block all of them but the first one.
Adding more processes just increases the number of processes sitting
around doing nothing.
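A toy model of that effect (illustrative Python; the lock stands in for a plan node that only admits one executing process at a time):

```python
import threading

node_lock = threading.Lock()   # a node only one process may execute at once
inside = []                    # processes currently "executing" the node
max_inside = 0                 # high-water mark of concurrent executors

def worker():
    global max_inside
    with node_lock:            # every worker arrives here; all but one wait
        inside.append(1)
        max_inside = max(max_inside, len(inside))
        inside.pop()

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Eight workers were launched, but the node never ran more than one of
# them at a time -- the extras just sat blocked on the lock.
assert max_inside == 1
```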

> But there is some research, for example:
>
> http://www.vldb.org/pvldb/vol4/p539-neumann.pdf
>
> showing that the same or an even better effect can be achieved by generating
> native code for the query execution plan (which is not so difficult now,
> thanks to LLVM).
> It eliminates interpretation overhead and increases cache locality.
> I got similar results in my own experiments accelerating SparkSQL.
> Instead of native code generation I used conversion of query plans to C code
> and experimented with different data representations. A "horizontal model"
> with loading of columns on demand showed better performance than a columnar
> store.

Yes, I think this approach should also be considered.

> As of February they had implemented translation of only a few
> PostgreSQL operators used by ExecQual and did not support aggregates.
> They get about a 2x speedup on synthetic queries and a 25%
> increase on TPC-H Q1 (for Q1 the most critical part is generating native
> code for aggregates, because ExecQual itself takes only 6% of the time for
> this query).
> Actually these 25% for Q1 were achieved not by using dynamic code
> generation, but by switching from a PULL to a PUSH model in the executor.
> It seems to be yet another interesting PostgreSQL executor transformation.
> As far as I know, they are going to publish the results of their work as
> open source...

Interesting.  You may notice that in "asynchronous mode" my prototype
works using a push model of sorts.  Maybe that should be taken
further.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Robert Haas
On Mon, May 9, 2016 at 9:38 PM, Kouhei Kaigai  wrote:
> Is the parallel aware Append node sufficient to run multiple nodes
> asynchronously? (Sorry, I haven't had enough time to code the feature,
> even though we discussed it before.)

It's tempting to think that parallel query and asynchronous query are
the same thing, but I think that they are actually quite different.
Parallel query involves using multiple processes to service a query.
Asynchronous query involves using each individual process as
efficiently as possible by not having it block any more than
necessary.  You can want these things together or separately.  For
example, consider this query plan:

Append
-> ForeignScan
-> ForeignScan

Here, you do not want parallel query; the queries must both be
launched by the user backend, not some worker process, else you will
not get the right transaction semantics.  The parallel-aware Append
node we talked about before does not help here.  On the other hand,
consider this:

Append
  -> Seq Scan
   Filter: lots_of_cpu()
  -> Seq Scan
   Filter: lots_of_cpu()

Here, asynchronous query is of no help, but parallel query is great.
Now consider this hypothetical plan:

Gather
-> Hash Join
  -> Parallel Bitmap Heap Scan
-> Bitmap Index Scan
  -> Parallel Hash
-> Parallel Seq Scan

Let's assume that the bitmap *heap* scan can be performed in parallel
but the bitmap *index* scan can't.  That's pretty reasonable for a
first cut, actually, because the index accesses are probably touching
much less data, and we're doing little CPU work for each index page -
any delay here is likely to be I/O.

So in that world what you want is for the first worker to begin
performing the bitmap index scan and building a shared TIDBitmap
that the workers can use to scan the heap.  The other workers,
meanwhile, could begin building the shared hash table (which is what I
intend to denote by saying that it's a *Parallel* Hash).  If the
process building the bitmap finishes before the hash table is built,
it can help build the hash table as well.  Once both operations are
done, each process can scan a subset of the rows from the bitmap heap
scan and do the hash table probes for just those rows.  To make all of
this work, you need both *parallel* query, so that you have workers,
and also *asynchronous* query, so that workers which see that the
bitmap index scan is in progress don't get stuck waiting for it but
can look around for other work.

> In the above example, a scan on a foreign table has a longer lead time than
> a local scan. If Append can map every child node onto an individual worker,
> the local scan's worker begins to return tuples first, and then mixed
> tuples are returned eventually.

This is getting at an issue I'm somewhat worried about, which is
scheduling.  In my prototype, we only check for events if there are no
nodes ready for the CPU now, but sometimes that might be a loser.  One
probably needs to check for events periodically even when there are
still nodes waiting for the CPU, but "how often?" seems like an
unanswerable question.
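The policy trade-off can be sketched as a tiny event loop (illustrative Python; run, ready, and waiting are invented names): with poll_every=None it only checks for events when nothing is runnable, which is the prototype's behaviour, and poll_every=K is the "check periodically" alternative.

```python
from collections import deque

def run(ready, waiting, poll_every=None):
    """ready: deque of runnable node names.
    waiting: dict mapping a node name to the number of steps until its
    event fires.  Returns the order in which nodes were executed."""
    order, step = [], 0
    while ready or waiting:
        step += 1
        # Poll for events only when idle (prototype), or every K steps.
        must_poll = not ready or (poll_every and step % poll_every == 0)
        if must_poll:
            for node in list(waiting):
                waiting[node] -= 1
                if waiting[node] <= 0:
                    del waiting[node]
                    ready.append(node)
            if not ready:
                continue
        order.append(ready.popleft())
    return order

# Polling only when idle: every CPU-ready node runs before the node
# waiting on an event is ever noticed, even though its event fired early.
assert run(deque(["a", "b", "c"]), {"io": 1}) == ["a", "b", "c", "io"]
```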

> However, process-internal asynchronous execution may also be beneficial
> in cases where the cost of shm_mq is not negligible (e.g., no scan
> qualifiers are given to the worker process). I think it allows
> pre-fetching to be implemented very naturally.

Yes.

>> My proposal for how to do this is to make ExecProcNode function as a
>> backward-compatibility wrapper.  For asynchronous execution, a node
>> might return a not-ready-yet indication, but if that node is called
>> via ExecProcNode, it means the caller isn't prepared to receive such
>> an indication, so ExecProcNode will just wait for the node to become
>> ready and then return the tuple.
>>
> Backward compatibility is good. In addition, child node may want to
> know the context when it is called. It may want to switch the behavior
> according to the caller's expectation. For example, it may be beneficial
> if SeqScan makes more aggressive prefetching on asynchronous execution.

Maybe, but I'm a bit doubtful.  I'm not seeing a lot of advantage in
that sort of thing, and it will make the code a lot more complicated.

> Also, can we consider which data format will be returned from the child
> node during the planning stage? It affects the cost of inter-node
> data exchange. If a pair of parent and child nodes supports a
> special data format (as the existing HashJoin and Hash do), that could
> be a discount factor in cost estimation.

I'm not sure.  The costing aspects of this need a lot more thought
than I have given them so far.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Stopping logical replication protocol

2016-05-10 Thread Craig Ringer
On 11 May 2016 at 01:15, Vladimir Gordiychuk  wrote:

> In which release can the first part be included?
>

Since it's not a bug fix, I don't think it can go in before 9.7.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Stopping logical replication protocol

2016-05-10 Thread Vladimir Gordiychuk
In which release can the first part be included?

2016-05-10 15:15 GMT+03:00 Craig Ringer :

> On 10 May 2016 at 19:41, Vladimir Gordiychuk  wrote:
>
>>
>> Fair enough. Though I don't understand why you'd be doing this often
>>> enough that you'd care about reopening connections. What is the problem you
>>> are trying to solve with this? The underlying reason you need this change?
>>>
>>
>> The first reason is a clean API in pgjdbc. The second reason is the
>> ability to roll back quickly to one of the previous LSNs and replay from
>> it. My use case for the second reason: I have a logical decoding
>> extension that prepares data only as key-value pairs, without
>> information about the operation (insert, update, delete); for example,
>> for a delete I use the table's primary key as the key and null as the
>> value. A receiver connects via pgjdbc over the replication protocol, a
>> consumer groups changes into batches and sends them to Kafka. If some
>> problem occurs during delivery to the Kafka consumer, I need to stop the
>> current replication, go back to the last successful LSN and replay all
>> messages from that LSN. I think this operation can be quite common, but
>> reopening the connection or not stopping replication will increase latency.
>>
>
> It will, but not tons. The biggest cost (at least if there are any long
> running xacts) will be replaying from the restart_lsn when you restart the
> decoding session, which will happen whether you reconnect or just stop
> decoding and return to command mode.
>
>
>
>> Anyway, here's a draft patch that does the parts other than the reorder
>>> buffer processing stop. It passes 'make check' and the src/test/recovery
>>> tests, but I haven't written anything to verify the client-initiated abort
>>> handling. You have test code for this and I'd be interested to see the
>>> results.
>>>
>>
>> What about the keepalive packet once we have already sent/received
>> CopyDone? Is it really necessary?
>>
>
> No, I don't think it is, and I applied the change you made to suppress it.
>
>
>> I think it's worth making the next step, where you allow reorder buffer
>>> commit processing to be interrupted, into a separate patch on top of this
>>> one. They're two separate changes IMO.
>>>
>>
>> Will we continue in the current thread, or a new one? I'm interested in
>> both patches, for my solution and for the pgjdbc driver.
>>
>
> Same thread, I just think these are two somewhat separate changes. One is
> just in the walsender and allows return to command mode during waiting for
> WAL. The other is more intrusive into the reorder buffer etc and allows
> aborting decoding during commit processing. So two separate patches make
> sense here IMO, one on top of the other.
>
> I use git and just "git format-patch -2" to prepare a stack of two patches
> from two separate commits.
>
> --
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


Re: [HACKERS] pg_dump dump catalog ACLs

2016-05-10 Thread Stephen Frost
* Noah Misch (n...@leadboat.com) wrote:
> On Tue, Apr 05, 2016 at 05:50:18PM -0400, Stephen Frost wrote:
> > @@ -14187,18 +14869,65 @@ dumpTable(Archive *fout, TableInfo *tbinfo)
> 
> > + "FROM pg_catalog.pg_attribute at "
> > +  "JOIN pg_catalog.pg_class c ON (at.attrelid = c.oid) "
> > + "LEFT JOIN pg_init_privs pip ON "
> 
> Since pg_attribute and pg_class require schema qualification here, so does
> pg_init_privs.  Likewise elsewhere in the patch's pg_dump changes.

Attached is a patch to qualify the tables used in that query.  I
reviewed all of the other pg_init_privs uses and they all match the
qualification level of the other tables in those queries.  This isn't
too surprising as this is the only query which happens outside of the
"get*()" functions.

This patch doesn't do anything but qualify the table usage and that
query is checked by the pg_dump regression suite and ran fine for me,
so, if there are no objections, I'll push this later on this afternoon.

Thanks!

Stephen
From b8032e96ad1865a824ea2feb2f2a34f9b29ac6ac Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Tue, 10 May 2016 12:55:11 -0400
Subject: [PATCH] Qualify table usage in dumpTable()

All of the other tables used in the query in dumpTable(), which is
collecting column-level ACLs, are qualified, so we should be qualifying
the pg_init_privs and related sub-select against pg_class too.

Pointed out by Noah, patch by me.
---
 src/bin/pg_dump/pg_dump.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 1267afb..4a9b1bf 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -14992,9 +14992,10 @@ dumpTable(Archive *fout, TableInfo *tbinfo)
 			  "%s AS initrattacl "
 			  "FROM pg_catalog.pg_attribute at "
 	   "JOIN pg_catalog.pg_class c ON (at.attrelid = c.oid) "
-			  "LEFT JOIN pg_init_privs pip ON "
+			  "LEFT JOIN pg_catalog.pg_init_privs pip ON "
 			  "(pip.classoid = "
- "(SELECT oid FROM pg_class WHERE relname = 'pg_class') AND "
+			  "(SELECT oid FROM pg_catalog.pg_class "
+			  "WHERE relname = 'pg_class') AND "
    " at.attrelid = pip.objoid AND at.attnum = pip.objsubid) "
 			  "WHERE at.attrelid = '%u' AND "
 			  "NOT at.attisdropped "
-- 
2.5.0





Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Robert Haas
On Mon, May 9, 2016 at 8:34 PM, David Rowley wrote:
> It's interesting that you mention this. We identified this as a pain
> point during our work on column stores last year. Simply passing
> single tuples around the executor is really unfriendly towards L1
> instruction cache, plus also the points you mention about L3 cache and
> hash tables and tuple stores. I really think that we're likely to see
> significant gains by processing >1 tuple at a time, so this topic very
> much interests me.

Cool.  I hope we can work together on it, and with anyone else who is
interested.

> When we start multiplying those increases with the increases with
> something like parallel query then we're starting to see very nice
> gains in performance.

Check.

> Alvaro, Tomas and I had been discussing this and late last year I did
> look into what would be required to allow this to happen in Postgres.
> Basically there's 2 sub-projects, I'll describe what I've managed to
> learn so far about each, and the rough plan that I have to implement
> them:
>
> 1. Batch Execution:
>
> a. Modify ScanAPI to allow batch tuple fetching in predefined batch sizes.
> b. Modify TupleTableSlot to allow > 1 tuple to be stored. Add flag to
> indicate if the struct contains a single or a multiple tuples.
> Multiple tuples may need to be deformed in a non-lazy fashion in order
> to prevent too many buffers from having to be pinned at once. Tuples
> will be deformed into arrays of each column rather than arrays for
> each tuple (this part is important to support the next sub-project)
> c. Modify some nodes (perhaps start with nodeAgg.c) to allow them to
> process a batch TupleTableSlot. This will require some tight loop to
> aggregate the entire TupleTableSlot at once before returning.
> d. Add function in execAmi.c which returns true or false depending on
> if the node supports batch TupleTableSlots or not.
> e. At executor startup determine if the entire plan tree supports
> batch TupleTableSlots, if so enable batch scan mode.

I'm wondering if we should instead have a whole new kind of node, a
TupleTableVector, say.  Things that want to work one tuple at a time
can continue to do so with no additional overhead.  Things that want
to return batches can do it via this new result type.
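A loose sketch of what (b) and (c) from the quoted plan might look like (illustrative Python; BatchSlot and agg_sum_batch are invented names, not proposed executor APIs):

```python
class BatchSlot:
    """Holds N tuples column-wise: columns[i] is the datum array for
    column i, which is the layout needed for the later SIMD sub-project."""
    def __init__(self, columns):
        self.columns = columns
        self.ntuples = len(columns[0]) if columns else 0

def agg_sum_batch(slot, colno):
    """Aggregate an entire batch in one tight loop, instead of one
    node-to-node call per tuple."""
    total = 0
    for v in slot.columns[colno]:
        total += v
    return total

# 4 tuples with 2 columns, deformed into one array per column.
batch = BatchSlot([[1, 2, 3, 4], [10, 20, 30, 40]])
assert batch.ntuples == 4
assert agg_sum_batch(batch, 1) == 100
```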

> node types, which *may* not be all that difficult. I'm also assuming
> that batch mode (in all cases apart from queries with LIMIT or
> cursors) will always be faster than tuple-at-a-time, so requires no
> costings from the planner.

I definitely agree that we need to consider cases with and without
LIMIT separately, but there's more than one way to get a LIMIT.  For
example, a subquery returns only one row; a semijoin returns only one
row.  I don't think you are going to escape planner considerations.

Nested Loop Semi Join
-> Seq Scan
-> Index Scan on dont_batch_here

> 2. Vector processing
>
> (I admit that I've given this part much less thought so far, but
> here's what I have in mind)
>
> This depends on batch execution, and is intended to allow the executor
> to perform function calls to an entire batch at once, rather than
> tuple-at-a-time. For example, let's take the following example;
>
> SELECT a+b FROM t;
>
> here (as of now) we'd scan "t" one row at a time and perform a+b after
> having deformed enough of the tuple to do that. We'd then go and get
> another Tuple from the scan node and repeat until the scan gave us no
> more Tuples.
>
> With batch execution we'd fetch multiple Tuples from the scan and we'd
> then perform the call to say int4_pl() multiple times, which still
> kinda sucks as it means calling int4_pl() possibly millions of times
> (once per tuple). The vector mode here would require that we modify
> pg_operator to add a vector function for each operator so that we can
> call the function passing in an array of Datums and a length to have
> SIMD operations perform the addition, so we'd call something like
> int4_pl_vector() only once per batch of tuples allowing the CPU to
> perform SIMD operations on those datum arrays. This could be done in
> an incremental way as the code could just callback on the standard
> function in cases where a vectorised version of it is not available.
> Thought is needed here about when exactly this decision is made as the
> user may not have permissions to execute the vector function, so it
> can't simply be a run time check.  These functions would simply return
> another vector of the results.  Aggregates could be given a vector
> transition function, where something like COUNT(*)'s vector_transfn
> would simply just current_count += vector_length;
>
> This project does appear to require that we bloat the code with 100's
> of vector versions of each function. I'm not quite sure if there's a
> better way to handle this. The problem is that the fmgr is pretty much
> a barrier to SIMD operations, and this was the only idea that I've had
> so far about breaking through that 
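The vectorised-operator idea can be sketched like this (illustrative Python; int4_pl_vector and count_star_vector_transfn are invented stand-ins, not actual catalog entries):

```python
def int4_pl(a, b):
    """Scalar operator: one fmgr-style call per tuple."""
    return a + b

def int4_pl_vector(av, bv):
    """Vector variant: one call per batch of datums; a real version
    would be a SIMD loop in C."""
    return [a + b for a, b in zip(av, bv)]

def count_star_vector_transfn(current_count, vector_length):
    """COUNT(*)'s vector transition: just add the batch length."""
    return current_count + vector_length

a_col = [1, 2, 3, 4]
b_col = [10, 20, 30, 40]

# Same result either way, but the vector form is a single call for the
# whole batch instead of len(a_col) separate operator invocations.
assert int4_pl_vector(a_col, b_col) == [int4_pl(a, b) for a, b in zip(a_col, b_col)]
assert count_star_vector_transfn(0, len(a_col)) == 4
```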

Re: [HACKERS] what to revert

2016-05-10 Thread Tomas Vondra



On 05/10/2016 03:04 PM, Kevin Grittner wrote:

On Tue, May 10, 2016 at 3:29 AM, Kevin Grittner wrote:


* The results are a bit noisy, but I think in general this shows
that for certain cases there's a clearly measurable difference
(up to 5%) between the "disabled" and "reverted" cases. This is
particularly visible on the smallest data set.


In some cases, the differences are in favor of disabled over
reverted.


There were 75 samples each of "disabled" and "reverted" in the
spreadsheet.  Averaging them all, I see this:

reverted:  290,660 TPS
disabled:  292,014 TPS


Well, that kinda assumes it's one large group. I was wondering whether 
the difference depends on some of the other factors (scale factor, 
number of clients), which is why I mentioned "for certain cases".


The other problem is averaging the difference like this overweights the 
results for large client counts. Also, it mixes results for different 
scales, which I think is pretty important.


The following table shows the differences between the disabled and 
reverted cases like this:


sum('reverted' results with N clients) / sum('disabled' results with N clients) - 1.0

for each scale/client count combination. So for example 4.83% means with 
a single client on the smallest data set, the sum of the 5 runs for 
reverted was about 1.0483x that for disabled.


scale        1       16       32      64      128
100       4.83%    2.84%    1.21%   1.16%    3.85%
3000      1.97%    0.83%    1.78%   0.09%    7.70%
10000    -6.94%   -5.24%  -12.98%  -3.02%   -8.78%

So on average for each scale:

scale     revert/disable
100        2.78%
3000       2.47%
10000     -7.39%

Of course, it still might be due to noise. But looking at the two tables 
that seems rather unlikely, I guess.
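For what it's worth, one way to put a number on "might be due to noise" is a Welch two-sample t statistic over the per-run tps values (pure-stdlib Python sketch; the sample values below are invented stand-ins, not the actual runs from this thread):

```python
from statistics import mean, variance

def welch_t(xs, ys):
    """Welch's t statistic for two independent samples with possibly
    unequal variances."""
    nx, ny = len(xs), len(ys)
    se2 = variance(xs) / nx + variance(ys) / ny
    return (mean(xs) - mean(ys)) / (se2 ** 0.5)

# Hypothetical per-run tps numbers for one scale/client combination.
reverted = [290100, 291000, 290500, 290800, 290900]
disabled = [292200, 291800, 292100, 291900, 292300]

t = welch_t(disabled, reverted)
# |t| well above ~2 suggests the gap is unlikely to be pure noise;
# with these made-up numbers it clearly is above 2.
assert abs(t) > 2
```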




That's a 0.46% overall increase in performance with the patch,
disabled, compared to reverting it.  I'm surprised that you
consider that to be a "clearly measurable difference".  I mean, it
was measured and it is a difference, but it seems to be well within
the noise.  Even though it is based on 150 samples, I'm not sure we
should consider it statistically significant.


Well, luckily we're in the position that we can collect more data.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

2016-05-10 Thread Andres Freund
On 2016-05-10 10:09:07 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Is anybody ready with a good defense for SatisfiesToast not doing any
> > actual liveness checks?
> 
> As long as we do not update toast values after creation, there is no
> need; the liveness check on the parent tuple is what's important.
> Adding a liveness check on the toast item would merely create new
> failure modes with no corresponding benefit.

You mean besides not having a corrupted database like in this case?


> Imagine deciding that field 3 of a regular tuple was somehow dead even
> though the rest of the tuple is live --- how can that be good?

How would that ever happen in normal operation? And why are we checking hint
bits in HeapTupleSatisfiesToast() in that case?
if (!HeapTupleHeaderXminCommitted(tuple))
{
if (HeapTupleHeaderXminInvalid(tuple))
return false;

The only way I can see the scenario you're described happening is if
there's a bug somewhere else.


> I concur with Robert that what this looks like is failure to ensure
> that toast OIDs are unique, which is an entirely different problem.

Well, I did describe how this can come about due to a wraparound, and
how GetNewOidWithIndex isn't sufficient to prevent the problem, as it
only sets hint bits (which are by definition not durable)...

Greetings,

Andres Freund




Re: [HACKERS] what to revert

2016-05-10 Thread Alvaro Herrera
Tomas Vondra wrote:

> Yes, I'd like to repeat the tests with other workloads - I'm thinking about
> regular pgbench and perhaps something that'd qualify as 'mostly read-only'
> (not having a clear idea how that should work).

You can use "-bselect-only@9 -bsimple-update@1" for a workload that's
10% write transactions and 90% read-only transactions.  If you
have custom .sql scripts you can use the @ with -f too.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

2016-05-10 Thread Andres Freund
On 2016-05-10 12:28:57 +0200, Simon Riggs wrote:
> On 10 May 2016 at 09:05, Andres Freund  wrote:
> 
> 
> > Is anybody ready with a good defense for SatisfiesToast not doing any
> > actual liveness checks?
> >
> 
> I provided a patch earlier that rechecks the OID fetched from a toast chunk
> matches the OID requested.

They match in this case, so that's not likely to help with the issue at
hand?

- Andres




Re: [HACKERS] HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

2016-05-10 Thread Andres Freund
On 2016-05-10 08:09:02 -0400, Robert Haas wrote:
> On Tue, May 10, 2016 at 3:05 AM, Andres Freund  wrote:
> > The easy way to trigger this problem would be to have an oid wraparound
> > - but the WAL shows that that's not the case here.  I've not figured
> > that one out entirely (and won't tonight). But I do see WAL records
> > like:
> > rmgr: XLOG len (rec/tot):  4/30, tx:  0, lsn: 2/12004018, prev 2/12003288, desc: NEXTOID 4302693
> > rmgr: XLOG len (rec/tot):  4/30, tx:  0, lsn: 2/1327EA08, prev 2/1327DC60, desc: NEXTOID 4302693
> > i.e. two NEXTOID records allocating the same range, which obviously
> > doesn't seem right.  There's also every now and then close by ranges:
> > rmgr: XLOG len (rec/tot):  4/30, tx:  0, lsn: 1/9A404DB8, prev 1/9A404270, desc: NEXTOID 3311455
> > rmgr: XLOG len (rec/tot):  4/30, tx:  7814505, lsn: 1/9A4EC888, prev 1/9A4EB9D0, desc: NEXTOID 3311461
> >
> >
> > As far as I can see something like the above, or an oid wraparound, are
> > pretty much deadly for toast.
> >
> > Is anybody ready with a good defense for SatisfiesToast not doing any
> > actual liveness checks?
> 
> I assume that this was installed as a performance optimization, and I
> don't really see why it shouldn't be or be able to be made safe.  I
> assume that the wraparound case was deemed safe because at that time
> the idea of 4 billion OIDs getting used with old transactions still
> active seemed inconceivable.

It's not super likely, yea. But you don't really need to "use" 4 billion
oids to get a wraparound. Once you have a significant number of values
in various toast tables, the oid counter progresses really rather fast,
without many writes. That's because the oid counter is global, but each
individual toast write (and other things), perform checks via
GetNewOidWithIndex().

I'm not sure why you think it's safe? Consider the following scenario:

BEGIN;
-- nextoid: 1
INSERT toastval = chunk_id = 1;
ROLLBACK;
...
-- oid counter wraps around
-- nextoid: 1
INSERT toastval = chunk_id = 1;

-- crash, losing all pending hint bits

SELECT toastval;

The last SELECT might find either of the toasted data chunks, depending
on heap ordering. As they're not hinted as invalid due to the crash,
HeapTupleSatisfiesToast() will return both as visible.


To make that safe we'd at least make hint bit writes by the scan in
GetNewOidWithIndex() durable, and likely also disable the killtuples
optimization; to avoid a plain SELECT of the toast table to make some
tuples unreachable, but not durably hinted.


That seems fairly fragile. I've a significant amount of doubt that toast
reads are bottlenecked by visibility routines.
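The failure sequence can be modelled in a few lines (illustrative Python; the tuple dicts and hint strings are invented simplifications of the hint-bit-only check quoted earlier):

```python
def satisfies_toast(tup):
    """Mirrors the quoted check: trust the hint bit alone, no real
    liveness check against the clog/xact state."""
    if tup["hint"] == "xmin_invalid":
        return False
    return True

# Two toast chunks share chunk_id 1 after an OID wraparound.
aborted = {"chunk_id": 1, "hint": "xmin_invalid"}  # hinted after ROLLBACK
live    = {"chunk_id": 1, "hint": None}

# Before the crash, the hint keeps the aborted chunk invisible.
assert [t for t in (aborted, live) if satisfies_toast(t)] == [live]

# Crash: hint bits were never durably written, so they come back unset.
aborted["hint"] = None
visible = [t for t in (aborted, live) if satisfies_toast(t)]
assert len(visible) == 2   # both chunks now look live -> corruption
```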


> It seems to me that the real question
> here is how you're getting two calls to XLogPutNextOid() with the same
> value of ShmemVariableCache->nextOid, and the answer, as it seems to
> me, must be that LWLocks are broken.

There likely were a bunch of crashes in between; Jeff's test suite
triggers them at a high rate. That seems a lot more likely than an
lwlock bug that only materializes in the oid counter.  Investigating.

Greetings,

Andres Freund




Re: [HACKERS] what to revert

2016-05-10 Thread Tomas Vondra

Hi,

On 05/10/2016 10:29 AM, Kevin Grittner wrote:

On Mon, May 9, 2016 at 9:01 PM, Tomas Vondra wrote:


Over the past few days I've been running benchmarks on a fairly
large NUMA box (4 sockets, 32 cores / 64 with HR, 256GB of RAM)
to see the impact of the 'snapshot too old' - both when disabled
and enabled with various values in the old_snapshot_threshold
GUC.


Thanks!


The benchmark is a simple read-only pgbench with prepared
statements, i.e. doing something like this:

   pgbench -S -M prepared -j N -c N


Do you have any plans to benchmark cases where the patch can have a
benefit?  (Clearly, nobody would be interested in using the feature
with a read-only load; so while that makes a good "worst case"
scenario and is very valuable for testing the "off" versus
"reverted" comparison, it's not an intended use or one that's
likely to happen in production.)


Yes, I'd like to repeat the tests with other workloads - I'm thinking 
about regular pgbench and perhaps something that'd qualify as 'mostly 
read-only' (though I don't have a clear idea yet of how that should work).


Feel free to propose other tests.




master-10-new - 91fd1df4 + old_snapshot_threshold=10
master-10-new-2 - 91fd1df4 + old_snapshot_threshold=10 (rerun)


So, these runs were with identical software on the same data? Any
differences are just noise?


Yes, same config. The differences are either noise or something 
unexpected (like the sudden drops of tps with high client counts).



* The results are a bit noisy, but I think in general this shows
that for certain cases there's a clearly measurable difference
(up to 5%) between the "disabled" and "reverted" cases. This is
particularly visible on the smallest data set.


In some cases, the differences are in favor of disabled over
reverted.


Well, that's a good question. I think the results for higher client 
counts (>=64) are fairly noisy, so in those cases it may easily be just 
due to noise. For the lower client counts it seems to be much less 
noisy, though.





* What's fairly strange is that on the largest dataset (scale
1), the "disabled" case is actually consistently faster than
"reverted" - that seems a bit suspicious, I think. It's possible
that I did the revert wrong, though - the revert.patch is
included in the tgz. This is why I also tested 689f9a05, but
that's also slower than "disabled".


Since there is not a consistent win of disabled or reverted over
the other, and what difference there is is often far less than the
difference between the two runs with identical software, is there
any reasonable interpretation of this except that the difference is
"in the noise"?


Are we both looking at the results for scale 1? I think there's a 
pretty clear difference between "disabled" and "reverted" (or 689f9a05, 
for that matter). The gap is also much larger compared to the two 
"identical" runs (ignoring the runs with 128 clients).





* The performance impact with the feature enabled seems rather
significant, especially once you exceed the number of physical
cores (32 in this case). Then the drop is pretty clear - often
~50% or more.

* 7e3da1c4 claims to bring the performance within 5% of the
disabled case, but that seems not to be the case.


The commit comment says "At least in the tested case this brings
performance within 5% of when the feature is off, compared to
several times slower without this patch."  The tested case was a
read-write load, so your read-only tests do nothing to determine
whether this was the case in general for this type of load.
Partly, the patch decreases chasing through HOT chains and
increases the number of HOT updates, so there are compensating
benefits of performing early vacuum in a read-write load.


OK. Sadly the commit message does not mention what the tested case was, 
so I wasn't really sure ...





What it does do, however, is bring the 'non-immediate' cases close
to the immediate ones (before, the performance drop came much
sooner in those cases - at 16 clients).


Right.  This is, of course, just the first optimization, that we
were able to get in "under the wire" before beta, but the other
optimizations under consideration would only tend to bring the
"enabled" cases closer together in performance, not make an enabled
case perform the same as when the feature was off -- especially for
a read-only workload.


OK




* It also seems to me that the feature greatly amplifies the
variability of the results, somehow. It's not uncommon to see
results like this:

 master-10-new-2  235516  331976  133316  155563  133396

where after the first runs (already fairly variable) the
performance tanks to ~50%. This happens particularly with higher
client counts, otherwise the max-min is within ~5% of the max.
There are a few cases where this happens without the feature
(i.e. old master, reverted or disabled), but it's usually much
smaller than with it enabled (immediate, 10 or 60). 
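One way to make "amplifies the variability" concrete is to compare the relative spread (max - min) / max across runs of one configuration; a quick sketch with illustrative numbers (not the actual measurements):

```python
def relative_spread(samples):
    """Return (max - min) / max for a list of tps samples."""
    hi, lo = max(samples), min(samples)
    return (hi - lo) / hi

# Illustrative numbers only: a stable series vs. one that "tanks to ~50%".
stable = [292014, 290660, 293500, 289900]
tanking = [235516, 331976, 133316, 155563, 133396]

print(round(relative_spread(stable), 3))   # small, within a few percent
print(round(relative_spread(tanking), 3))  # large, well over 50%
```

A threshold like "max-min within ~5% of the max" then corresponds to relative_spread below 0.05.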

Re: [HACKERS] what to revert

2016-05-10 Thread Kevin Grittner
On Tue, May 10, 2016 at 9:02 AM, Tom Lane  wrote:
> Kevin Grittner  writes:
>> There were 75 samples each of "disabled" and "reverted" in the
>> spreadsheet.  Averaging them all, I see this:
>
>> reverted:  290,660 TPS
>> disabled:  292,014 TPS
>
>> That's a 0.46% overall increase in performance with the patch,
>> disabled, compared to reverting it.  I'm surprised that you
>> consider that to be a "clearly measurable difference".  I mean, it
>> was measured and it is a difference, but it seems to be well within
>> the noise.  Even though it is based on 150 samples, I'm not sure we
>> should consider it statistically significant.
>
> You don't have to guess about that --- compare it to the standard
> deviation within each group.

My statistics skills are rusty, but I thought that just gives you
an effect size, not any idea of whether the effect is statistically
significant.

Does anyone with sharper skills in this area than I want to opine
on whether there is a statistically significant difference between
the numbers on "master-default-disabled" lines and "master-revert"
lines in the old_snap.ods file attached to an earlier post on this
thread?
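For what it's worth, the check Tom suggests amounts to a two-sample test; here is a rough sketch of Welch's t statistic in pure Python, using synthetic samples (the means and the rough noise level are taken from the thread; the generated data is made up):

```python
import math
import random

def welch_t(a, b):
    """Welch's two-sample t statistic: difference of means divided by the
    combined standard error of the two groups."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # unbiased sample variance
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

# Synthetic stand-ins for the 75 "reverted" and 75 "disabled" samples:
# a 0.46% difference in means buried under ~5% run-to-run noise.
random.seed(42)
reverted = [random.gauss(290660, 14500) for _ in range(75)]
disabled = [random.gauss(292014, 14500) for _ in range(75)]

t = welch_t(disabled, reverted)
# With ~148 degrees of freedom, |t| below roughly 1.98 means the
# difference is not significant at the 5% level.
print(round(t, 2))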

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Reviewing freeze map code

2016-05-10 Thread Robert Haas
On Mon, May 9, 2016 at 7:40 PM, Ants Aasma  wrote:
> On Mon, May 9, 2016 at 10:53 PM, Robert Haas  wrote:
>> On Sun, May 8, 2016 at 10:42 PM, Masahiko Sawada  
>> wrote:
>>> Attached draft patch adds SCANALL option to VACUUM in order to scan
>>> all pages forcibly while ignoring visibility map information.
>>> The option name is SCANALL for now but we could change it after got 
>>> consensus.
>>
>> If we're going to go that way, I'd say it should be scan_all rather
>> than scanall.  Makes it clearer, at least IMHO.
>
> Just to add some diversity to opinions, maybe there should be a
> separate command for performing integrity checks. Currently the best
> ways to actually verify database correctness do so as a side effect.
> The question that I get pretty much every time after I explain why we
> have data checksums, is "how do I check that they are correct" and we
> don't have a nice answer for that now. We could also use some ways to
> sniff out corrupted rows that don't involve crashing the server in a
> loop. Vacuuming pages that supposedly don't need vacuuming just to
> verify integrity seems very much in the same vein.
>
> I know right now isn't exactly the best time to hastily slap on such a
> feature, but I just wanted the thought to be out there for
> consideration.

I think that it's quite reasonable to have ways of performing an
integrity check that are separate from VACUUM, but this is about
having a way to force VACUUM to scan all-frozen pages - and it's hard
to imagine that we want a different command name for that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

2016-05-10 Thread Tom Lane
Andres Freund  writes:
> Is anybody ready with a good defense for SatisfiesToast not doing any
> actual liveliness checks?

As long as we do not update toast values after creation, there is no
need; the liveness check on the parent tuple is what's important.
Adding a liveness check on the toast item would merely create new
failure modes with no corresponding benefit.  Imagine deciding
that field 3 of a regular tuple was somehow dead even though the
rest of the tuple is live --- how can that be good?

I concur with Robert that what this looks like is failure to ensure
that toast OIDs are unique, which is an entirely different problem.

regards, tom lane




Re: [HACKERS] what to revert

2016-05-10 Thread Tom Lane
Kevin Grittner  writes:
> There were 75 samples each of "disabled" and "reverted" in the
> spreadsheet.  Averaging them all, I see this:

> reverted:  290,660 TPS
> disabled:  292,014 TPS

> That's a 0.46% overall increase in performance with the patch,
> disabled, compared to reverting it.  I'm surprised that you
> consider that to be a "clearly measurable difference".  I mean, it
> was measured and it is a difference, but it seems to be well within
> the noise.  Even though it is based on 150 samples, I'm not sure we
> should consider it statistically significant.

You don't have to guess about that --- compare it to the standard
deviation within each group.

regards, tom lane




Re: [HACKERS] what to revert

2016-05-10 Thread Kevin Grittner
On Tue, May 10, 2016 at 3:29 AM, Kevin Grittner  wrote:

>> * The results are a bit noisy, but I think in general this shows
>> that for certain cases there's a clearly measurable difference
>> (up to 5%) between the "disabled" and "reverted" cases. This is
>> particularly visible on the smallest data set.
>
> In some cases, the differences are in favor of disabled over
> reverted.

There were 75 samples each of "disabled" and "reverted" in the
spreadsheet.  Averaging them all, I see this:

reverted:  290,660 TPS
disabled:  292,014 TPS

That's a 0.46% overall increase in performance with the patch,
disabled, compared to reverting it.  I'm surprised that you
consider that to be a "clearly measurable difference".  I mean, it
was measured and it is a difference, but it seems to be well within
the noise.  Even though it is based on 150 samples, I'm not sure we
should consider it statistically significant.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] GIN logging GIN_SEGMENT_UNMODIFIED actions?

2016-05-10 Thread Alexander Korotkov
Hi!

On Mon, May 9, 2016 at 10:46 PM, Andres Freund  wrote:

> trying to debug something I saw the following in pg_xlogdump output:
>
> rmgr: Gin len (rec/tot):  0/   274, tx:  0, lsn:
> 1C/DF28AEB0, prev 1C/DF289858, desc: VACUUM_DATA_LEAF_PAGE  3 segments: 5
> unknown action 0 ???, blkref #0: rel 1663/16384/16435 blk 310982
>
> note the "segments: 5 unknown action 0 ???" bit.  That doesn't seem
> right, given:
> #define GIN_SEGMENT_UNMODIFIED  0   /* no action (not used in
> WAL records) */
>

I've checked the GIN code.  I have no idea how such a WAL record could be
generated...
The only idea I have is to add a check that we're inserting a valid WAL
record immediately before XLogInsert().

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Patch for German translation

2016-05-10 Thread Robert Haas
On Mon, May 9, 2016 at 10:25 AM, Christian Ullrich  wrote:
> here is a patch for the German translation that removes (all) five instances
> of *the* most annoying mistake ever.

Wow, glad it got fixed.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Stopping logical replication protocol

2016-05-10 Thread Craig Ringer
On 10 May 2016 at 19:41, Vladimir Gordiychuk  wrote:

>
> Fair enough. Though I don't understand why you'd be doing this often
>> enough that you'd care about reopening connections. What is the problem you
>> are trying to solve with this? The underlying reason you need this change?
>>
>
> The first reason is a clean API in pgjdbc. The second reason is the
> ability to quickly roll back to one of the previous LSNs and replay
> from it. My use case for the second reason: I have a logical decoding
> extension that prepares the data only as key-value pairs, without
> information about the operation (insert, update, delete); for example,
> for a delete I use the table's primary key as the key and null as the
> value. A receiver connects via pgjdbc over the replication protocol, a
> consumer groups changes into batches and sends them to Kafka. If some
> problem occurs during delivery to the Kafka consumer, I have to stop
> the current replication, go back to the last successful LSN, and
> replay all messages from that LSN. I think this operation can be quite
> common, but reopening the connection, or not stopping replication,
> will increase latency.
>

It will, but not by much. The biggest cost (at least if there are any
long-running xacts) will be replaying from the restart_lsn when you restart
the decoding session, which will happen whether you reconnect or just stop
decoding and return to command mode.



> Anyway, here's a draft patch that does the parts other than the reorder
>> buffer processing stop. It passes 'make check' and the src/test/recovery
>> tests, but I haven't written anything to verify the client-initiated abort
>> handling. You have test code for this and I'd be interested to see the
>> results.
>>
>
> What about the keepalive packet when we have already sent/received
> CopyDone? Is it really necessary?
>

No, I don't think it is, and I applied the change you made to suppress it.


> I think it's worth making the next step, where you allow reorder buffer
>> commit processing to be interrupted, into a separate patch on top of this
>> one. They're two separate changes IMO.
>>
>
> Will we continue in the current thread, or a new one? I'm interested in
> both patches, for my solution and for the pgjdbc driver.
>

Same thread, I just think these are two somewhat separate changes. One is
just in the walsender and allows return to command mode during waiting for
WAL. The other is more intrusive into the reorder buffer etc and allows
aborting decoding during commit processing. So two separate patches make
sense here IMO, one on top of the other.

I use git and just "git format-patch -2" to prepare a stack of two patches
from two separate commits.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


[HACKERS] Hash Indexes

2016-05-10 Thread Amit Kapila
For making hash indexes usable in production systems, we need to improve
their concurrency and make them crash-safe by WAL logging them.  The first
problem I would like to tackle is improving the concurrency of hash
indexes.  The first advantage I see in improving the concurrency of hash
indexes is that they have the potential of outperforming btree for "equal
to" searches (with my WIP patch attached to this mail, I could see a hash
index outperform a btree index by 20 to 30% for very simple cases, which
are mentioned later in this e-mail).  Another advantage, as explained by
Robert [1] earlier, is that if we remove the heavyweight locks under which
we perform an arbitrarily large number of operations, it can help us to
sensibly WAL log it.  With this patch, I would also like to make hash
indexes capable of completing incomplete splits, which can occur due to
interrupts (like cancel) or errors or crashes.

I have studied the concurrency problems of hash indexes and some of the
solutions previously proposed for them, and based on that came up with the
solution below, which is based on an idea by Robert [1], the community
discussion on thread [2], and some of my own thoughts.

Maintain a flag that can be set and cleared on the primary bucket page,
call it split-in-progress, and a flag that can optionally be set on
particular index tuples, call it moved-by-split. We will allow scans of all
buckets and insertions into all buckets while the split is in progress, but
(as now) we will not allow more than one split for a bucket to be in
progress at the same time.  We start the split by updating the metapage to
increment the number of buckets and setting the split-in-progress flag in
the primary bucket pages of the old and new buckets (let's number them as
old bucket - (N+1)/2; new bucket - N+1 for the sake of discussion). While the
split-in-progress flag is set, any scans of N+1 will first scan that
bucket, ignoring any tuples flagged moved-by-split, and then ALSO scan
bucket N+1/2. To ensure that vacuum doesn't clean any tuples from old or
new buckets till this scan is in progress, maintain a pin on both of the
buckets (first pin on old bucket needs to be acquired). The moved-by-split
flag never has any effect except when scanning the new bucket that existed
at the start of that particular scan, and then only if
the split-in-progress flag was also set at that time.

Once the split operation has set the split-in-progress flag, it will begin
scanning bucket (N+1)/2.  Every time it finds a tuple that properly belongs
in bucket N+1, it will insert the tuple into bucket N+1 with the
moved-by-split flag set.  Tuples inserted by anything other than a split
operation will leave this flag clear, and tuples inserted while the split
is in progress will target the same bucket that they would hit if the split
were already complete.  Thus, bucket N+1 will end up with a mix
of moved-by-split tuples, coming from bucket (N+1)/2, and unflagged tuples
coming from parallel insertion activity.  When the scan of bucket (N+1)/2
is complete, we know that bucket N+1 now contains all the tuples that are
supposed to be there, so we clear the split-in-progress flag on both
buckets.  Future scans of both buckets can proceed normally.  Split
operation needs to take a cleanup lock on primary bucket to ensure that it
doesn't start if there is any Insertion happening in the bucket.  It will
leave the lock on primary bucket, but not pin as it proceeds for next
overflow page.  Retaining pin on primary bucket will ensure that vacuum
doesn't start on this bucket till the split is finished.

Insertion will happen by scanning the appropriate bucket and needs to
retain pin on primary bucket to ensure that concurrent split doesn't
happen, otherwise split might leave this tuple unaccounted.

Now for deletion of tuples from (N+1/2) bucket, we need to wait for the
completion of any scans that began before we finished populating bucket
N+1, because otherwise we might remove tuples that they're still expecting
to find in bucket (N+1)/2. The scan will always maintain a pin on primary
bucket and Vacuum can take a buffer cleanup lock (cleanup lock includes
Exclusive lock on bucket and wait till all the pins on buffer becomes zero)
on primary bucket for the buffer.  I think we can relax the requirement for
vacuum to take cleanup lock (instead take Exclusive Lock on buckets where
no split has happened) with the additional flag has_garbage which will be
set on the primary bucket if any tuples have been moved from that bucket.
However, I think for the squeeze phase of vacuum (in this phase, we try to
move the tuples from later overflow pages to earlier overflow pages in the
bucket, and then if there are any empty overflow pages, we move them to a
kind of free pool), we need a cleanup lock, otherwise scan results might
get affected.
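As a sanity check of the scan/split interaction described above, here is a toy Python model (the two-bucket layout, function names, and the even/odd stand-in "hash" are all invented for illustration; locking and pins are omitted):

```python
# Tuples are (key, moved_by_split). "old" stands for bucket (N+1)/2,
# "new" for bucket N+1.

def make_index():
    return {"old": [], "new": [], "split_in_progress": False}

def belongs_in_new(key):
    # Stand-in for the hash function deciding placement after the split.
    return key % 2 == 1

def start_split(idx):
    # Set split-in-progress and copy qualifying tuples into the new
    # bucket with the moved-by-split flag set; originals stay in place.
    idx["split_in_progress"] = True
    for key, _ in list(idx["old"]):
        if belongs_in_new(key):
            idx["new"].append((key, True))

def finish_split(idx):
    # Remove moved tuples from the old bucket and clear all flags.
    idx["old"] = [(k, f) for k, f in idx["old"] if not belongs_in_new(k)]
    idx["new"] = [(k, False) for k, _ in idx["new"]]
    idx["split_in_progress"] = False

def insert_key(idx, key):
    # Insertions during the split target the bucket the tuple would hit
    # if the split were already complete, with the flag clear.
    bucket = "new" if belongs_in_new(key) else "old"
    idx[bucket].append((key, False))

def scan(idx, key):
    bucket = "new" if belongs_in_new(key) else "old"
    hits = [k for k, moved in idx[bucket] if k == key and not moved]
    if bucket == "new" and idx["split_in_progress"]:
        # Scans of the new bucket ignore moved-by-split tuples and then
        # ALSO scan the old bucket, so each tuple is seen exactly once.
        hits += [k for k, _ in idx["old"] if k == key]
    return hits

idx = make_index()
idx["old"] = [(1, False), (2, False), (3, False)]  # pre-split contents
start_split(idx)
print(scan(idx, 1))   # [1]: moved copy ignored, found via old bucket
insert_key(idx, 5)    # concurrent insert during the split
print(scan(idx, 5))   # [5]
finish_split(idx)
print(scan(idx, 3))   # [3]: now found directly in the new bucket
```

The point the model makes is that a scan running at any stage of the split sees each matching tuple exactly once, never zero or two times.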

Incomplete Splits
--
Incomplete splits can be completed either by vacuum or insert as both needs
exclusive lock on bucket.  If vacuum finds split-in-progress 

Re: [HACKERS] HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

2016-05-10 Thread Robert Haas
On Tue, May 10, 2016 at 3:05 AM, Andres Freund  wrote:
> The easy way to trigger this problem would be to have an oid wraparound
> - but the WAL shows that that's not the case here.  I've not figured
> that one out entirely (and won't tonight). But I do see WAL records
> like:
> rmgr: XLOG len (rec/tot):  4/30, tx:  0, lsn: 
> 2/12004018, prev 2/12003288, desc: NEXTOID 4302693
> rmgr: XLOG len (rec/tot):  4/30, tx:  0, lsn: 
> 2/1327EA08, prev 2/1327DC60, desc: NEXTOID 4302693
> i.e. two NEXTOID records allocating the same range, which obviously
> doesn't seem right.  There's also every now and then close by ranges:
> rmgr: XLOG len (rec/tot):  4/30, tx:  0, lsn: 
> 1/9A404DB8, prev 1/9A404270, desc: NEXTOID 3311455
> rmgr: XLOG len (rec/tot):  4/30, tx:  7814505, lsn: 
> 1/9A4EC888, prev 1/9A4EB9D0, desc: NEXTOID 3311461
>
>
> As far as I can see something like the above, or an oid wraparound, are
> pretty much deadly for toast.
>
> Is anybody ready with a good defense for SatisfiesToast not doing any
> actual liveliness checks?

I assume that this was installed as a performance optimization, and I
don't really see why it shouldn't be or be able to be made safe.  I
assume that the wraparound case was deemed safe because at that time
the idea of 4 billion OIDs getting used with old transactions still
active seemed inconceivable.  It seems to me that the real question
here is how you're getting two calls to XLogPutNextOid() with the same
value of ShmemVariableCache->nextOid, and the answer, as it seems to
me, must be that LWLocks are broken.  Either two processes are
managing to hold OidGenLock in exclusive mode at the same time, or
they're acquiring it in quick succession but without the second
process seeing all of the updates performed by the first process.
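Robert's argument can be illustrated with a toy model of the allocator: OIDs are handed out under a lock, and a NEXTOID "record" is logged only when the prefetched range runs out, so with working locks the logged values must all be distinct and strictly increasing. A simplified sketch (the class and its behavior are invented for illustration; only the VAR_OID_PREFETCH constant and the record name follow the real code):

```python
import threading

VAR_OID_PREFETCH = 8192  # matches the constant in the real allocator

class OidGenerator:
    """Toy model: nextOid advances under a lock; a NEXTOID 'WAL record'
    is emitted whenever the prefetched range is exhausted."""
    def __init__(self, start=16384):
        self.next_oid = start
        self.available = 0
        self.lock = threading.Lock()
        self.wal = []  # values logged by our stand-in XLogPutNextOid()

    def get_new_oid(self):
        with self.lock:
            if self.available == 0:
                self.wal.append(self.next_oid + VAR_OID_PREFETCH)
                self.available = VAR_OID_PREFETCH
            oid = self.next_oid
            self.next_oid += 1
            self.available -= 1
            return oid

gen = OidGenerator()
threads = [threading.Thread(
               target=lambda: [gen.get_new_oid() for _ in range(10000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With working locks, no NEXTOID value is ever logged twice.
assert len(gen.wal) == len(set(gen.wal))
print(gen.wal)
```

If two identical NEXTOID values show up in the log of this model, either the lock is broken or a writer is not seeing the previous writer's update, which is exactly the dichotomy Robert describes for the real WAL trace.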

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Rajeev rastogi
On 09 May 2016 23:04, Robert Haas wrote:

>2. vectorized execution, by which I mean the ability of a node to return
>tuples in batches rather than one by one.  Andres has opined more than
>once that repeated trips through ExecProcNode defeat the ability of the
>CPU to do branch prediction correctly, slowing the whole system down,
>and that they also result in poor CPU cache behavior, since we jump all
>over the place executing a little bit of code from each node before
>moving onto the next rather than running one bit of code first, and then
>another later.  I think that's
>probably right.   For example, consider a 5-table join where all of
>the joins are implemented as hash tables.  If this query plan is going
>to be run to completion, it would make much more sense to fetch, say,
>100 tuples from the driving scan and then probe for all of those in the
>first hash table, and then probe for all of those in the second hash
>table, and so on.  What we do instead is fetch one tuple and probe for
>it in all 5 hash tables, and then repeat.  If one of those hash tables
>would fit in the CPU cache but all five together will not,
>that seems likely to be a lot worse.   But even just ignoring the CPU
>cache aspect of it for a minute, suppose you want to write a loop to
>perform a hash join.  The inner loop fetches the next tuple from the
>probe table and does a hash lookup.  Right now, fetching the next tuple
>from the probe table means calling a function which in turn calls
>another function which probably calls another function which probably
>calls another function and now about 4 layers down we actually get the
>next tuple.  If the scan returned a batch of tuples to the hash join,
>fetching the next tuple from the batch would probably be 0 or 1 function
>calls rather than ... more.  Admittedly, you've got to consider the cost
>of marshaling the batches but I'm optimistic that there are cycles to be
>squeezed out here.  We might also want to consider storing batches of
>tuples in a column-optimized rather than row-optimized format so that
>iterating through one or two attributes across every tuple in the batch
>touches the minimal number of cache lines.


This sounds like a really great idea in the direction of performance
improvement. I would like to share my thoughts based on our research work
in a similar area (mostly it may be as you have mentioned).
Our goal with this work was to:
1. Make the processing data-centric instead of operator-centric.
2. Instead of pulling each tuple from the immediate operator, let an
operator push tuples to its parent. Pushing can continue until reaching an
operator that cannot proceed without results from another operator.
3. Through the above two points, achieve better data locality.

e.g. we had done some quick prototyping (take it just as a thought
provoker) as mentioned below:
Query: select * from tbl1, tbl2, tbl3 where tbl1.a=tbl2.b and tbl2.b=tbl3.c;
For hash join:
    For each tuple t2 of tbl2
        Materialize a hash on tbl1.a = tbl2.b

    For each tuple t3 of tbl3
        Materialize a hash on tbl2.b = tbl3.c

    For each tuple t1 of tbl1
        Search in hash tbl1.a = tbl2.b
        Search in hash tbl2.b = tbl3.c
        Output t1*t2*t3

Of course, at each level, if there is any additional qual for the table,
the same can be applied.
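A runnable toy version of the hash-join pseudocode above, probing a whole batch of driving-scan tuples through each hash table in turn (as in Robert's batching idea) rather than one tuple through all tables; table and column names follow the example query, and the batch size is arbitrary:

```python
def build_hash(rows, key):
    """Materialize a hash table mapping key value -> list of rows."""
    h = {}
    for r in rows:
        h.setdefault(r[key], []).append(r)
    return h

def batched_join(tbl1, tbl2, tbl3, batch_size=100):
    # Build both hash tables up front, as in the pseudocode.
    h2 = build_hash(tbl2, "b")   # probed with tbl1.a
    h3 = build_hash(tbl3, "c")   # probed with tbl2.b
    out = []
    for i in range(0, len(tbl1), batch_size):
        batch = tbl1[i:i + batch_size]
        # Probe ALL tuples of the batch in h2 first, then in h3, so each
        # hash table stays hot in cache while it is being probed.
        matched2 = [(t1, t2) for t1 in batch for t2 in h2.get(t1["a"], [])]
        out += [(t1, t2, t3) for (t1, t2) in matched2
                for t3 in h3.get(t2["b"], [])]
    return out

tbl1 = [{"a": i} for i in range(5)]
tbl2 = [{"b": i} for i in range(3)]
tbl3 = [{"c": i} for i in range(4)]
print(len(batched_join(tbl1, tbl2, tbl3)))  # 3: keys 0, 1, 2 join through
```

The tuple-at-a-time alternative would probe each tbl1 row through h2 and h3 before fetching the next row; the output is identical, only the memory-access pattern differs.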

Similarly for Nested Loop Join, the plan tree can be processed in the form
of a post-order traversal of the tree:
    Scan the first relation (leftmost relation), store all tuples --> Outer
    Loop through all remaining scan-node relations (or some part of the
    total tuples), starting from the second one:
        Scan the current relation
        For each tuple, match with all tuples of the outer result, and
        build the combined tuple.
        Save all satisfied combined tuples --> Outer

The result we got was really impressive.

There is a paper by Thomas Neumann on this idea: 
http://www.vldb.org/pvldb/vol4/p539-neumann.pdf 

Note: VitesseDB has also implemented this approach for Hash Join along with 
compilation and their result is really impressive.

Thanks and Regards,
Kumar Rajeev Rastogi.
http://rajeevrastogi.blogspot.in/   



Re: [HACKERS] Stopping logical replication protocol

2016-05-10 Thread Vladimir Gordiychuk
> Fair enough. Though I don't understand why you'd be doing this often
> enough that you'd care about reopening connections. What is the problem you
> are trying to solve with this? The underlying reason you need this change?
>

The first reason is a clean API in pgjdbc. The second reason is the
ability to quickly roll back to one of the previous LSNs and replay from
it. My use case for the second reason: I have a logical decoding extension
that prepares the data only as key-value pairs, without information about
the operation (insert, update, delete); for example, for a delete I use
the table's primary key as the key and null as the value. A receiver
connects via pgjdbc over the replication protocol, a consumer groups
changes into batches and sends them to Kafka. If some problem occurs
during delivery to the Kafka consumer, I have to stop the current
replication, go back to the last successful LSN, and replay all messages
from that LSN. I think this operation can be quite common, but reopening
the connection, or not stopping replication, will increase latency.


Anyway, here's a draft patch that does the parts other than the reorder
> buffer processing stop. It passes 'make check' and the src/test/recovery
> tests, but I haven't written anything to verify the client-initiated abort
> handling. You have test code for this and I'd be interested to see the
> results.
>

What about the keepalive packet when we have already sent/received
CopyDone? Is it really necessary?

I measured the current fix without involving long transactions - only
starting replication and stopping it while there are no active
transactions on PostgreSQL - and got the following results:

*Before:*
12:35:31.403 (2)  FE=> StartReplication(query: START_REPLICATION SLOT
pgjdbc_logical_replication_slot LOGICAL 0/7287AC00 ("include-xids" 'false',
"skip-empty-xacts" 'true'))
12:35:31.403 (2)  FE=> Query(CopyStart)
12:35:31.404 (2)  <=BE CopyBothResponse
12:35:31.406 (2)  FE=> StopReplication
12:35:31.406 (2)  FE=> CopyDone
12:35:31.406 (2)  <=BE CopyData
12:35:31.407 (2) [107, 0, 0, 0, 0, 114, -121, -84, 0, 0, 1, -43, 120, 106,
53, -85, 21, 0]
12:35:31.407 (2)  <=BE CopyDone
12:35:31.407 (2)  <=BE CopyData
12:35:31.407 (2) [107, 0, 0, 0, 0, 114, -121, -84, 0, 0, 1, -43, 120, 106,
53, -76, 29, 0]
12:35:52.120 (2)  <=BE CommandStatus(COPY 0)
12:35:52.120 (2)  <=BE CommandStatus(SELECT)
12:35:52.120 (2)  <=BE ReadyForQuery(I)
12:35:52.126 (2)  FE=> Terminate

*Timing:*
Start and stopping time: 20729ms
Stopping time: 20720ms

*After:*
14:30:02.050 (2)  FE=> StartReplication(query: START_REPLICATION SLOT
pgjdbc_logical_replication_slot LOGICAL 0/728A06C0 ("include-xids" 'false',
"skip-empty-xacts" 'true'))
14:30:02.050 (2)  FE=> Query(CopyStart)
14:30:02.051 (2)  <=BE CopyBothResponse
14:30:02.056 (2)  FE=> StopReplication
14:30:02.057 (2)  FE=> CopyDone
14:30:02.057 (2)  <=BE CopyData
14:30:02.057 (2) [107, 0, 0, 0, 0, 114, -118, 6, -64, 0, 1, -43, 122, 3,
-69, 107, 76, 0]
14:30:02.058 (2)  <=BE CopyDone
14:30:02.058 (2)  <=BE CommandStatus(COPY 0)
14:30:02.058 (2)  <=BE CommandStatus(SELECT)
14:30:02.058 (2)  <=BE ReadyForQuery(I)
14:30:02.068 (2)  FE=> StandbyStatusUpdate(received: 0/728A06C0, flushed:
0/0, applied: 0/0, clock: Tue May 10 14:30:02 MSK 2016)

*Timing:*
Start and stopping time: 27ms
Stopping time: 11ms



So aborting processing and returning between individual changes in
> ReorderBufferCommit(...) seems very reasonable. I agree that some kind of
> callback is needed because the walsender uses static globals to control its
> send/receive stop logic. I don't like the naming "is_active"; maybe reverse
> the sense and call it "stop_decoding_cb" ? Or "continue_decoding_cb"?
> Unsure.


continue_decoding_cb sounds good.

I think it's worth making the next step, where you allow reorder buffer
> commit processing to be interrupted, into a separate patch on top of this
> one. They're two separate changes IMO.
>

Will we continue in the current thread, or a new one? I'm interested in
both patches, for my solution and for the pgjdbc driver.

2016-05-10 5:49 GMT+03:00 Craig Ringer :

> On 10 May 2016 at 09:50, Craig Ringer  wrote:
>
>
>> I outlined how I think WalSndWaitForWal() should be handled earlier.
>> Test streamingDoneReceiving and streamingDoneSending in 
>> logical_read_xlog_page
>> and return -1.
>>
>
> OK, so thinking about this some more, I see why you've added the callback
> within the reorder buffer code. You want to stop processing of a
> transaction after we've decoded the commit and are looping over the changes
> within ReorderBufferCommit(...), which doesn't know anything about the
> walsender. So we could loop for a long time within WalSndLoop ->
> XLogSendLogical -> LogicalDecodingProcessRecord if the record is a commit,
> as we process each change and send it to the client.
>
> So aborting processing and returning between individual changes in
> ReorderBufferCommit(...) seems very reasonable. I agree that some kind of
> callback is needed because the walsender uses static globals to control its
> send/receive stop logic. I don't like 

Re: [HACKERS] Declarative partitioning

2016-05-10 Thread Amit Langote

Hi Ashutosh,

On 2016/05/09 20:21, Ashutosh Bapat wrote:
> Hi Amit,
> I am trying multi-column/expression partitions.

Thanks for the tests.

> create table t1_multi_col (a int, b int) partition by range (a, b);
> create table t1_mc_p1 partition of t1_multi_col for values start (1, 200)
> end (100, 300);
> create table t1_mc_p2 partition of t1_multi_col for values start (200, 1)
> end (300, 100);
> insert into t1_multi_col values (1, 250);
> insert into t1_multi_col values (250, 1);
> insert into t1_multi_col values (100, 100);
> select tableoid::regclass, * from t1_multi_col;
>  tableoid |  a  |  b
> --+-+-
>  t1_mc_p1 |   1 | 250
>  t1_mc_p1 | 100 | 100
>  t1_mc_p2 | 250 |   1
> The row (100, 100) landed in t1_mc_p1 which has partition bounds as (1,
> 200) and (100, 300) which should not accept a row with b = 100. It looks
> like the binary search got confused with the reversed order of ranges
> (should that be allowed?)

It's useful to think of a multi-column key as defining ranges of composite
values (tuples), rather than ranges of values of individual columns.  That's
how a row's partition key is compared against individual partition bounds
until a suitable partition is found (with a binary search, that is), using
record comparison logic as shown below:

postgres=# select (1, 200) <= (100, 100) AND (100, 100) < (100, 300);
 ?column?
----------
 t
(1 row)

Which means the row (100, 100) belongs in the partition with start bound
(1, 200) and end bound (100, 300).  Just as with composite values,
comparison stops at the first leading column with a non-zero comparison
result; so in the comparison (1, 200) <= (100, 100), the second column
plays no role.
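That record-comparison rule is easy to sketch outside the server.  Below is a
toy Python model (invented names, not PostgreSQL code) of routing a row with a
binary search over composite start bounds; Python's lexicographic tuple
comparison stands in for record comparison:

```python
import bisect

# Partition bounds as composite (tuple) values, as in the example above:
# t1_mc_p1 = [ (1, 200), (100, 300) ),  t1_mc_p2 = [ (200, 1), (300, 100) )
partitions = [
    ("t1_mc_p1", (1, 200), (100, 300)),
    ("t1_mc_p2", (200, 1), (300, 100)),
]
starts = [start for _, start, _ in partitions]

def route(row_key):
    """Return the partition whose half-open range contains row_key.

    Python compares tuples lexicographically: comparison stops at the
    first differing column, exactly like the record comparison shown
    above, so (1, 200) <= (100, 100) holds because 1 < 100.
    """
    i = bisect.bisect_right(starts, row_key) - 1  # rightmost start <= key
    if i >= 0:
        name, start, end = partitions[i]
        if start <= row_key < end:
            return name
    return None  # no partition accepts this row

# (100, 100) lands in t1_mc_p1: the second column never gets looked at.
```

Running `route` on the three inserted rows reproduces the tableoid output shown
in the quoted session.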

> Semantics of multiple columns for ranges (maybe for list as well) looks

Note that a list partition key does not support multiple columns.

> confusing. The current scheme doesn't allow overlapping range for one of
> the partitioning keys even if the combined range is non-overlapping.
> create table t1_multi_col (a int, b int) partition by range (a, b);
> create table t1_mc_p1 partition of t1_multi_col for values start (1, 100)
> end (100, 200);
> create table t1_mc_p2 partition of t1_multi_col for values start (1, 200)
> end (100, 300);
> ERROR:  new partition's range overlaps with that of partition "t1_mc_p1" of
> "t1_multi_col"
> HINT:  Please specify a range that does not overlap with any existing
> partition's range.
> create table t1_mc_p2 partition of t1_multi_col for values start (1, 300)
> end (100, 400);
> ERROR:  new partition's range overlaps with that of partition "t1_mc_p1" of
> "t1_multi_col"
> HINT:  Please specify a range that does not overlap with any existing
> partition's range.

Ranges [ (1, 100), (100, 200) ) and [ (1, 200), (100, 300) ) do overlap:

postgres=# select (1, 100) <= (1, 200) AND (1, 200) < (100, 200);
 ?column?
----------
 t
(1 row)

That is, (1, 200) is the start bound of the second partition's range and is
also contained in the first partition's range, as illustrated above.
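The overlap test for two half-open composite ranges reduces to two tuple
comparisons.  A minimal sketch (hypothetical helper name, not the server's
code):

```python
def ranges_overlap(r1, r2):
    """Half-open ranges [s1, e1) and [s2, e2) over composite values
    overlap iff each range's start lies below the other range's end
    (lexicographic tuple comparison)."""
    (s1, e1), (s2, e2) = r1, r2
    return s1 < e2 and s2 < e1

t1_mc_p1 = ((1, 100), (100, 200))
t1_mc_p2 = ((1, 200), (100, 300))   # rejected: (1, 200) falls inside p1
t1_mc_p2b = ((1, 300), (100, 400))  # also rejected, for the same reason
```

Both attempted second partitions from the quoted session trip this check, which
is why the server raises the overlap error.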

> That should be better realised using subpartitioning on b. The question is,
> if one column's value is enough to identify partition (since they can not
> contain overlapping values for that column), why do we need multiple
> columns/expressions as partition keys? IIUC, all the other column does is
> to disallow certain range of values for that column, which can better be
> done by a CHECK constraint. It looks like Oracle looks at combined range
> and not just one column.

A more familiar example I have seen around the web illustrating
multi-column range partitioning is something like a (year, month, day)
triple.  Consider the following example:

create table parted(year int, month int, day int)
  partition by range (year, month, day);

create table part201605week1 partition of parted
  for values start (2016, 5, 1) end (2016, 5, 8);
create table part201605week2 partition of parted
  for values start (2016, 5, 8) end (2016, 5, 15);
create table part201605week3 partition of parted
  for values start (2016, 5, 15) end (2016, 5, 22);
create table part201605week4 partition of parted
  for values start (2016, 5, 22) end (2016, 5, 29);
create table part201605week5 partition of parted
  for values start (2016, 5, 29) end (2016, 5, 31) inclusive;

create table part201606week1 partition of parted
  for values start (2016, 6, 1) end (2016, 6, 8);
create table part201606week2 partition of parted
  for values start (2016, 6, 8) end (2016, 6, 15);
create table part201606week3 partition of parted
  for values start (2016, 6, 15) end (2016, 6, 22);
create table part201606week4 partition of parted
  for values start (2016, 6, 22) end (2016, 6, 29);
create table part201606week5 partition of parted
  for values start (2016, 6, 29) end (2016, 6, 30) inclusive;

explain (costs off) select * from parted where day between 4 and 10;
 

Re: [HACKERS] HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

2016-05-10 Thread Simon Riggs
On 10 May 2016 at 09:05, Andres Freund  wrote:


> Is anybody ready with a good defense for SatisfiesToast not doing any
> actual liveness checks?
>

I provided a patch earlier that rechecks that the OID fetched from a toast
chunk matches the OID requested.

I didn't commit it; I just used it to check the patch that changed btree
vacuum replay.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


toast_recheck.v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Kyotaro HORIGUCHI
Hello.

At Mon, 9 May 2016 13:33:55 -0400, Robert Haas  wrote in 

> Hi,
> 
> I realize that we haven't gotten 9.6beta1 out the door yet, but I
> think we can't really wait much longer to start having at least some
> discussion of 9.7 topics, so I'm going to go ahead and put this one
> out there.  I believe there are other people thinking about these
> topics as well, including Andres Freund, Kyotaro Horiguchi, and
> probably some folks at 2ndQuadrant (but I don't know exactly who).  To
> make a long story short, I think there are several different areas
> where we should consider major upgrades to our executor.  It's too
> slow and it doesn't do everything we want it to do.  The main things
> on my mind are:

> 1. asynchronous execution, by which I mean the ability of a node to
> somehow say that it will generate a tuple eventually, but is not yet
> ready, so that the executor can go run some other part of the plan
> tree while it waits.  This case most obviously arises for foreign
> tables, where it makes little sense to block on I/O if some other part
> of the query tree could benefit from the CPU; consider SELECT * FROM
> lt WHERE qual UNION SELECT * FROM ft WHERE qual.

This is my main concern and what I wanted to solve.

> It is also a problem
> for parallel query: in a parallel sequential scan, the next worker can
> begin reading the next block even if the current block hasn't yet been
> received from the OS.  Whether or not this will be efficient is a
> research question, but it can be done.  However, imagine a parallel
> scan of a btree index: we don't know what page to scan next until we
> read the previous page and examine the next-pointer.  In the meantime,
> any worker that arrives at that scan node has no choice but to block.
> It would be better if the scan node could instead say "hey, thanks for
> coming but I'm really not ready to be on-CPU just at the moment" and
> potentially allow the worker to go work in some other part of the
> query tree.

Especially for foreign tables, there are inevitably gaps between sending
a FETCH and receiving the result, and visiting other tables is a very
effective way to fill those gaps.  Waiting on file descriptors helps
greatly here, thanks to the new WaitEventSet API.  Attached is a WIP
proof of concept of that (sorry for including some debug code and
irrelevant code).  Its Exec* APIs differ a bit from the 0002 patch, and
it currently works only for postgres_fdw and Append.  It embeds the
waiting code into ExecAppend, but that is easily replaceable with the
framework in Robert's 0003 patch.

Apart from the core part: in postgres_fdw, several scans can reside
together on one connection.  These scans share the same FD, but there is
no way to identify which scan node the FD was signalled for.  To handle
that situation, we may need a 'seemed to be ready but really not' route.
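To make that 'seemed to be ready but really not' route concrete, here is a toy
model (invented names, not the postgres_fdw code) of several scans multiplexed
on one connection: results come back in FETCH order, so a readable FD only lets
the head-of-queue scan make progress, and any other scan has to report
not-ready.

```python
from collections import deque

class SharedConnection:
    """One connection (one FD) shared by several scan nodes.  Results
    arrive in the order FETCHes were sent, so a readable FD can only be
    consumed by the scan whose request is at the head of the queue."""

    def __init__(self):
        self.pending = deque()   # scan ids, in FETCH-send order
        self.results = deque()   # arrived result sets, same order

    def send_fetch(self, scan_id):
        self.pending.append(scan_id)

    def deliver(self, payload):
        # What a signalled FD means: some result has arrived.
        self.results.append(payload)

    def try_receive(self, scan_id):
        """Return this scan's result, or None -- the 'seemed to be ready
        but really not' case: the FD fired, but the data that arrived
        belongs to a different scan node on the same connection."""
        if self.results and self.pending and self.pending[0] == scan_id:
            self.pending.popleft()
            return self.results.popleft()
        return None

conn = SharedConnection()
conn.send_fetch("scan_a")
conn.send_fetch("scan_b")
conn.deliver("rows for scan_a")  # FD becomes readable for *both* scans
```

In this model scan_b wakes up, sees nothing it can consume, and must go back to
waiting, which is exactly the extra route described above.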

> For that worker to actually find useful work to do
> elsewhere, we'll probably need it to be the case either that the table
> is partitioned or the original query will need to involve UNION ALL,
> but those are not silly cases to worry about, particularly if we get
> native partitioning in 9.7.

One annoyance of this method is a node with one FD and latch-like data
draining.  Since we would have to provide FDs for such nodes, Gather may
need another data-passing channel over FDs.

And I want to realize early execution of async nodes.  This may require
that all node types return 'not ready' on the first call, even if they
are async-capable.

> 2. vectorized execution, by which I mean the ability of a node to
> return tuples in batches rather than one by one.  Andres has opined
> more than once that repeated trips through ExecProcNode defeat the
> ability of the CPU to do branch prediction correctly, slowing the
> whole system down, and that they also result in poor CPU cache
> behavior, since we jump all over the place executing a little bit of
> code from each node before moving onto the next rather than running
> one bit of code first, and then another later.  I think that's
> probably right.   For example, consider a 5-table join where all of
> the joins are implemented as hash tables.  If this query plan is going
> to be run to completion, it would make much more sense to fetch, say,
> 100 tuples from the driving scan and then probe for all of those in
> the first hash table, and then probe for all of those in the second
> hash table, and so on.  What we do instead is fetch one tuple and
> probe for it in all 5 hash tables, and then repeat.  If one of those
> hash tables would fit in the CPU cache but all five together will not,
> that seems likely to be a lot worse.   But even just ignoring the CPU
> cache aspect of it for a minute, suppose you want to write a loop to
> perform a hash join.  The inner loop fetches the next tuple from the
> probe table and does a hash lookup.  Right now, fetching the next
> tuple from the probe table means 
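The batching idea in the quoted paragraph can be shown with a toy model (plain
Python dicts standing in for hash tables; not executor code): both strategies
produce the same join result, but the batched version probes one table at a
time, so each table stays hot while its batch of probes runs.

```python
def join_tuple_at_a_time(rows, hash_tables):
    """Fetch one driving row, probe it through all hash tables, repeat --
    the current executor's shape: every table is touched per tuple."""
    out = []
    for key in rows:
        tup = (key,)
        ok = True
        for ht in hash_tables:
            if key not in ht:
                ok = False  # inner-join semantics: drop non-matching rows
                break
            tup = tup + (ht[key],)
        if ok:
            out.append(tup)
    return out

def join_batched(rows, hash_tables, batch=100):
    """Fetch a batch of driving rows, then probe the whole batch through
    hash table 1, then table 2, ... -- same output, but each table's
    probes run back to back."""
    out = []
    for i in range(0, len(rows), batch):
        tuples = [(key,) for key in rows[i:i + batch]]
        for ht in hash_tables:
            tuples = [t + (ht[t[0]],) for t in tuples if t[0] in ht]
        out.extend(tuples)
    return out
```

The point of the sketch is only the loop structure; the cache effect itself, of
course, only shows up with real hash tables larger than cache.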

Re: [HACKERS] between not propated into a simple equality join

2016-05-10 Thread Benedikt Grundmann
On Tue, May 10, 2016 at 7:41 AM, David Rowley 
wrote:

> On 10 May 2016 at 16:34, David G. Johnston 
> wrote:
> > On Mon, May 9, 2016 at 8:53 AM, Benedikt Grundmann
> >  wrote:
> >>
> >> We just run into a very simple query that the planner does much worse on
> >> than we thought it would (in production the table in question is ~ 100
> GB).
> >> It surprised us given the planner is generally quite good, so I thought
> I
> >> share our surprise
> >>
> >> Setup:
> >>
> >> postgres_prod@proddb_testing=# select version();[1]
> >> version
> >>
> >>
> 
> >>  PostgreSQL 9.2.16 on x86_64-unknown-linux-gnu, compiled by gcc (GCC)
> >> 4.4.7 20120313 (Red Hat 4.4.7-16), 64-bit
> >> (1 row)
> >>
> >> Time: 69.246 ms
> >>
> >> postgres_prod@proddb_testing=# create table toy_data3 (the_date date, i
> >> int);
> >> CREATE TABLE
> >> Time: 67.096 ms
> >> postgres_prod@proddb_testing=# insert into toy_data3
> >>   (select current_date-(s.idx/1000), s.idx from generate_series(1,1000000)
> >> as s(idx));
> >> INSERT 0 1000000
> >> Time: 1617.483 ms
> >> postgres_prod@proddb_testing=# create index toy_data_date3 on
> >> toy_data3(the_date);
> >> CREATE INDEX
> >> Time: 660.166 ms
> >> postgres_prod@proddb_testing=# analyze toy_data3;
> >> ANALYZE
> >> Time: 294.984 ms
> >>
> >> The bad behavior:
> >>
> >> postgres_prod@proddb_testing=# explain analyze
> >>   select * from (
> >>select td1.the_date, td1.i
> >> from toy_data3 td1, toy_data3 td2  where td1.the_date = td2.the_date
> >> and td1.i = td2.i
> >>   ) foo
> >>   where the_date between current_date and current_date;
> >>QUERY
> >> PLAN
> >>
> >>
> ───
> >>  Hash Join  (cost=55.49..21980.50 rows=1 width=8) (actual
> >> time=0.336..179.374 rows=999 loops=1)
> >>Hash Cond: ((td2.the_date = td1.the_date) AND (td2.i = td1.i))
> >>->  Seq Scan on toy_data3 td2  (cost=0.00..14425.00 rows=1000000
> >> width=8) (actual time=0.007..72.510 rows=1000000 loops=1)
> >>->  Hash  (cost=40.44..40.44 rows=1003 width=8) (actual
> >> time=0.321..0.321 rows=999 loops=1)
> >>  Buckets: 1024  Batches: 1  Memory Usage: 40kB
> >>  ->  Index Scan using toy_data_date3 on toy_data3 td1
> >> (cost=0.01..40.44 rows=1003 width=8) (actual time=0.018.
> >>Index Cond: ((the_date >= ('now'::cstring)::date) AND
> >> (the_date <= ('now'::cstring)::date))
> >>  Total runtime: 179.440 ms
> >> (8 rows)
> >>
> >> Time: 246.094 ms
> >>
> >> Notice the red (the seq scan on td2).  Which is sad, because one would
> >> like the planner to realize that it could propagate the index constraint
> >> onto td2 -- that is, do the green (an index scan) on both sides of the
> >> join.
> >>
> >
> > FWIW
> >
> > This is my plan result:
> > version
> > PostgreSQL 9.5.2 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
> > 4.8.2-19ubuntu1) 4.8.2, 64-bit
> > All default settings
> >
> > using "BETWEEN"
> >
> >  QUERY PLAN
> > Nested Loop  (cost=0.86..48.91 rows=1 width=8) (actual
> time=0.042..168.512
> > rows=999 loops=1)
> >   ->  Index Scan using toy_data_date3 on toy_data3 td1  (cost=0.43..8.46
> > rows=1 width=8) (actual time=0.022..1.388 rows=999 loops=1)
> > Index Cond: ((the_date >= ('now'::cstring)::date) AND (the_date
> <=
> > ('now'::cstring)::date))
> >   ->  Index Scan using toy_data_date3 on toy_data3 td2  (cost=0.42..40.44
> > rows=1 width=8) (actual time=0.078..0.160 rows=1 loops=999)
> > Index Cond: (the_date = td1.the_date)
> > Filter: (td1.i = i)
> > Rows Removed by Filter: 998
> > Planning time: 0.353 ms
> > Execution time: 169.692 ms
> >
> > using "="
> >
> > QUERY PLAN
> > Hash Join  (cost=49.89..90.46 rows=1 width=8) (actual time=2.320..5.652
> > rows=999 loops=1)
> >   Hash Cond: (td1.i = td2.i)
> >   ->  Index Scan using toy_data_date3 on toy_data3 td1  (cost=0.43..37.37
> > rows=967 width=8) (actual time=0.014..1.168 rows=999 loops=1)
> > Index Cond: (the_date = ('now'::cstring)::date)
> >   ->  Hash  (cost=37.37..37.37 rows=967 width=8) (actual
> time=2.292..2.292
> > rows=999 loops=1)
> > Buckets: 1024  Batches: 1  Memory Usage: 48kB
> > ->  Index Scan using toy_data_date3 on toy_data3 td2
> > (cost=0.43..37.37 rows=967 width=8) (actual time=0.008..1.183 rows=999
> > loops=1)
> >   Index Cond: (the_date = ('now'::cstring)::date)
> > Planning time: 0.326 ms
> > Execution time: 6.673 ms
> >
> > I was hoping to be able to say more but alas cannot find the words.
> >
> > I'm surprised by the estimate of 1 rows for the td1 index scan in my 9.5
> > query - and also why the 9.2 query would choose to sequential scan hash

Re: [HACKERS] what to revert

2016-05-10 Thread Kevin Grittner
On Mon, May 9, 2016 at 9:01 PM, Tomas Vondra 
wrote:

> Over the past few days I've been running benchmarks on a fairly
> large NUMA box (4 sockets, 32 cores / 64 with HR, 256GB of RAM)
> to see the impact of the 'snapshot too old' - both when disabled
> and enabled with various values in the old_snapshot_threshold
> GUC.

Thanks!

> The benchmark is a simple read-only pgbench with prepared
> statements, i.e. doing something like this:
>
>pgbench -S -M prepared -j N -c N

Do you have any plans to benchmark cases where the patch can have a
benefit?  (Clearly, nobody would be interested in using the feature
with a read-only load; so while that makes a good "worst case"
scenario and is very valuable for testing the "off" versus
"reverted" comparison, it's not an intended use or one that's
likely to happen in production.)

> master-10-new - 91fd1df4 + old_snapshot_threshold=10
> master-10-new-2 - 91fd1df4 + old_snapshot_threshold=10 (rerun)

So, these runs were with identical software on the same data?  Any
differences are just noise?

> * The results are a bit noisy, but I think in general this shows
> that for certain cases there's a clearly measurable difference
> (up to 5%) between the "disabled" and "reverted" cases. This is
> particularly visible on the smallest data set.

In some cases, the differences are in favor of disabled over
reverted.

> * What's fairly strange is that on the largest dataset (scale
> 1), the "disabled" case is actually consistently faster than
> "reverted" - that seems a bit suspicious, I think. It's possible
> that I did the revert wrong, though - the revert.patch is
> included in the tgz. This is why I also tested 689f9a05, but
> that's also slower than "disabled".

Since there is not a consistent win of disabled or reverted over
the other, and what difference there is is often far less than the
difference between the two runs with identical software, is there
any reasonable interpretation of this except that the difference is
"in the noise"?
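One way to put a number on "in the noise" is Welch's unequal-variance t-test,
which scales the difference of means by the per-group variance.  A
self-contained sketch with hypothetical tps samples (the real samples are in
the posted spreadsheet):

```python
import math
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two independent
    samples.  |t| well below ~2 means the difference of means does not
    stand out from the within-group variance."""
    va, vb = stdev(a) ** 2 / len(a), stdev(b) ** 2 / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom:
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical samples: a ~0.5% difference of means buried in a few
# percent of run-to-run variation, roughly the shape of the reported data.
reverted = [290660 + d for d in (-9000, -3000, 1500, 4000, 6500)]
disabled = [292014 + d for d in (-8000, -2500, 1000, 3500, 6000)]
```

With samples like these, |t| comes out well under 1, i.e. nowhere near
conventional significance thresholds.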

> * The performance impact with the feature enabled seems rather
> significant, especially once you exceed the number of physical
> cores (32 in this case). Then the drop is pretty clear - often
> ~50% or more.
>
> * 7e3da1c4 claims to bring the performance within 5% of the
> disabled case, but that seems not to be the case.

The commit comment says "At least in the tested case this brings
performance within 5% of when the feature is off, compared to
several times slower without this patch."  The tested case was a
read-write load, so your read-only tests do nothing to determine
whether this was the case in general for this type of load.
Partly, the patch decreases chasing through HOT chains and
increases the number of HOT updates, so there are compensating
benefits of performing early vacuum in a read-write load.

> What it however does is bringing the 'non-immediate' cases close
> to the immediate ones (before the performance drop came much
> sooner in these cases - at 16 clients).

Right.  This is, of course, just the first optimization, that we
were able to get in "under the wire" before beta, but the other
optimizations under consideration would only tend to bring the
"enabled" cases closer together in performance, not make an enabled
case perform the same as when the feature was off -- especially for
a read-only workload.

> * It's also seems to me the feature greatly amplifies the
> variability of the results, somehow. It's not uncommon to see
> results like this:
>
>  master-10-new-2   235516  331976  133316  155563  133396
>
> where after the first runs (already fairly variable) the
> performance tanks to ~50%. This happens particularly with higher
> client counts, otherwise the max-min is within ~5% of the max.
> There are a few cases where this happens without the feature
> (i.e. old master, reverted or disabled), but it's usually much
> smaller than with it enabled (immediate, 10 or 60). See the
> 'summary' sheet in the ODS spreadsheet.
>
> I don't know what's the problem here - at first I thought that
> maybe something else was running on the machine, or that
> anti-wraparound autovacuum kicked in, but that seems not to be
> the case. There's nothing like that in the postgres log (also
> included in the .tgz).

I'm inclined to suspect NUMA effects.  It would be interesting to
try with the NUMA patch and cpuset I submitted a while back or with
fixes in place for the Linux scheduler bugs which were reported
last month.  Which kernel version was this?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Use %u to print user mapping's umid and userid

2016-05-10 Thread Etsuro Fujita

On 2016/05/02 22:06, Robert Haas wrote:

On Thu, Apr 28, 2016 at 7:59 AM, Etsuro Fujita
 wrote:

On 2016/03/14 17:56, Ashutosh Bapat wrote:

On Mon, Mar 14, 2016 at 1:29 PM, Etsuro Fujita
> wrote:



 /*
  * Build the fdw_private list that will be available to the
executor.
  * Items in the list must match order in enum
FdwScanPrivateIndex.
  */
 fdw_private = list_make4(makeString(sql.data),
  retrieved_attrs,
  makeInteger(fpinfo->fetch_size),
  makeInteger(foreignrel->umid));

I don't think it's correct to use makeInteger for the foreignrel's
umid.



As long as we are using makeInteger() and inVal() pair to set and
extract the values, it should be fine.



Yeah, but my concern about this is, e.g., printing the plan when debugging
(i.e., debug_print_plan=on); the umid OID will be printed with the %ld
specifier, so on some platforms the OID might be printed wrongly.  Maybe I'm
missing something, though.
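The failure mode behind that concern: Oid is an unsigned 32-bit type, so a
value at or above 2^31 pushed through a signed 32-bit conversion (e.g. %ld on a
platform where long is 32 bits) prints as a negative number.  A quick Python
illustration of the reinterpretation (illustrative helpers, not postgres code):

```python
def print_as_signed32(oid):
    """What a 32-bit signed format specifier would show for an OID:
    values at or above 2**31 wrap around to negative numbers."""
    return oid - (1 << 32) if oid >= (1 << 31) else oid

def print_as_unsigned32(oid):
    """What the %u specifier shows: the OID value unchanged."""
    return oid

# An OID past 2**31 is perfectly legal but misprints through the signed path:
big_oid = 3000000000
```

Hence the preference for %u (or, as in the patch, not stuffing the OID through
makeInteger at all).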



That seems like a legitimate, if minor, complaint.


Here is a patch to fix this.  It is basically the same as in [1], but
I rebased the patch against HEAD and removed list_make5 and its friends,
which were added just for the postgres_fdw DML pushdown.


Sorry for the delay.  I was on vacation.

Best regards,
Etsuro Fujita

[1] http://www.postgresql.org/message-id/56e66f61.3070...@lab.ntt.co.jp
*** a/contrib/postgres_fdw/postgres_fdw.c
--- b/contrib/postgres_fdw/postgres_fdw.c
***
*** 67,74  enum FdwScanPrivateIndex
  	FdwScanPrivateRetrievedAttrs,
  	/* Integer representing the desired fetch_size */
  	FdwScanPrivateFetchSize,
- 	/* Oid of user mapping to be used while connecting to the foreign server */
- 	FdwScanPrivateUserMappingOid,
  
  	/*
  	 * String describing join i.e. names of relations being joined and types
--- 67,72 
***
*** 1198,1208  postgresGetForeignPlan(PlannerInfo *root,
  	 * Build the fdw_private list that will be available to the executor.
  	 * Items in the list must match order in enum FdwScanPrivateIndex.
  	 */
! 	fdw_private = list_make5(makeString(sql.data),
  			 remote_conds,
  			 retrieved_attrs,
! 			 makeInteger(fpinfo->fetch_size),
! 			 makeInteger(foreignrel->umid));
  	if (foreignrel->reloptkind == RELOPT_JOINREL)
  		fdw_private = lappend(fdw_private,
  			  makeString(fpinfo->relation_name->data));
--- 1196,1205 
  	 * Build the fdw_private list that will be available to the executor.
  	 * Items in the list must match order in enum FdwScanPrivateIndex.
  	 */
! 	fdw_private = list_make4(makeString(sql.data),
  			 remote_conds,
  			 retrieved_attrs,
! 			 makeInteger(fpinfo->fetch_size));
  	if (foreignrel->reloptkind == RELOPT_JOINREL)
  		fdw_private = lappend(fdw_private,
  			  makeString(fpinfo->relation_name->data));
***
*** 1234,1240  postgresBeginForeignScan(ForeignScanState *node, int eflags)
--- 1231,1241 
  	ForeignScan *fsplan = (ForeignScan *) node->ss.ps.plan;
  	EState	   *estate = node->ss.ps.state;
  	PgFdwScanState *fsstate;
+ 	RangeTblEntry *rte;
+ 	Oid			userid;
+ 	ForeignTable *table;
  	UserMapping *user;
+ 	int			rtindex;
  	int			numParams;
  
  	/*
***
*** 1250,1285  postgresBeginForeignScan(ForeignScanState *node, int eflags)
  	node->fdw_state = (void *) fsstate;
  
  	/*
! 	 * Obtain the foreign server where to connect and user mapping to use for
! 	 * connection. For base relations we obtain this information from
! 	 * catalogs. For join relations, this information is frozen at the time of
! 	 * planning to ensure that the join is safe to pushdown. In case the
! 	 * information goes stale between planning and execution, plan will be
! 	 * invalidated and replanned.
  	 */
  	if (fsplan->scan.scanrelid > 0)
! 	{
! 		ForeignTable *table;
! 
! 		/*
! 		 * Identify which user to do the remote access as.  This should match
! 		 * what ExecCheckRTEPerms() does.
! 		 */
! 		RangeTblEntry *rte = rt_fetch(fsplan->scan.scanrelid, estate->es_range_table);
! 		Oid			userid = rte->checkAsUser ? rte->checkAsUser : GetUserId();
! 
! 		fsstate->rel = node->ss.ss_currentRelation;
! 		table = GetForeignTable(RelationGetRelid(fsstate->rel));
! 
! 		user = GetUserMapping(userid, table->serverid);
! 	}
  	else
! 	{
! 		Oid			umid = intVal(list_nth(fsplan->fdw_private, FdwScanPrivateUserMappingOid));
  
! 		user = GetUserMappingById(umid);
! 		Assert(fsplan->fs_server == user->serverid);
! 	}
  
  	/*
  	 * Get connection to the foreign server.  Connection manager will
--- 1251,1274 
  	node->fdw_state = (void *) fsstate;
  
  	/*
! 	 * Get the user mapping.  This should match what ExecCheckRTEPerms() does.
! 	 *
! 	 * If scanning a foreign join, the planner ensured that joined relations
! 	 * are foreign tables 

Re: [HACKERS] HeapTupleSatisfiesToast() busted? (was atomic pin/unpin causing errors)

2016-05-10 Thread Andres Freund
Hi,

On 2016-04-29 10:38:55 -0700, Jeff Janes wrote:
> I've bisected the errors I was seeing, discussed in
> http://www.postgresql.org/message-id/CAMkU=1xqehc0ok4d+tkjfq1nvuho37pyrkhjp6q8oxifmx7...@mail.gmail.com
>
> It look like they first appear in:
>
> commit 48354581a49c30f5757c203415aa8412d85b0f70
> Author: Andres Freund 
> Date:   Sun Apr 10 20:12:32 2016 -0700
>
> Allow Pin/UnpinBuffer to operate in a lockfree manner.
>
>
> I get the errors:
>
> ERROR:  attempted to delete invisible tuple
> STATEMENT:  update foo set count=count+1,text_array=$1 where text_array @> $2
>
> And also:
>
> ERROR:  unexpected chunk number 1 (expected 2) for toast value
> 85223889 in pg_toast_16424
> STATEMENT:  update foo set count=count+1 where text_array @> $1

I'm a bit dumbfounded here. I think I found the issue, and, if so, it's
*very* longstanding. HeapTupleSatisfiesToast(), aborted transactions,
and hint bits (or oid wraparound) appear to be a broken combination.

What I'm seeing is that tuptoaster.c is trying to fetch or delete toast
chunks generated in an aborted (via crash) transaction, and that, with
the same valueid/chunk_id, a valid value exists somewhere else in the
toast table.

HeapTupleSatisfiesToast() does not do any actual tuple liveness
detection besides checking basic hint bits.  Thus, if there's a dead
tuple in the toast table's heap with a *live* reference from the main
table, toast_fetch_datum() potentially returns the wrong results.

That raises the next question: how can there be a live reference to a
dead and reused toast id?  As far as I can see the primary problem is
that GetNewOidWithIndex() uses SnapshotDirty to verify whether a tuple
is live.  As that actually verifies tuple liveness (and sets hint bits),
it'll skip over an aborted toasted datum.  A valid question is why the
hint bits set by HeapTupleSatisfiesDirty() don't prevent that from
occurring (standbys aside, which aren't involved here): it's the
constant crashes.  Looking at a tuple which triggers the "unexpected
chunk number" error, it indeed has a live and a dead version of the same
chunk id.

Interestingly, the dead rows look like:
chunk_id = 1 - t_infomask: 0xA02 - HEAP_XMAX_INVALID | HEAP_XMIN_INVALID | 
HEAP_HASVARWIDTH
chunk_id = 2 - t_infomask: 0x802 - HEAP_XMAX_INVALID | HEAP_HASVARWIDTH

My rather strong suspicion for why that is: both tuples were visited by
SnapshotDirty(), but only the first one's hint bits got persisted,
because checksums were enabled.  The second time through,
MarkBufferDirtyHint skipped work because the buffer was already dirty.

rmgr: XLOG        len (rec/tot):      0/  3809, tx:  206127725, lsn: 
1C/EB3B0910, prev 1C/EB3B08E0, desc: FPI_FOR_HINT , blkref #0: rel 
1663/16384/16433 blk 202567 FPW
that's indeed the only time the affected page is touched as visible to
pg_xlogdump after being filled...


The easy way to trigger this problem would be to have an oid wraparound
- but the WAL shows that that's not the case here.  I've not figured
that one out entirely (and won't tonight). But I do see WAL records
like:
rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 
2/12004018, prev 2/12003288, desc: NEXTOID 4302693
rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 
2/1327EA08, prev 2/1327DC60, desc: NEXTOID 4302693
i.e. two NEXTOID records allocating the same range, which obviously
doesn't seem right.  There are also, every now and then, close-by ranges:
rmgr: XLOG        len (rec/tot):      4/    30, tx:          0, lsn: 
1/9A404DB8, prev 1/9A404270, desc: NEXTOID 3311455
rmgr: XLOG        len (rec/tot):      4/    30, tx:    7814505, lsn: 
1/9A4EC888, prev 1/9A4EB9D0, desc: NEXTOID 3311461


As far as I can see, something like the above, or an OID wraparound, is
pretty much deadly for toast.
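A toy model of the failure sequence (pure Python with invented names; real
toast chunks live in a heap with hint bits, not a list): an aborted insert
leaves its chunks physically behind, the OID allocator skips over them and
reuses the valueid for a live value, and a fetch that does no liveness check
then sees both generations at once.

```python
# Each entry: (valueid, chunk_seq, data, committed)
toast_heap = []

def insert_chunks(valueid, chunks, committed):
    for seq, data in enumerate(chunks):
        toast_heap.append((valueid, seq, data, committed))

def fetch_no_liveness(valueid):
    """Like HeapTupleSatisfiesToast: match on valueid only, with no real
    liveness check -- leftovers from an aborted insert stay visible."""
    return [(seq, data) for vid, seq, data, _ in toast_heap if vid == valueid]

def fetch_with_liveness(valueid):
    """What a real visibility check would return: committed chunks only."""
    return [(seq, data) for vid, seq, data, live in toast_heap
            if vid == valueid and live]

# A crash aborts a transaction after it wrote one chunk of value 85223889...
insert_chunks(85223889, ["stale chunk"], committed=False)
# ...and the OID allocator (whose SnapshotDirty check skips the aborted
# row) later hands out the same valueid for a new, committed value:
insert_chunks(85223889, ["fresh 1", "fresh 2"], committed=True)
```

The no-liveness fetch returns three chunks where two are expected, which is the
shape of the "unexpected chunk number" error above.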

Is anybody ready with a good defense for SatisfiesToast not doing any
actual liveness checks?


Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread konstantin knizhnik
Hi,

> 1. asynchronous execution,


It seems to me that asynchronous execution can be considered an
alternative to the multithreading model (in PostgreSQL's case, the role
of threads is played by workers).
Async operations are used to get smaller overhead, but they have
scalability problems, because in most implementations of cooperative
multitasking there is just one processing thread, which cannot consume
more than one core.

So I wonder whether asynchronous execution is trying to achieve the same
goal as parallel query execution, just using a slightly different
mechanism.
You wrote:
> in the meantime, any worker that arrives at that scan node has no choice but 
> to block.

What's wrong with a worker being blocked? You can just have more workers 
(more than CPU cores) and let the others continue to do useful work.
But I agree that 
> Whether or not this will be efficient is a research question




> 2. vectorized execution

Vector I/O is very important for a columnar store.  In the IMCS
extension (in-memory columnar store), using vector operations increases
speed 10-100 times, depending on the size of the data set and the query.
Obviously the best results are for grand aggregates.

But there are some researches, for example:

http://www.vldb.org/pvldb/vol4/p539-neumann.pdf

showing that the same or an even better effect can be achieved by
generating native code for the query execution plan (which is not so
difficult now, thanks to LLVM).
It eliminates interpretation overhead and increases cache locality.
I got similar results in my own experiments accelerating SparkSQL.
Instead of native code generation I converted query plans to C code and
experimented with different data representations.  A "horizontal model"
that loads columns on demand showed better performance than a columnar
store.

As far as I know, a native code generator for PostgreSQL is currently
being developed by ISP RAN.
Sorry, the slides are in Russian:
https://pgconf.ru/media/2016/02/19/6%20Мельник%20Дмитрий%20Михайлович,%2005-02-2016.pdf

At that moment (February) they had implemented translation of only a few
PostgreSQL operators used by ExecQual, and did not yet support
aggregates.
They got about a 2x speedup on synthetic queries and a 25% improvement
on TPC-H Q1 (for Q1 the most critical part is generating native code for
the aggregates, because ExecQual itself takes only 6% of the time for
this query).
Actually, these 25% for Q1 were achieved not by dynamic code generation
but by switching from a PULL to a PUSH model in the executor.
That seems to be yet another interesting PostgreSQL executor
transformation.
As far as I know, they are going to publish the results of their work as
open source...
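The PULL-versus-PUSH distinction mentioned above, reduced to a toy (plain
Python, not executor code): in the pull model every tuple is dragged up through
a chain of next() calls, while in the push model the scan drives and the
per-tuple work fuses into one tight loop, which is what code generation then
compiles.

```python
# PULL model: each node asks its child for the next tuple.
def scan(rows):
    yield from rows

def filter_node(child, pred):
    for t in child:
        if pred(t):
            yield t

def agg_sum(child):
    return sum(child)

# PUSH model: the scan pushes tuples down the pipeline; filter and
# aggregate collapse into a single loop over the data.
def push_sum(rows, pred):
    total = 0
    for t in rows:
        if pred(t):
            total += t
    return total

rows = list(range(10))
is_even = lambda t: t % 2 == 0
```

Both shapes compute the same answer; the push form simply has no per-tuple call
overhead between operators.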



On May 9, 2016, at 8:33 PM, Robert Haas wrote:

> Hi,
> 
> I realize that we haven't gotten 9.6beta1 out the door yet, but I
> think we can't really wait much longer to start having at least some
> discussion of 9.7 topics, so I'm going to go ahead and put this one
> out there.  I believe there are other people thinking about these
> topics as well, including Andres Freund, Kyotaro Horiguchi, and
> probably some folks at 2ndQuadrant (but I don't know exactly who).  To
> make a long story short, I think there are several different areas
> where we should consider major upgrades to our executor.  It's too
> slow and it doesn't do everything we want it to do.  The main things
> on my mind are:
> 
> 1. asynchronous execution, by which I mean the ability of a node to
> somehow say that it will generate a tuple eventually, but is not yet
> ready, so that the executor can go run some other part of the plan
> tree while it waits.  This case most obviously arises for foreign
> tables, where it makes little sense to block on I/O if some other part
> of the query tree could benefit from the CPU; consider SELECT * FROM
> lt WHERE qual UNION SELECT * FROM ft WHERE qual.  It is also a problem
> for parallel query: in a parallel sequential scan, the next worker can
> begin reading the next block even if the current block hasn't yet been
> received from the OS.  Whether or not this will be efficient is a
> research question, but it can be done.  However, imagine a parallel
> scan of a btree index: we don't know what page to scan next until we
> read the previous page and examine the next-pointer.  In the meantime,
> any worker that arrives at that scan node has no choice but to block.
> It would be better if the scan node could instead say "hey, thanks for
> coming but I'm really not ready to be on-CPU just at the moment" and
> potentially allow the worker to go work in some other part of the
> query tree.  For that worker to actually find useful work to do
> elsewhere, we'll probably need it to be the case either that the table
> is partitioned or the original query will need to involve UNION ALL,
> but those are not silly cases to worry about, particularly if we get
> native partitioning in 9.7.
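The "thanks for coming but I'm not ready to be on-CPU" idea in point 1 can be sketched as a scheduler that polls plan nodes and, when a branch reports "not ready", moves on to another branch instead of blocking. A minimal hypothetical sketch (the `Node`/`poll` names are illustrative, not an actual executor API), modeling a UNION ALL over a fast local table and a slow foreign table:

```python
# Hypothetical sketch: async execution over UNION ALL branches.
from collections import deque

class Node:
    """A plan node that may or may not have a tuple ready."""
    def __init__(self, tuples, delay=0):
        self.tuples = deque(tuples)
        self.delay = delay          # simulated I/O: "not ready" for this many polls

    def poll(self):
        """Return a tuple, None meaning 'not ready yet', or StopIteration when done."""
        if self.delay > 0:
            self.delay -= 1
            return None
        return self.tuples.popleft() if self.tuples else StopIteration

def run_union_all(nodes):
    """Round-robin over branches, skipping any branch that is blocked."""
    results = []
    pending = list(nodes)
    while pending:
        for node in list(pending):
            t = node.poll()
            if t is StopIteration:
                pending.remove(node)   # branch exhausted
            elif t is not None:
                results.append(t)      # branch had a tuple ready
            # t is None: branch blocked on I/O; try another branch instead
    return results

local = Node(["lt1", "lt2"])             # local table: always ready
foreign = Node(["ft1", "ft2"], delay=3)  # foreign table: slow to produce
out = run_union_all([local, foreign])
assert sorted(out) == ["ft1", "ft2", "lt1", "lt2"]
```

While the foreign branch is blocked, the worker keeps draining the local branch; with a synchronous pull model it would have sat idle instead.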
> 
> 2. vectorized execution, by which I mean the ability of a node to
> return tuples in 

Re: [HACKERS] between not propated into a simple equality join

2016-05-10 Thread David Rowley
On 10 May 2016 at 16:34, David G. Johnston  wrote:
> On Mon, May 9, 2016 at 8:53 AM, Benedikt Grundmann
>  wrote:
>>
>> We just run into a very simple query that the planner does much worse on
>> than we thought it would (in production the table in question is ~ 100 GB).
>> It surprised us given the planner is generally quite good, so I thought I
>> share our surprise
>>
>> Setup:
>>
>> postgres_prod@proddb_testing=# select version();
>> version
>>
>> 
>>  PostgreSQL 9.2.16 on x86_64-unknown-linux-gnu, compiled by gcc (GCC)
>> 4.4.7 20120313 (Red Hat 4.4.7-16), 64-bit
>> (1 row)
>>
>> Time: 69.246 ms
>>
>> postgres_prod@proddb_testing=# create table toy_data3 (the_date date, i
>> int);
>> CREATE TABLE
>> Time: 67.096 ms
>> postgres_prod@proddb_testing=# insert into toy_data3
>>   (select current_date-(s.idx/1000), s.idx from generate_series(1,1000000)
>> as s(idx));
>> INSERT 0 1000000
>> Time: 1617.483 ms
>> postgres_prod@proddb_testing=# create index toy_data_date3 on
>> toy_data3(the_date);
>> CREATE INDEX
>> Time: 660.166 ms
>> postgres_prod@proddb_testing=# analyze toy_data3;
>> ANALYZE
>> Time: 294.984 ms
>>
>> The bad behavior:
>>
>> postgres_prod@proddb_testing=# explain analyze
>>   select * from (
>>select td1.the_date, td1.i
>> from toy_data3 td1, toy_data3 td2  where td1.the_date = td2.the_date
>> and td1.i = td2.i
>>   ) foo
>>   where the_date between current_date and current_date;
>>QUERY
>> PLAN
>>
>> ───
>>  Hash Join  (cost=55.49..21980.50 rows=1 width=8) (actual
>> time=0.336..179.374 rows=999 loops=1)
>>Hash Cond: ((td2.the_date = td1.the_date) AND (td2.i = td1.i))
>>->  Seq Scan on toy_data3 td2  (cost=0.00..14425.00 rows=1000000
>> width=8) (actual time=0.007..72.510 rows=1000000 loops=1)
>>->  Hash  (cost=40.44..40.44 rows=1003 width=8) (actual
>> time=0.321..0.321 rows=999 loops=1)
>>  Buckets: 1024  Batches: 1  Memory Usage: 40kB
>>  ->  Index Scan using toy_data_date3 on toy_data3 td1
>> (cost=0.01..40.44 rows=1003 width=8) (actual time=0.018.
>>Index Cond: ((the_date >= ('now'::cstring)::date) AND
>> (the_date <= ('now'::cstring)::date))
>>  Total runtime: 179.440 ms
>> (8 rows)
>>
>> Time: 246.094 ms
>>
>> Notice the sequential scan on td2 (shown in red in the original message).
>> Which is sad because one would like the planner to realize that it could
>> propagate the index constraint onto td2; that is, apply the index condition
>> (shown in green) on both sides of the join.
>>
>
> FWIW
>
> This is my plan result:
> version
> PostgreSQL 9.5.2 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
> 4.8.2-19ubuntu1) 4.8.2, 64-bit
> All default settings
>
> using "BETWEEN"
>
>  QUERY PLAN
> Nested Loop  (cost=0.86..48.91 rows=1 width=8) (actual time=0.042..168.512
> rows=999 loops=1)
>   ->  Index Scan using toy_data_date3 on toy_data3 td1  (cost=0.43..8.46
> rows=1 width=8) (actual time=0.022..1.388 rows=999 loops=1)
> Index Cond: ((the_date >= ('now'::cstring)::date) AND (the_date <=
> ('now'::cstring)::date))
>   ->  Index Scan using toy_data_date3 on toy_data3 td2  (cost=0.42..40.44
> rows=1 width=8) (actual time=0.078..0.160 rows=1 loops=999)
> Index Cond: (the_date = td1.the_date)
> Filter: (td1.i = i)
> Rows Removed by Filter: 998
> Planning time: 0.353 ms
> Execution time: 169.692 ms
>
> using "="
>
> QUERY PLAN
> Hash Join  (cost=49.89..90.46 rows=1 width=8) (actual time=2.320..5.652
> rows=999 loops=1)
>   Hash Cond: (td1.i = td2.i)
>   ->  Index Scan using toy_data_date3 on toy_data3 td1  (cost=0.43..37.37
> rows=967 width=8) (actual time=0.014..1.168 rows=999 loops=1)
> Index Cond: (the_date = ('now'::cstring)::date)
>   ->  Hash  (cost=37.37..37.37 rows=967 width=8) (actual time=2.292..2.292
> rows=999 loops=1)
> Buckets: 1024  Batches: 1  Memory Usage: 48kB
> ->  Index Scan using toy_data_date3 on toy_data3 td2
> (cost=0.43..37.37 rows=967 width=8) (actual time=0.008..1.183 rows=999
> loops=1)
>   Index Cond: (the_date = ('now'::cstring)::date)
> Planning time: 0.326 ms
> Execution time: 6.673 ms
>
> I was hoping to be able to say more but alas cannot find the words.
>
> I'm surprised by the estimate of 1 rows for the td1 index scan in my 9.5
> query - and also why the 9.2 query would choose to sequential scan hash join
> in favor of what seems to be a superior index scan nested loop on a fraction
> of the table.
>
> The fact that the BETWEEN doesn't get transitively applied to td2 through
> the td1 = td2 condition surprises me less...though whether the limitation
> is due to theory or implementation I do not know.

Quite simply the equivalence class mechanism 
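The equivalence-class mechanism mentioned here derives new *equality* clauses from chains of equalities, but does not push *range* predicates such as BETWEEN across a class. A hypothetical sketch of what such propagation could look like (`equivalence_classes` and `propagate_ranges` are illustrative names, not PostgreSQL planner APIs):

```python
# Illustrative sketch only: deriving implied range predicates across
# equivalence classes built from join equalities.

def equivalence_classes(equalities):
    """Union-find over column equalities, returning sets of equal columns."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)
    for a, b in equalities:
        union(a, b)
    groups = {}
    for col in parent:
        groups.setdefault(find(col), set()).add(col)
    return list(groups.values())

def propagate_ranges(equalities, ranges):
    """Copy each range predicate to every member of its equivalence class."""
    derived = dict(ranges)
    for cls in equivalence_classes(equalities):
        for col, bounds in ranges.items():
            if col in cls:
                for other in cls:
                    derived.setdefault(other, bounds)
    return derived

# The example from this thread: td1.the_date = td2.the_date, plus a
# BETWEEN on td1.the_date, would imply the same bounds on td2.the_date.
eqs = [("td1.the_date", "td2.the_date"), ("td1.i", "td2.i")]
rngs = {"td1.the_date": ("current_date", "current_date")}
out = propagate_ranges(eqs, rngs)
assert out["td2.the_date"] == ("current_date", "current_date")
```

Whether doing this in the real planner pays off is a cost question (each derived clause must be proven safe and adds estimation work), which is presumably part of why only equalities are propagated today.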
