Re: [HACKERS] UPDATE of partition key

2017-11-13 Thread David Rowley

24. I know from reading the thread this name has changed before, but I
think delete_skipped seems like the wrong name for this variable in:

if (delete_skipped)
*delete_skipped = true;

"Skipped" is the wrong word here, as it implies we had some sort of
choice and decided not to perform the delete. However, that's not the
case when the tuple was concurrently deleted. Would it not be better to
call it "tuple_deleted", or even "success" and reverse the logic? It's
also a bit confusing that you're setting this to "skipped" before
anything has happened. It would be nicer if there were a better way to
do this whole thing, as it's a bit of a wart in the code. I understand
why the code exists though.

Also, I wonder if it's better to always pass a boolean here to save
having to test for NULL before setting it. That way, you might consider
setting success = false just before the return NULL, then doing
success = true after the tuple is gone.
Failing that, putting something like success = false; /* not yet! */
where you're currently doing the if (delete_skipped) test might also be
better.
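
Roughly what I have in mind, as a sketch only (the surrounding
ExecDelete() code is elided and the name is just a suggestion):

if (tuple_deleted)
    *tuple_deleted = false;    /* nothing deleted yet */

/* ... triggers, EPQ checks and the heap_delete() itself ... */

if (tuple_deleted)
    *tuple_deleted = true;     /* set only once the DELETE really happened */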

25. Comment "we should" should be "we must".

/*
* For some reason if DELETE didn't happen (for e.g. trigger
* prevented it, or it was already deleted by self, or it was
* concurrently deleted by another transaction), then we should
* skip INSERT as well, otherwise, there will be effectively one
* new row inserted.

Maybe just:
/* If the DELETE operation was unsuccessful, then we must not
* perform the INSERT into the new partition.

"for e.g." is not really correct in English. "For example, ..." or
just "e.g. ..." is correct. If you de-abbreviate the e.g. then you've
written "For exempli gratia", which translates to "For for example".

26. You're not really explaining what's going on here:

if (mtstate->mt_transition_capture)
saved_tcs_map = mtstate->mt_transition_capture->tcs_map;

You have a comment later saying you're about to "Revert back to the
transition capture map", but I missed the part that explains why it
gets modified in the first place.

27. The comment does not explain how we're skipping the partition
constraint check in:

* We have already checked partition constraints above, so skip
* checking them here.

Maybe something like:

* We've already checked the partition constraint above, however, we
* must still ensure the tuple passes all other constraints, so we'll
* call ExecConstraints() and have it validate all remaining checks.

28. For table WITH OIDs, the OID should probably follow the new tuple
for partition-key-UPDATEs.

CREATE TABLE p (a BOOL NOT NULL, b INT NOT NULL) PARTITION BY LIST (a)
WITH OIDS;
CREATE TABLE P_true PARTITION OF p FOR VALUES IN('t');
CREATE TABLE P_false PARTITION OF p FOR VALUES IN('f');
INSERT INTO p VALUES('t', 10);
SELECT tableoid::regclass,oid,a FROM p;
 tableoid |  oid  | a
----------+-------+---
 p_true   | 16792 | t
(1 row)

UPDATE p SET a = 'f'; -- partition-key-UPDATE (oid has changed (it
probably shouldn't have))
SELECT tableoid::regclass,oid,a FROM p;
 tableoid |  oid  | a
----------+-------+---
 p_false  | 16793 | f
(1 row)

UPDATE p SET b = 20; -- non-partition-key-UPDATE (oid remains the same)

SELECT tableoid::regclass,oid,a FROM p;
 tableoid |  oid  | a
----------+-------+---
 p_false  | 16793 | f
(1 row)

I'll try to continue with the review tomorrow, but I think some other
reviews are also looming.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] No mention of CREATE STATISTICS in event trigger docs

2017-11-10 Thread David Rowley
A patch to fix this is attached.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


event_trigger_statistics_doc.patch
Description: Binary data



Re: [HACKERS] path toward faster partition pruning

2017-11-09 Thread David Rowley
On 10 November 2017 at 16:30, Kyotaro HORIGUCHI
<horiguchi.kyot...@lab.ntt.co.jp> wrote:
> In 0002, bms_add_range has a bit naive-looking loop
>
> +   while (wordnum <= uwordnum)
> +   {
> +   bitmapword mask = (bitmapword) ~0;
> +
> +   /* If working on the lower word, zero out bits below 'lower'. */
> +   if (wordnum == lwordnum)
> +   {
> +   int lbitnum = BITNUM(lower);
> +   mask >>= lbitnum;
> +   mask <<= lbitnum;
> +   }
> +
> +   /* Likewise, if working on the upper word, zero bits above 'upper' */
> +   if (wordnum == uwordnum)
> +   {
> +   int ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(upper) + 1);
> +   mask <<= ushiftbits;
> +   mask >>= ushiftbits;
> +   }
> +
> +   a->words[wordnum++] |= mask;
> +   }
>
> Without some aggressive optimization, the loop spends most of its time
> on check-and-jump work that accomplishes nothing, especially with many
> partitions, and it is somewhat unintuitive.
>
> The following uses a slightly tricky bitmap operation but
> is straightforward as a whole.
>
> =
> /* fill the bits upper from BITNUM(lower) (0-based) of the first word */
> a->workds[wordnum++] += ~(bitmapword)((1 << BITNUM(lower)) - 1);
>
> /* fill up intermediate words */
> while (wordnum < uwordnum)
>a->words[wordnum++] = ~(bitmapword) 0;
>
> /* fill up to BITNUM(upper) bit (0-based) of the last word */
> a->workds[wordnum++] |=
>  (~(bitmapword) 0) >> (BITS_PER_BITMAPWORD - (BITNUM(upper) - 1));
> =

No objections here to making bms_add_range() perform better, but this
is not going to work when lwordnum == uwordnum. You'd need to special
case that. I didn't think it was worth the trouble, but maybe it is...

I assume the += should be |=.
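
For what it's worth, the special case is just the two maskings applied
to a single word, something like this (untested sketch):

if (lwordnum == uwordnum)
{
    int lbitnum = BITNUM(lower);
    int ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(upper) + 1);
    bitmapword mask = (bitmapword) ~0;

    mask = (mask >> lbitnum) << lbitnum;        /* zero bits below 'lower' */
    mask = (mask << ushiftbits) >> ushiftbits;  /* zero bits above 'upper' */
    a->words[lwordnum] |= mask;
}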

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] path toward faster partition pruning

2017-11-07 Thread David Rowley
On 7 November 2017 at 01:52, David Rowley <david.row...@2ndquadrant.com> wrote:
> Thanks. I'll look over it all again starting my Tuesday morning. (UTC+13)

Hi Amit,

I had another look over this today. Apologies if any of the review seems petty.

Here goes:

1. The if test seems to be checking for a child that's a partitioned
table, but it does so by testing for a non-NULL part_scheme.

/*
* If childrel is itself partitioned, add it and its partitioned
* children to the list being propagated up to the root rel.
*/
if (childrel->part_scheme && rel->part_scheme)

Should this code use IS_PARTITIONED_REL() instead? It seems a bit
strange to test for a NULL part_scheme.
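
i.e. presumably something like:

if (IS_PARTITIONED_REL(childrel) && IS_PARTITIONED_REL(rel))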

2. There are a couple of mistakes in my bms_add_range() code. I've
attached bms_add_range_fix.patch. Can you apply this to your tree?

3. This assert seems to be Asserting the same thing twice:

Assert(rel->live_partitioned_rels != NIL &&
   list_length(rel->live_partitioned_rels) > 0);

A List with length == 0 is always NIL.
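
i.e. the following should be enough:

Assert(rel->live_partitioned_rels != NIL);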

4. get_partitions_from_clauses(), can you comment why you perform the
list_concat() there.

I believe this is there so that the partition bound from the parent is
passed down to the child so that we can properly eliminate all child
partitions when the 2nd level of partitioning is using the same
partition key as the 1st level. I think this deserves a paragraph of
comment to explain this.

5. Please add a comment to explain what's going on here in
classify_partition_bounding_keys()

if (partattno == 0)
{
partexpr = lfirst(partexprs_item);
partexprs_item = lnext(partexprs_item);
}

It looks like, similar to index expressions, partition expressions use
attno 0 to mark that the next expression should be consumed from the
list.

Does this need validation that there are enough partexprs_item items,
as is done in get_range_key_properties()? Or is this validated
somewhere else?

6. The comment claims the if test will check something which it does
not seem to check:

/*
* Redundant key elimination using btree-semantics based tricks.
*
* Only list and range partitioning use btree operator semantics, so
* skip otherwise.   Also, if there are expressions whose value is yet
* unknown, skip this step, because we need to compare actual values
* below.
*/
memset(keyclauses, 0, PARTITION_MAX_KEYS * sizeof(List *));
if (partkey->strategy == PARTITION_STRATEGY_LIST ||
partkey->strategy == PARTITION_STRATEGY_RANGE)

I was expecting this to be skipped when the clauses contained a
non-const, but it does not seem to.

7. Should be "compare them"

/*
* If the leftarg and rightarg clauses' constants are both of the type
* expected by "op" clause's operator, then compare then using the
* latter's comparison function.
*/

But if I look at the code, "compare then using the latter's comparison
function" is not true; it seems to use op's comparison function, not
rightarg's. In most of the calls op and rightarg are the same, but not
in all of them. The function shouldn't make that assumption, even if
op always happened to be the same as rightarg.

8. remove_redundant_clauses() needs an overview comment of what the
function does.

9. The comment should explain what we would do in the case of key < 3
AND key <= 2 using some examples.

/* try to keep only one of <, <= */
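
Perhaps something along these lines (wording just a suggestion):

/*
 * Keep only the more restrictive of the two, e.g. given "key < 3 AND
 * key <= 2" we can drop "key < 3", since "key <= 2" already implies it.
 */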

10. Why does this loop run backwards?

for (s = BTMaxStrategyNumber; --s >= 0;)

Why not just:

for (s = 0; s < BTMaxStrategyNumber; s++)

I can't see a special reason for it to run backwards. It seems unusual,
but if there's a good reason that I've failed to realise, then maybe
it's worth a comment.

11. Please comment on why *constfalse = true is set here:

if (!chk || s == (BTEqualStrategyNumber - 1))
continue;

if (partition_cmp_args(partopfamily, partopcintype, chk, eq, chk,
   &test_result))
{
if (!test_result)
{
*constfalse = true;
return;
}
/* discard the redundant key. */
xform[s] = NULL;
}

Looks like we'd hit this in a case such as: WHERE key = 1 AND key > 1.

Also, please add a comment where the redundant key is discarded, and
maybe explain that equality is more useful than the other strategies
when there's an overlap.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


bms_add_range_fix.patch
Description: Binary data



Re: [HACKERS] path toward faster partition pruning

2017-11-06 Thread David Rowley
On 7 November 2017 at 01:52, David Rowley <david.row...@2ndquadrant.com> wrote:
> Thanks. I'll look over it all again starting my Tuesday morning. (UTC+13)

I have a little more review to share:

1. Missing "in" in comment. Should be "mentioned in"

 * get_append_rel_partitions
 * Return the list of partitions of rel that pass the clauses mentioned
 * rel->baserestrictinfo

2. The variable should be declared in the inner scope in the following fragment:

void
set_basic_child_rel_properties(PlannerInfo *root,
   RelOptInfo *rel,
   RelOptInfo *childrel,
   AppendRelInfo *appinfo)
{
AttrNumber attno;

if (rel->part_scheme)
{

which makes the code the same as where you moved it from.

3. Normally lfirst() is assigned to a variable at the start of a
foreach() loop. You have code which does not follow this.

foreach(lc, clauses)
{
Expr   *clause;
int i;

if (IsA(lfirst(lc), RestrictInfo))
{
RestrictInfo *rinfo = lfirst(lc);

You could assign this to a Node * since the type is unknown to you at
the start of the loop.
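
i.e. something like:

foreach(lc, clauses)
{
    Node       *node = lfirst(lc);

    if (IsA(node, RestrictInfo))
    {
        RestrictInfo *rinfo = (RestrictInfo *) node;

        /* ... as before ... */
    }
}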

4.
/*
* Useless if what we're thinking of as a constant is actually
* a Var coming from this relation.
*/
if (bms_is_member(rel->relid, constrelids))
continue;

Should this be moved to just above the op_strict() test? This one seems cheaper.

5. Typo "paritions": /* No clauses to prune paritions, so scan all
partitions. */

But thinking about it more, the comment should say something more along
the lines of /* No useful clauses for partition pruning. Scan all
partitions. */

The key difference is that there might be clauses, just without Consts.

Actually, the more I look at get_append_rel_partitions(), the more I
think it would be better if you re-shaped that if/else if test so that
it only performs the loop over the partindexes if they've been set.

I ended up with the attached version of the function after moving
things around a little bit.

I'm still reviewing but thought I'd share this part so far.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
/*
 * get_append_rel_partitions
 *		Return the list of partitions of rel that pass the clauses mentioned
 *		in rel->baserestrictinfo. An empty list is returned if no matching
 *		partitions were found.
 *
 * Returned list contains the AppendRelInfos of chosen partitions.
 */
static List *
get_append_rel_partitions(PlannerInfo *root,
		  RelOptInfo *rel,
		  RangeTblEntry *rte)
{
	List   *partclauses;
	bool	contains_const,
			constfalse;

	/*
	 * Get the clauses that match the partition key, including information
	 * about any nullness tests against partition keys.  Set keynullness to
	 * an invalid value of NullTestType, which 0 is not.
	 */
	partclauses = match_clauses_to_partkey(rel,
										   list_copy(rel->baserestrictinfo),
										   &contains_const,
										   &constfalse);

	if (!constfalse)
	{
		Relation		parent = heap_open(rte->relid, NoLock);
		PartitionDesc	partdesc = RelationGetPartitionDesc(parent);
		Bitmapset	   *partindexes;
		List		   *result = NIL;
		int				i;

		/*
		 * If we have matched clauses that contain at least one constant
		 * operand, then use these to prune partitions.
		 */
		if (partclauses != NIL && contains_const)
			partindexes = get_partitions_from_clauses(parent, partclauses);

		/*
		 * Else there are no clauses that are useful to prune any partitions,
		 * so we must scan all partitions.
		 */
		else
			partindexes = bms_add_range(NULL, 0, partdesc->nparts - 1);

		/* Fetch the partition appinfos. */
		i = -1;
		while ((i = bms_next_member(partindexes, i)) >= 0)
		{
			AppendRelInfo *appinfo = rel->part_appinfos[i];

#ifdef USE_ASSERT_CHECKING
			RangeTblEntry *rte = planner_rt_fetch(appinfo->child_relid, root);

			/*
			 * Must be the intended child's RTE here, because appinfos are ordered
			 * the same way as partitions in the partition descriptor.
			 */
			Assert(partdesc->oids[i] == rte->relid);
#endif

			result = lappend(result, appinfo);
		}

		/* Record which partitions must be scanned. */
		rel->live_part_appinfos = result;

		heap_close(parent, NoLock);

		return result;
	}

	return NIL;
}



Re: [HACKERS] path toward faster partition pruning

2017-11-06 Thread David Rowley
On 6 November 2017 at 23:01, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
> OK, I have gotten rid of the min/max partition index interface and instead
> adopted the bms_add_range() approach by including your patch to add the
> same in the patch set (which is now 0002 in the whole set).  I have to
> admit that it's simpler to understand the new code with just Bitmapsets to
> look at, but I'm still a bit concerned about materializing the whole set
> right within partition.c, although we can perhaps optimize it later.

Thanks for making that change. The code looks much more simple now.

For performance, if you're worried about a very large number of
partitions, then I think you're better off using bms_next_member()
rather than bms_first_member() (this likely applies globally, but you
needn't worry about the other call sites).

The problem with bms_first_member() is that, on each call, it must loop
over all the zeroed leading words before it finds a set bit, whereas
bms_next_member() starts from the word containing the previously
returned member. There will likely be a pretty big performance
difference between the two when processing a large Bitmapset.
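
For reference, the non-destructive iteration pattern is just (variable
names for illustration):

i = -1;
while ((i = bms_next_member(partindexes, i)) >= 0)
{
    /* do something with partition index i */
}

and, unlike bms_first_member(), it does not modify the set while
iterating.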

> Attached updated set of patches, including the fix to make the new pruning
> code handle Boolean partitioning.

Thanks. I'll look over it all again starting my Tuesday morning. (UTC+13)



-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Removing useless DISTINCT clauses

2017-11-06 Thread David Rowley
In [1] we made a change to process the GROUP BY clause to remove any
group by items that are functionally dependent on some other GROUP BY
items.

This really just checks if a table's PK columns are entirely present
in the GROUP BY clause and removes anything else belonging to that
table.

All this seems to work well, but I totally failed to consider that the
exact same thing applies to DISTINCT too.

Over in [2], Rui Liu mentions that the planner could do a better job
for his case.

Using Rui Liu's example:

CREATE TABLE test_tbl ( k INT PRIMARY KEY, col text);
INSERT into test_tbl select generate_series(1,1000), 'test';

Master:

postgres=# explain analyze verbose select distinct col, k from
test_tbl order by k limit 1000;
   QUERY PLAN
-
 Limit  (cost=1658556.19..1658563.69 rows=1000 width=9) (actual
time=8934.962..8935.495 rows=1000 loops=1)
   Output: col, k
   ->  Unique  (cost=1658556.19..1733557.50 rows=1175 width=9)
(actual time=8934.961..8935.460 rows=1000 loops=1)
 Output: col, k
 ->  Sort  (cost=1658556.19..1683556.63 rows=1175 width=9)
(actual time=8934.959..8935.149 rows=1000 loops=1)
   Output: col, k
   Sort Key: test_tbl.k, test_tbl.col
   Sort Method: external merge  Disk: 215128kB
   ->  Seq Scan on public.test_tbl  (cost=0.00..154056.75
rows=1175 width=9) (actual time=0.062..1901.728 rows=1000
loops=1)
 Output: col, k
 Planning time: 0.092 ms
 Execution time: 8958.687 ms
(12 rows)

Patched:

postgres=# explain analyze verbose select distinct col, k from
test_tbl order by k limit 1000;

 QUERY PLAN
--
 Limit  (cost=0.44..34.31 rows=1000 width=9) (actual time=0.030..0.895
rows=1000 loops=1)
   Output: col, k
   ->  Unique  (cost=0.44..338745.50 rows=1175 width=9) (actual
time=0.029..0.814 rows=1000 loops=1)
 Output: col, k
 ->  Index Scan using test_tbl_pkey on public.test_tbl
(cost=0.44..313745.06 rows=1175 width=9) (actual time=0.026..0.452
rows=1000 loops=1)
   Output: col, k
 Planning time: 0.152 ms
 Execution time: 0.985 ms
(8 rows)

A patch to implement this is attached.

I'll add it to the Jan commitfest. (I don't expect anyone to look at
this before then).


[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d4c3a156cb46dcd1f9f97a8011bd94c544079bb5

[2] 
https://www.postgresql.org/message-id/flat/CAKJS1f9q0j3BgMUsDbtf9%3DecfVLnqvkYB44MXj0gpVuamcN8Xw%40mail.gmail.com#CAKJS1f9q0j3BgMUsDbtf9=ecfvlnqvkyb44mxj0gpvuamcn...@mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


remove_useless_distinct_clauses.patch
Description: Binary data



Re: [Sender Address Forgery]Re: [HACKERS] path toward faster partition pruning

2017-11-05 Thread David Rowley
On 6 November 2017 at 17:30, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
> On 2017/11/03 13:32, David Rowley wrote:
>> On 31 October 2017 at 21:43, Amit Langote <langote_amit...@lab.ntt.co.jp> 
>> wrote:
>> 1. This comment seem wrong.
>>
>> /*
>> * Since the clauses in rel->baserestrictinfo should all contain Const
>> * operands, it should be possible to prune partitions right away.
>> */
>
> Yes.  I used to think it was true, then realized it isn't and updated the
> code to get rid of that assumption, but I forgot updating this comment.
> Fixed.
>
>> How about PARTITION BY RANGE (a) and SELECT * FROM parttable WHERE a > b; ?
>> baserestrictinfo in this case will contain a single RestrictInfo with
>> an OpExpr containing two Var args and it'll come right through that
>> function too.

...

> We won't be able to use such a clause for pruning at all; neither
> planning-time pruning nor execution-time pruning.  Am I missing something?

I just meant the comment was wrong.

>
> The design with min/max partition index interface to the partition.c's new
> partition-pruning facility is intentional.  You can find hints about how
> such a design came about in the following Robert's email:
>
> https://www.postgresql.org/message-id/CA%2BTgmoYcv_MghvhV8pL33D95G8KVLdZOxFGX5dNASVkXO8QuPw%40mail.gmail.com

Yeah, I remember reading that before I had looked at the code. I
disagree with Robert on this. The fact that the min/max range gets
turned into a list of everything in that range in
get_append_rel_partitions means all the advantages of storing the
partitions as a range are voided. If you could have kept it as a range
the entire time, then that might be different, but it seems you need to
materialize the entire range in order to sort the partitions into
order. I've included Robert just in case he wants to take a look at the
code that resulted from that design. Maybe something is not following
what he had in mind, or maybe he'll change his mind based on the code
that resulted.


> For range queries, it is desirable for the partitioning module to return
> the set of qualifying partitions that are contiguous in a compact (O(1))
> min/max representation than having to enumerate all those indexes in the
> set.  It's nice to avoid iterating over that set twice -- once when
> constructing the set in the partitioning module and then again in the
> caller (in this case, planner) to perform the planning-related tasks per
> selected partition.

The idea is that you still get the min and max from the bsearch, but
then use bms_add_range() to populate a bitmapset of the matching
partitions. The big-O notation of the search shouldn't change.
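
i.e. roughly (variable names hypothetical):

/* min_part_idx / max_part_idx as returned by the binary searches */
partindexes = bms_add_range(NULL, min_part_idx, max_part_idx);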

> We need the other_parts Bitmapset too, because selected partitions may not
> always be contiguous, even in the case of range partitioning, if there are
> OR clauses and the possibility that the default partition is also
> selected.  While computing the selected partition set from a given set of
> clauses, partitioning code tries to keep the min/max representation as
> long as it makes sense to and once the selected partitions no longer
> appear to be contiguous it switches to the Bitmapset mode.

Yip, I understand that. I just think there's no benefit to keeping the
min/max representation: since it needs to be materialized into the
entire range at some point, it might as well be done as soon as
possible using a Bitmapset, which would save having all the
partset_union, partset_intersect, partset_range_empty,
partset_range_overlap and partset_range_adjacent code. You'd end up
just using bms_union() and bms_intersect(), plus bms_add_range() to
handle the min/max bound you get from the bsearch.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] path toward faster partition pruning

2017-11-05 Thread David Rowley
On 31 October 2017 at 21:43, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
> Attached updated version of the patches

match_clauses_to_partkey() needs to allow for the way quals on Bool
columns are represented.

create table pt (a bool not null) partition by list (a);
create table pt_true partition of pt for values in('t');
create table pt_false partition of pt for values in('f');
explain select * from pt where a = true;
QUERY PLAN
--
 Append  (cost=0.00..76.20 rows=2810 width=1)
   ->  Seq Scan on pt_false  (cost=0.00..38.10 rows=1405 width=1)
 Filter: a
   ->  Seq Scan on pt_true  (cost=0.00..38.10 rows=1405 width=1)
 Filter: a
(5 rows)

match_clause_to_indexcol() shows an example of how to handle this.

explain select * from pt where a = false;

will need to be allowed too. This works slightly differently.
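
The gist of it, as a rough sketch (this isn't the actual indexing code):

/*
 * A bare boolean Var "a" in the qual list is equivalent to "a = true",
 * and "NOT a" to "a = false", so the matching code needs to recognise
 * those forms rather than only looking for an OpExpr on the partition
 * key.
 */
if (IsA(clause, Var))
{
    /* treat as "var = true" */
}
else if (not_clause((Node *) clause) &&
         IsA(get_notclausearg((Expr *) clause), Var))
{
    /* treat as "var = false" */
}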

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] path toward faster partition pruning

2017-11-05 Thread David Rowley
On 3 November 2017 at 17:32, David Rowley <david.row...@2ndquadrant.com> wrote:
> 2. This code is way more complex than it needs to be.
>
> if (num_parts > 0)
> {
> int j;
>
> all_indexes = (int *) palloc(num_parts * sizeof(int));
> j = 0;
> if (min_part_idx >= 0 && max_part_idx >= 0)
> {
> for (i = min_part_idx; i <= max_part_idx; i++)
> all_indexes[j++] = i;
> }
> if (!bms_is_empty(other_parts))
> while ((i = bms_first_member(other_parts)) >= 0)
> all_indexes[j++] = i;
> if (j > 1)
> qsort((void *) all_indexes, j, sizeof(int), intcmp);
> }
>
> It looks like the min/max partition stuff is just complicating things
> here. If you need to build this array of all_indexes[] anyway, I don't
> quite understand the point of the min/max. It seems like min/max would
> probably work a bit nicer if you didn't need the other_parts
> BitmapSet, so I recommend just getting rid of min/max completely and
> just have a BitmapSet with bit set for each partition's index you
> need, you'd not need to go to the trouble of performing a qsort on an
> array and you could get rid of quite a chunk of code too.
>
> The entire function would then not be much more complex than:
>
> partindexes = get_partitions_from_clauses(parent, partclauses);
>
> while ((i = bms_first_member(partindexes)) >= 0)
> {
> AppendRelInfo *appinfo = rel->part_appinfos[i];
> result = lappend(result, appinfo);
> }
>
> Then you can also get rid of your intcmp() function too.

I've read a bit more of the patch, and I now think even more strongly
that the min/max stuff should be removed.

I understand that you'll be bsearching for a lower and upper bound for
cases like:

SELECT * FROM pt WHERE key BETWEEN 10 and 20;

but it looks like the min and max range stuff is thrown away if the
query is written as:

SELECT * FROM pt WHERE key BETWEEN 10 and 20 OR key BETWEEN 30 AND 40;

From reading the code, it seems like partset_union() would be called in
this case, and if the min/max of each were consecutive then the min/max
ranges would get merged, but there's really a lot of code to support
this. I think it would be much better to invent bms_add_range() and
just use a Bitmapset to store the partition indexes to scan. You could
simply use bms_union() for OR cases and bms_intersect() for AND cases.
It seems this would allow removal of this complex code. It looks like
this would allow you to remove all the partset_range_* macros too.
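
For the OR example above, that would boil down to something like
(hypothetical index variables):

parts = bms_add_range(NULL, lower_idx1, upper_idx1);
parts = bms_union(parts, bms_add_range(NULL, lower_idx2, upper_idx2));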

I've attached a patch which implements bms_add_range(), which will save
you from having to write the tight loops of bms_add_member() calls such
as the ones in partset_union(). Those would not be so great if there
were a huge number of partitions, as the Bitmapset->words[] array could
be expanded many more times than required. bms_add_range() will handle
that much more efficiently, with a maximum of one repalloc() for the
whole operation. It would also likely be faster since it works at the
bitmapword level rather than the bit level.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


0001-Add-bms_add_range-to-add-members-within-the-specifie.patch
Description: Binary data



Re: [HACKERS] ArrayLists instead of List (for some things)

2017-11-02 Thread David Rowley
On 3 November 2017 at 03:26, Craig Ringer <cr...@2ndquadrant.com> wrote:
> On 2 November 2017 at 22:22, David Rowley <david.row...@2ndquadrant.com> 
> wrote:
>> Maybe, but the new implementation is not going to do well with places
>> where we perform lcons(). Probably many of those places could be
>> changed to lappend(), but I bet there's plenty that need prepend.
>
> Yeah, and it's IMO significant that pretty much every nontrivial
> system with ADTs offers multiple implementations of list data
> structures, often wrapped with a common API.
>
> Java's Collections, the STL, you name it.

I've never really looked at much Java, but I've done quite a bit of
dotnet stuff in my time, and it has a common ICollection interface
that is implemented by a bunch of data structure classes, something
like:

ConcurrentDictionary (hash table)
Dictionary (hash table)
HashSet (hash table)
LinkedList (some kinda linked list)
List (Arrays, probably)
SortedDictionary (bst, I think)
SortedList (no idea, perhaps a btree)
SortedSet
Bag (no idea, but does not care about any order)

Probably there are more, but the point is that they obviously don't
believe there's a one-size-fits-all type. I don't think we should do
that either. However, I've proved nothing on the performance front
yet, so that should probably be my next step.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] path toward faster partition pruning

2017-11-02 Thread David Rowley
On 31 October 2017 at 21:43, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
> Attached updated version of the patches addressing some of your comments

I've spent a bit of time looking at these but I'm out of time for now.

So far I have noted down the following;

1. This comment seem wrong.

/*
* Since the clauses in rel->baserestrictinfo should all contain Const
* operands, it should be possible to prune partitions right away.
*/

How about PARTITION BY RANGE (a) and SELECT * FROM parttable WHERE a > b; ?
baserestrictinfo in this case will contain a single RestrictInfo with
an OpExpr containing two Var args and it'll come right through that
function too.

2. This code is way more complex than it needs to be.

if (num_parts > 0)
{
int j;

all_indexes = (int *) palloc(num_parts * sizeof(int));
j = 0;
if (min_part_idx >= 0 && max_part_idx >= 0)
{
for (i = min_part_idx; i <= max_part_idx; i++)
all_indexes[j++] = i;
}
if (!bms_is_empty(other_parts))
while ((i = bms_first_member(other_parts)) >= 0)
all_indexes[j++] = i;
if (j > 1)
qsort((void *) all_indexes, j, sizeof(int), intcmp);
}

It looks like the min/max partition stuff is just complicating things
here. If you need to build this array of all_indexes[] anyway, I don't
quite understand the point of the min/max. It seems like min/max would
probably work a bit nicer if you didn't need the other_parts
Bitmapset. So, I recommend just getting rid of min/max completely and
having a Bitmapset with a bit set for each partition index you need;
that way you'd not need to go to the trouble of performing a qsort on
an array, and you could get rid of quite a chunk of code too.

The entire function would then not be much more complex than:

partindexes = get_partitions_from_clauses(parent, partclauses);

while ((i = bms_first_member(partindexes)) >= 0)
{
AppendRelInfo *appinfo = rel->part_appinfos[i];
result = lappend(result, appinfo);
}

Then you can also get rid of your intcmp() function too.

3. The following code has the wrong size calculation:

memset(keynullness, -1, PARTITION_MAX_KEYS * sizeof(NullTestType *));

should be PARTITION_MAX_KEYS * sizeof(NullTestType). It might have
worked on your machine if you're compiling as 32 bit.

I'll continue on with the review in the next few days.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] path toward faster partition pruning

2017-11-02 Thread David Rowley
On 31 October 2017 at 21:43, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
> Attached updated version of the patches addressing some of your comments
> above and fixing a bug that Rajkumar reported [1].  As mentioned there,
> I'm including here a patch (the 0005 of the attached) to tweak the default
> range partition constraint to be explicit about null values that it might
> contain.  So, there are 6 patches now and what used to be patch 0005 in
> the previous set is patch 0006 in this version of the set.

Hi Amit,

I've been looking over this. I see the latest patches conflict with
cf7ab13bf. Can you send patches rebased on current master?

Thanks

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] ArrayLists instead of List (for some things)

2017-11-02 Thread David Rowley
On 3 November 2017 at 03:27, Stephen Frost <sfr...@snowman.net> wrote:
> * David Rowley (david.row...@2ndquadrant.com) wrote:
>> We could get around a few of these problems if Lists were implemented
>> internally as arrays, however, arrays are pretty bad if we want to
>> delete an item out the middle as we'd have to shuffle everything over
>> one spot. Linked lists are much better here... at least they are when
>> you already have the cell you want to remove... so we can't expect
>> that we can just rewrite List to use arrays instead.
>
> This actually depends on just how you delete the item out of the array,
> though there's a trade-off there.  If you "delete" the item by marking
> it as "dead" then you don't need to re-shuffle everything, but, of
> course, then you have to make sure to 'skip' over those by running down
> the array for each list_nth() call- but here's the thing, with a small
> list that's all in a tighly packed array, that might not be too bad.
> Certainly far better than having to grab random memory on each step and
> I'm guessing you'd only actually store the "data" item for a given list
> member in the array if it's a integer/oid/etc list, not when it's
> actually a pointer being stored in the list, so you're always going to
> be working with pretty tightly packed arrays where scanning the list on
> the list_nth call really might be darn close to as fast as using an
> offset to the actual entry in the array, unless it's a pretty big list,
> but I don't think we've actually got lots of multi-hundred-deep lists.

I was hoping to have list_nth() be always O(1) so that we'd never be
tempted again to use arrays directly like we are now for things like
simple_rel_array[].
It could probably be made to work quickly in most cases by tracking
whether there are any deleted items: if nothing is deleted, just do the
direct array lookup; if something is deleted, then adjust index n by
the number of deleted entries before the Nth element, tracked in a
bitmapset of deleted slots. bms_num_members() could be made faster with
64-bit bitmap words and __builtin_popcountl(), falling back on the
number_of_ones[] array when that's not available. It would also require
a bms_num_members_before() function.
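
For example, assuming 64-bit bitmapwords and __builtin_popcountl()
being available, bms_num_members() could become something like this
(sketch):

int
bms_num_members(const Bitmapset *a)
{
    int result = 0;
    int wordnum;

    if (a == NULL)
        return 0;

    for (wordnum = 0; wordnum < a->nwords; wordnum++)
        result += __builtin_popcountl(a->words[wordnum]);

    return result;
}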

In the implementation that I attached, I showed an example of how to
delete items out of an array list, which I think is quite a bit cleaner
than how we generally do it today with List (e.g. in
reconsider_outer_join_clauses()). However, it'll still perform badly
when just a single known item is being removed.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] ArrayLists instead of List (for some things)

2017-11-02 Thread David Rowley
On 3 November 2017 at 03:38, Tom Lane <t...@sss.pgh.pa.us> wrote:
> David Rowley <david.row...@2ndquadrant.com> writes:
>> On 3 November 2017 at 03:17, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> We've jacked up the List API and driven a new implementation underneath
>>> once before.  Maybe it's time to do that again.
>
>> Maybe, but the new implementation is not going to do well with places
>> where we perform lcons(). Probably many of those places could be
>> changed to lappend(), but I bet there's plenty that need prepend.
>
> [ shrug... ] To me, that means this implementation isn't necessarily
> the right solution.

True. Of course, that could probably be worked around with some bit
flags to record whether lcons() was called, and then consuming the
elements in reverse. We might have to shuffle things along a bit for
Lists that get both lcons() and lappend() calls.

The API break that I can't see a way around is:

 /* does the list have 0 elements? */
if (list == NIL) ...

That would need to be rewritten as:

if (list_length(list) == 0) ...

> It seems to me that a whole lot of the complaints about this could be
> resolved simply by improving the List infrastructure to allocate ListCells
> in batches.  That would address the question of "too much palloc traffic"
> and greatly improve the it-accesses-all-over-memory situation too.
>
> Possibly there are more aggressive changes that could be implemented
> without breaking too many places, but I think it would be useful to
> start there and see what it buys us.

Yeah, that would certainly be a good way of determining whether the change is worthwhile.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] ArrayLists instead of List (for some things)

2017-11-02 Thread David Rowley
On 3 November 2017 at 03:17, Tom Lane <t...@sss.pgh.pa.us> wrote:
> We've jacked up the List API and driven a new implementation underneath
> once before.  Maybe it's time to do that again.

Maybe, but the new implementation is not going to do well with places
where we perform lcons(). Probably many of those places could be
changed to lappend(), but I bet there's plenty that need prepend.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] ArrayLists instead of List (for some things)

2017-11-02 Thread David Rowley
Hackers,

Our List implementation internally uses linked lists, which are
certainly good for some things, but pretty bad at others. Linked
lists are pretty bad when you want to fetch the Nth element, as it
means looping over each previous element until you get to the Nth.
They're also pretty bad for a modern CPU due to their memory access
pattern.

We could get around a few of these problems if Lists were implemented
internally as arrays, however, arrays are pretty bad if we want to
delete an item out the middle as we'd have to shuffle everything over
one spot. Linked lists are much better here... at least they are when
you already have the cell you want to remove... so we can't expect
that we can just rewrite List to use arrays instead.

I've attached a first cut implementation of "AList". I was originally
going to call it "ArrayList", but my fingers were getting tired, so I
changed it.
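
For anyone not opening the attachment, the basic shape of an
array-backed list is roughly the following (a simplified sketch only;
the struct and function bodies in the patch differ):

typedef struct AList
{
    int     length;     /* number of elements in use */
    int     capacity;   /* number of allocated element slots */
    void  **elements;   /* the elements themselves */
} AList;

/* pre-size the element array so later appends needn't repalloc */
static AList *
alist_premake(int capacity)
{
    AList *alist = (AList *) palloc(sizeof(AList));

    alist->length = 0;
    alist->capacity = Max(capacity, 1);
    alist->elements = (void **) palloc(sizeof(void *) * alist->capacity);

    return alist;
}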

This first cut version replaces nothing in the code base with ALists.
This email (and the timing of it) is more of a marker that this is
being worked on. I know in particular that Andres has complained at
least once about List usage in the executor, which I agree is not
great. Looking at the usage of list_nth() is quite scary!

My plans for this include:

* Make targetlists from createplan onwards use ALists. Every time I
look at build_path_tlist() I gawk at the number of palloc calls that
are going on inside those lappend() calls. We already know how many
items will be in the list, so with AList we could just
alist_premake(list_length(path->pathtarget->exprs)) and we'd never
have to palloc() another element for that list again. We do the same
again in various mutator functions in setrefs.c too!

* list_nth usage in adjust_appendrel_attrs_mutator() to speed up
translation between parent and child Vars

* And also, umm, simple_rte_array and simple_rel_array. Well, we'd
still need somewhere to store all the RelOptInfos, but the idea is
that it would be rather nice if we could avoid expanding the entire
inheritance hierarchy of a relation and instead just add the
partitions that we need, rather than adding them all and setting
dummy paths for the ones we're not interested in.

Anyway, please don't debate the usages of the new type here. As for
all the above plans, I admit to not having a full handle on them yet.
I wrote the Array List implementation so I could get a better idea on
how possible each of those would be, so, for now, to prevent any
duplicate work, here it is...

(Including in Andres because I know he's mentioned his dislike for
List in the executor)

Comments on the design are welcome, but I was too late to the
commitfest, so there are other priorities. However, if you have a
strong opinion, feel free to voice it.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


0001-Basic-implementation-of-array-lists-AList.patch
Description: Binary data



[HACKERS] Removing LEFT JOINs in more cases

2017-10-31 Thread David Rowley
Hackers,

Normally we'll only ever remove a LEFT JOIN relation if it's unused
and there's no possibility that the join would cause row duplication.
To check that the join wouldn't cause row duplication we make use of
proofs, such as unique indexes, or, for sub-queries, DISTINCT and
GROUP BY clauses.

There's another case that we don't handle, and it's VERY simple to test for.

Quite simply, it seems we could remove the join in cases such as:

create table t1 (id int primary key);
create table t2 (id int primary key, b int not null);

insert into t2 values(1,1),(2,1);
insert into t1 values(1);

select distinct t1.* from t1 left join t2 on t1.id=t2.b;

and

select t1.id from t1 left join t2 on t1.id=t2.b GROUP BY t1.id;

but not:

select t1.id,count(*) from t1 left join t2 on t1.id=t2.b GROUP BY t1.id;

In the first two cases, the join *can* cause row duplicates, but the
DISTINCT or GROUP BY would filter these out again anyway, so in these
cases we'd not only get the benefit of not joining, but also avoid
having to remove the duplicate rows caused by the join.

Given how simple the code is to support this, it seems to me to be
worth handling.

A patch to do this is attached.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


0001-Support-removing-LEFT-JOINs-with-DISTINCT-GROUP-BY.patch
Description: Binary data



Re: [HACKERS] An unlikely() experiment

2017-10-30 Thread David Rowley
On 30 October 2017 at 22:44, Andres Freund <and...@anarazel.de> wrote:
> On 2017-10-30 22:39:01 +1300, David Rowley wrote:
>> Today I was thinking, to get around that issue, we might be able to
>> generate another thin wrapper around elog_start() and mark that as
>> __attribute__((cold)) and fudge the macro a bit to call that function
>> instead if it can detect a compile time const level >= ERROR. I've not
>> looked at the code again to remind myself if that would be possible.
>
> Yes, that's what I was thinking too. Add a elog_fatal() wrapping
> elog_finish(), and move the if (__builtin_constant_p(elevel)) branch a
> bit earlier, and that should work.  Similar with errstart_fatal() for
> ereport().

This may have been too good to be true. I can't seem to get it to work,
and I think it's because the function call is inside the do{}while(0)
and inside the if (__builtin_constant_p(elevel) && (elevel) >= ERROR)
branch, which appears to mean that:

"The paths leading to call of cold functions within code are marked as
unlikely by the branch prediction mechanism" [1]

applies to the branch within the macro, rather than to the path in the
calling function that contains the macro, as we might have hoped.

I can get the assembly to change if I put an unlikely() around the
condition or if I go and vandalize the macro to become:

#define elog(elevel, ...) \
elog_start_error(__FILE__, __LINE__, PG_FUNCNAME_MACRO)

[1] https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Function-Attributes.html



-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] An unlikely() experiment

2017-10-30 Thread David Rowley
On 30 October 2017 at 22:34, Andres Freund <and...@anarazel.de> wrote:
> Hi,
>
> On 2015-12-20 02:49:13 +1300, David Rowley wrote:
>> Alternatively, if there was some way to mark the path as cold from within
>> the path, rather than from the if() condition before the path, then we
>> could perhaps do something in the elog() macro instead. I just couldn't
>> figure out a way to do this.
>
> I just was thinking about this, and it turns out that
> __attribute__((cold)) does what we need here. We could just mark
> elog_start() and errstart() as cold, and afaict that should do the
> trick.

I had actually been thinking about this again today. I had previously
thought __attribute__((cold)) would not be a good idea, since if we
mark elog_start() with that, then code paths with elog(NOTICE) and
elog(LOG), not to mention DEBUG, would be marked as cold. Today I was
thinking that, to get around that issue, we might be able to generate
another thin wrapper around elog_start(), mark that as
__attribute__((cold)), and fudge the macro a bit to call that function
instead when it can detect a compile-time constant level >= ERROR.
I've not looked at the code again to remind myself whether that would
be possible.
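
Roughly what I mean, as an untested sketch:

static void __attribute__((cold)) __attribute__((noinline))
elog_start_error(const char *filename, int lineno, const char *funcname)
{
    elog_start(filename, lineno, funcname);
}

#define elog(elevel, ...) \
    do { \
        if (__builtin_constant_p(elevel) && (elevel) >= ERROR) \
            elog_start_error(__FILE__, __LINE__, PG_FUNCNAME_MACRO); \
        else \
            elog_start(__FILE__, __LINE__, PG_FUNCNAME_MACRO); \
        elog_finish(elevel, __VA_ARGS__); \
    } while(0)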

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Removing [Merge]Append nodes which contain a single subpath

2017-10-26 Thread David Rowley
On 27 October 2017 at 01:44, Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
> I think Antonin has a point. The join processing may deem some base
> relations dummy (See populate_joinrel_with_paths()). So an Append
> relation which had multiple child alive at the end of
> set_append_rel_size() might ultimately have only one child after
> partition-wise join has worked on it.

Hmm, I see. Partition-wise join is able to eliminate partitions
on the other side of the join that can't be matched to:

# set enable_partition_wise_join=on;
SET
# explain select count(*) from ab ab1 inner join ab ab2 on ab1.a=ab2.a
where ab1.a between 1 and 2 and ab2.a between 1 and 10001;
   QUERY PLAN

 Aggregate  (cost=0.00..0.01 rows=1 width=8)
   ->  Result  (cost=0.00..0.00 rows=0 width=0)
 One-Time Filter: false
(3 rows)

# \d+ ab
Table "public.ab"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats
target | Description
+-+---+--+-+-+--+-
 a  | integer |   |  | | plain   |  |
 b  | integer |   |  | | plain   |  |
Partition key: RANGE (a)
Partitions: ab_p1 FOR VALUES FROM (1) TO (1),
ab_p2 FOR VALUES FROM (10000) TO (2)



-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Removing [Merge]Append nodes which contain a single subpath

2017-10-26 Thread David Rowley
On 26 October 2017 at 23:42, Antonin Houska <a...@cybertec.at> wrote:
> David Rowley <david.row...@2ndquadrant.com> wrote:
>
>> Method 1:
>>
>> In set_append_rel_size() detect when just a single subpath would be
>> added to the Append path.
>
> I spent some time reviewing the partition-wise join patch during the last CF
> and have an impression that set_append_rel_size() is called too early to find
> out how many child paths will eventually exist if Append represents a join of
> a partitioned table. I think the partition matching takes place at later stage
> and that determines the actual number of partitions, and therefore the number
> of Append subpaths.

I understand that we must wait until set_append_rel_size() is being
called on the RELOPT_BASEREL since partitions may themselves be
partitioned tables and we'll only know what's left after we've
recursively checked all partitions of the baserel.

I've not studied the partition-wise join code yet, but if we've
eliminated all but one partition in set_append_rel_size() then I
imagine we'd just need to ensure partition-wise join is not attempted
since we'd be joining directly to the only partition anyway.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Removing [Merge]Append nodes which contain a single subpath

2017-10-26 Thread David Rowley
On 26 October 2017 at 23:30, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Oct 25, 2017 at 11:59 PM, David Rowley
> <david.row...@2ndquadrant.com> wrote:
>> As of today, because we include this needless [Merge]Append node, we
>> cannot parallelise scans below the Append.
>
> Without disputing your general notion that singe-child Append or
> MergeAppend nodes are a pointless waste, how does a such a needless
> node prevent parallelizing scans beneath it?

Hmm, I think I was wrong about that. I had been looking over the
regression test diffs after having made tenk1 a partitioned table with
a single partition containing all the rows, but it looks like I read
the diff backwards: the planner actually parallelised the Append
version, not the non-Append version as I had previously thought.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Removing [Merge]Append nodes which contain a single subpath

2017-10-25 Thread David Rowley
It seems like a good idea for the planner not to generate Append or
MergeAppend paths when the node contains just a single subpath. If we
were able to remove these then the planner would have more flexibility
to build a better plan.

As of today, because we include this needless [Merge]Append node, we
cannot parallelise scans below the Append. We must also include a
Materialize node in Merge Joins, since neither MergeAppend nor Append
supports mark and restore. Also, as shown in [1], there's an overhead
to pulling tuples through nodes.

I've been looking into resolving this issue, but the ways I've explored
so far seem to be bending the current planner a bit out of shape.

Method 1:

In set_append_rel_size() detect when just a single subpath would be
added to the Append path. Set a promotion_child RelOptInfo field in
the base rel's RelOptInfo, and during make_one_rel, after
set_base_rel_sizes() modify the joinlist to get rid of the parent and
use the child instead.

This pretty much breaks the assumption we have that the final rel will
have all the relids in root->all_baserels. We have an Assert in
make_one_rel() which checks this is true.

There are also complications around set_rel_size() generating paths
for subqueries which may be parameterized paths with the Append parent
required for the parameterization to work.

Method 2:

Invent a ProxyPath concept that allows us to add Paths belonging to
one relation to another relation. The ProxyPaths can have
translation_vars to translate targetlists to reference the correct
Vars. These ProxyPaths could exist up as far as createplan, where we
could perform the translation and build the ProxyPath's subnode
instead.

This method is not exactly very clean either as there are various
places around the planner we'd need to give special treatment to these
ProxyPaths, such as is_projection_capable_path() would need to return
the projection capability of the subpath within the ProxyPath since we
never actually create a "Proxy" node.

Probably either of these two methods could be made to work. I prefer
method 2, since that infrastructure could one day be put towards
scanning a materialized view instead of the relation in order to
satisfy some query. However, method 2 also appears to require some Var
translation in Path targetlists which contain this ProxyPath; either
that, or some global Var replacement would need to be done during
setrefs.

I'm posting here in the hope that it will steer my development of this
in a direction that's acceptable to the community.

Perhaps there is also a "Method 3" which I've not thought about which
would be much cleaner.

[1] 
https://www.postgresql.org/message-id/CAKJS1f9UXdk6ZYyqbJnjFO9a9hyHKGW7B%3DZRh-rxy9qxfPA5Gw%40mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] A handful of typos in allpaths.c

2017-10-17 Thread David Rowley
A small patch to fix these is attached.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


allpath_typos_fix.patch
Description: Binary data



Re: [HACKERS] Extended statistics is not working on Vars hidden under a RelabelType

2017-10-16 Thread David Rowley
On 15 October 2017 at 06:49, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Oct 13, 2017 at 4:49 PM, David Rowley
> <david.row...@2ndquadrant.com> wrote:
>> tps = 8282.481310 (including connections establishing)
>> tps = 8282.750821 (excluding connections establishing)
>
> vs.
>
>> tps = 8520.822410 (including connections establishing)
>> tps = 8521.132784 (excluding connections establishing)
>>
>> With the patch we are making use of the extended statistics, which we
>> do expect to be more work for the planner. Although, we didn't add
>> extended statistics to speed up the planner.
>
> Sure, I understand.  That's actually a pretty substantial regression -
> I guess that means that it's pretty important to avoid creating
> extended statistics that are not needed, at least for short-running
> queries.

To be honest, I ran that on a VM on my laptop. I was getting quite a
bit of noise. I just posted that to show that the 12x slowdown didn't
exist. I don't know what the actual slowdown is. I just know extended
stats are not free and that nobody expected that they ever would be.
The good news is that they're off by default and if the bad ever
outweighs the good then the fix for that starts with "DROP STATISTICS"

I personally think it's great we're starting to see a useful feature
materialise that can help with poor row estimates from the planner.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Discussion on missing optimizations

2017-10-13 Thread David Rowley
On 12 October 2017 at 04:50, Robert Haas <robertmh...@gmail.com> wrote:
> We haven't really done a very good job figuring out what to do about
> optimizations that some people (mostly you) think are marginal.  It's
> obviously true that we can't just burn infinite planner cycles on
> things that don't usually help, but at the same time, we can't just
> keep ignoring requests by users for the same features and saying
> "you'll be sad if we give this to you".  Those people don't give up on
> wanting the optimization; they just don't use PostgreSQL.  I think we
> need to stop just saying "no" and start thinking about what we could
> do that would let us say "yes".

I'm with Robert on this.  Planning time *is* important, but if we were
to do a quick tally on the number of patches that we've seen in the
last few years to improve execution time by either improving the
executor code or adding some smarts to the planner to reduce executor
work vs patches aimed to reduce planning time, I think we'd find the
general focus is on the executor. Personally, I don't recall a single
patch aimed to just speed up the planner. Perhaps I missed some? It
looks like there's plenty we could do in there; it's just that nobody
seems interested enough to go and do it, since everyone who cares about
performance is too busy trying to make execution run faster.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extended statistics is not working on Vars hidden under a RelabelType

2017-10-13 Thread David Rowley
On 14 October 2017 at 09:04, Robert Haas <robertmh...@gmail.com> wrote:
> On Mon, Oct 9, 2017 at 11:03 PM, David Rowley
> <david.row...@2ndquadrant.com> wrote:
>> -- Unpatched
>>  Planning time: 0.184 ms
>>  Execution time: 105.878 ms
>>
>> -- Patched
>>  Planning time: 2.175 ms
>>  Execution time: 106.326 ms
>
> This might not be the best example to show the advantages of the
> patch, honestly.

The focus was on the row estimate. I tried to highlight that by
mentioning "Note rows=1 vs rows=98 in the Gather node". I can't
imagine the test I added would have made the planner about 12 times
slower, but just for the record:

create table ab (a varchar, b varchar);
insert into ab select (x%1000)::varchar, (x%10000)::Varchar from
generate_Series(1,1000000)x;
create statistics ab_a_b_stats (dependencies) on a,b from ab;
vacuum analyze ab;

$ cat a.sql
explain select * from ab where a = '1' and b = '1';

e9ef11ac8bb2acc2d2462fc17ec3291a959589e7 (Patched)

$ pgbench -f a.sql -T 60 -n
transaction type: a.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 496950
latency average = 0.121 ms
tps = 8282.481310 (including connections establishing)
tps = 8282.750821 (excluding connections establishing)

e9ef11ac8bb2acc2d2462fc17ec3291a959589e7~1 (Unpatched)

$ pgbench -f a.sql -T 60 -n
transaction type: a.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 511250
latency average = 0.117 ms
tps = 8520.822410 (including connections establishing)
tps = 8521.132784 (excluding connections establishing)

With the patch we are making use of the extended statistics, which we
do expect to be more work for the planner. Although, we didn't add
extended statistics to speed up the planner.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise aggregation/grouping

2017-10-13 Thread David Rowley
On 13 October 2017 at 19:36, Jeevan Chalke
<jeevan.cha...@enterprisedb.com> wrote:
> I have tried exactly same tests to get to this factor on my local developer
> machine. And with parallelism enabled I got this number as 7.9. However, if
> I disable the parallelism (and I believe David too disabled that), I get
> this number as 1.8. Whereas for 10000 rows, I get this number to 1.7
>
> -- With Gather
> # select current_Setting('cpu_tuple_cost')::float8 / ((10633.56 * (81.035 /
> 72.450) - 10633.56)  / 1000000);
> 7.9
>
> -- Without Gather
> # select current_Setting('cpu_tuple_cost')::float8 / ((16925.01 * (172.838 /
> 131.400) - 16925.01)  / 1000000);
> 1.8
>
> -- With 10000 rows (so no Gather too)
> # select current_Setting('cpu_tuple_cost')::float8 / ((170.01 * (1.919 /
> 1.424) - 170.01)  / 10000);
> 1.7
>
> So it is not so straight forward to come up the correct heuristic here. Thus
> using 50% of cpu_tuple_cost look good to me here.
>
> As suggested by Ashutosh and Robert, attached separate small WIP patch for
> it.

Good to see it stays fairly consistent at different tuple counts, and
is not too far away from what I got on this machine.

I looked over the patch and saw this:

@@ -1800,6 +1827,9 @@ cost_merge_append(Path *path, PlannerInfo *root,
  */
  run_cost += cpu_operator_cost * tuples;

+ /* Add MergeAppend node overhead like we do it for the Append node */
+ run_cost += cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR * tuples;
+
  path->startup_cost = startup_cost + input_startup_cost;
  path->total_cost = startup_cost + run_cost + input_total_cost;
 }

You're doing that right after a comment that says we don't do that. It
also looks like the "run_cost += cpu_operator_cost * tuples" line is
trying to do the same thing, so perhaps it's worth just replacing
that, which by default would double that additional cost, although
doing so would have the planner slightly prefer a MergeAppend over an
Append compared with before.

+#define DEFAULT_APPEND_COST_FACTOR 0.5

I don't really think the DEFAULT_APPEND_COST_FACTOR adds much. It
means very little by itself. It also seems that most of the other cost
functions just use the magic number.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Aggregate transition state merging vs. hypothetical set functions

2017-10-12 Thread David Rowley
On 13 October 2017 at 12:41, Tom Lane <t...@sss.pgh.pa.us> wrote:
> David Rowley <david.row...@2ndquadrant.com> writes:
>> If the user defines their normal aggregate as not safe for merging,
>> then surely it'll not be suitable to be used as a window function
>> either, since the final function will also be called there multiple
>> times per state.
>
> Yeah, we would probably also want to check the flag in nodeWindowAgg.
> Not sure exactly how that should play out --- maybe we end up with
> a tri-valued property "works as normal agg without merging, works
> as normal agg with merging, works as window agg".  But this would
> arguably be an improvement over the current situation.  Right now
> I'm sure there are user-written aggs out there that will just crash
> if used as a window agg, and the authors don't have much choice because
> the performance costs of not modifying the transition state in the
> finalfn are higher than they're willing to bear.  At least with a
> flag they could ensure that the case will fail cleanly.

Hmm, maybe I'm lacking imagination here, but surely the final function
is either destructive or it's not? I can't see what the difference is
between nodeAgg.c calling the finalfn multiple times on the same state
and nodeWindowAgg.c doing it. Maybe there's something I'm
not accounting for that you are?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Aggregate transition state merging vs. hypothetical set functions

2017-10-12 Thread David Rowley
On 13 October 2017 at 12:08, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Therefore, I think we need to bite the bullet and provide an aggregate
> property (CREATE AGGREGATE argument / pg_aggregate column) that tells
> whether the aggregate supports transition state merging.  Likely this
> should have been in the state-merging patch to begin with, but better
> late than never.
>
> The main thing that probably has to be hashed out before we can write
> that patch is what the default should be for user-created aggregates.
> I am inclined to think that we should err on the side of safety and
> default it to false (no merge support).  You could argue that the
> lack of complaints since 9.6 came out is sufficient evidence that
> defaulting to true would be all right, but I'm not sure.

Are you considering this as an option only for ordered-set
aggregates, or for all aggregates?

If the user defines their normal aggregate as not safe for merging,
then surely it'll not be suitable to be used as a window function
either, since the final function will also be called there multiple
times per state.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extended statistics is not working on Vars hidden under a RelabelType

2017-10-12 Thread David Rowley
On 13 October 2017 at 04:56, Alvaro Herrera <alvhe...@alvh.no-ip.org> wrote:
> I pushed your original fix.

Thanks for committing

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Discussion on missing optimizations

2017-10-12 Thread David Rowley
On 13 October 2017 at 03:00, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Laurenz Albe <laurenz.a...@cybertec.at> writes:
>> Robert Haas wrote:
>>> One trick that some system use is avoid replanning as much as we do
>>> by, for example, saving plans in a shared cache and reusing them even
>>> in other sessions.  That's hard to do in our architecture because the
>>> controlling GUCs can be different in every session and there's not
>>> even any explicit labeling of which GUCs control planner behavior. But
>>> if you had it, then extra planning cycles would be, perhaps, more
>>> tolerable.
>
>> From my experience with Oracle I would say that that is a can of worms.
>
> Yeah, I'm pretty suspicious of the idea too.  We've had an awful lot of
> bad experience with local plan caching, to the point where people wonder
> why we don't just auto-replan every time.  How would a shared cache
> make that better?  (Even assuming it was otherwise free, which it
> surely won't be.)

One idea I had when working on unique joins was that we could use it
to determine if a plan could only return 0 or 1 row, by additionally
testing the baserestrictinfo of each rel to see if an equality OpExpr
with a Const matches a unique constraint, which would serve as a
guarantee that only 0 or 1 row would match. I thought that these
single-row plans could be useful as a start for some pro-active plan
cache. The plans here should be stable as they're not prone to any data
skew that could cause a plan change. You might think very few plans
would fit the bill, but you'd likely find that the majority
of UPDATE and DELETE queries run in an OLTP application would be
eligible, and likely some decent percentage of SELECTs too.

I had imagined this would be some backend local cache that saved MRU
plans up to some size of memory defined by a GUC, where 0 would
disable the feature. I never got any further than those thoughts.
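
To illustrate the sort of statement I have in mind (a made-up table;
this is just a sketch, not something from any patch):

create table orders (order_id int primary key, status text);

-- "order_id = 1001" is an equality against a unique key, so the plan
-- can only ever touch 0 or 1 row and isn't prone to plan changes
-- caused by data skew; statements shaped like this would be eligible.
update orders set status = 'shipped' where order_id = 1001;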

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extended statistics is not working on Vars hidden under a RelabelType

2017-10-12 Thread David Rowley
On 13 October 2017 at 02:17, Alvaro Herrera <alvhe...@alvh.no-ip.org> wrote:
> I propose this slightly larger change.

Hmm, this is not right. You're not checking that there's a Var below
the RelabelType.

I tried with:

explain select * from ab where (a||a)::varchar = '' and b = '';

and your code assumed the OpExpr's argument was a Var.

The reason Tomas coded it the way it was coded is due to the fact that
there's already code that works exactly the same way in
clauselist_selectivity(). Personally, I don't particularly like that
code, but I'd rather not invent a new way to do the same thing.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise aggregation/grouping

2017-10-10 Thread David Rowley
On 10 October 2017 at 17:57, Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
> Append node just returns the result of ExecProcNode(). Charging
> cpu_tuple_cost may make it too expensive. In other places where we
> charge cpu_tuple_cost there's some processing done to the tuple like
> ExecStoreTuple() in SeqNext(). May be we need some other measure for
> Append's processing of the tuple.

I don't think there's any need to invent any new GUC. You could just
divide cpu_tuple_cost by something.

I did a quick benchmark on my laptop to see how much Append really
costs, and with the standard costs the actual cost seems to be about
cpu_tuple_cost / 2.4, so cpu_tuple_cost / 2 would probably be
realistic. create_set_projection_path() does something similar, and
brincostestimate() does some similar magic, applying 0.1 *
cpu_operator_cost to the total cost.

# create table p (a int, b int);
# create table p1 () inherits (p);
# insert into p1 select generate_series(1,1000000);
# vacuum analyze p1;
# \q
$ echo "select count(*) from p1;" > p1.sql
$ echo "select count(*) from p;" > p.sql
$ pgbench -T 60 -f p1.sql -n

latency average = 58.567 ms

$ pgbench -T 60 -f p.sql -n
latency average = 72.984 ms

$ psql
psql (11devel)
Type "help" for help.

# -- check the cost of the plan.
# explain select count(*) from p1;
QUERY PLAN
--
 Aggregate  (cost=16925.00..16925.01 rows=1 width=8)
   ->  Seq Scan on p1  (cost=0.00..14425.00 rows=1000000 width=0)
(2 rows)

# -- selecting from the parent is the same due to zero Append cost.
# explain select count(*) from p;
   QUERY PLAN

 Aggregate  (cost=16925.00..16925.01 rows=1 width=8)
   ->  Append  (cost=0.00..14425.00 rows=1000001 width=0)
 ->  Seq Scan on p  (cost=0.00..0.00 rows=1 width=0)
 ->  Seq Scan on p1  (cost=0.00..14425.00 rows=1000000 width=0)
(4 rows)

# -- extrapolate the additional time taken for the Append scan and work out
# -- what the planner should add to the plan's cost, then divide by the number
# -- of rows in p1 to work out the tuple cost of pulling a row through the
# -- append.
# select (16925.01 * (72.984 / 58.567) - 16925.01)  / 1000000;
?column?

 0.00416630302337493743
(1 row)

# show cpu_tuple_cost;
 cpu_tuple_cost

 0.01
(1 row)

# -- How does that compare to the cpu_tuple_cost?
# select current_Setting('cpu_tuple_cost')::float8 / 0.00416630302337493743;
?column?

 2.400209476818
(1 row)

Maybe it's worth trying with different row counts to see if the
additional cost is consistent, but it's probably not worth being too
critical here.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Extended statistics is not working on Vars hidden under a RelabelType

2017-10-09 Thread David Rowley
Basically, $subject is causing us not to properly find matching
extended stats in this case.

The attached patch fixes it.

The following test case is an example of the misbehaviour. Note
rows=1 vs rows=98 in the Gather node.

create table ab (a varchar, b varchar);
insert into ab select (x%1000)::varchar, (x%10000)::Varchar from
generate_Series(1,1000000)x;
create statistics ab_a_b_stats (dependencies) on a,b from ab;
analyze ab;

-- Unpatched
explain analyze select * from ab where a = '1' and b = '1';
   QUERY PLAN
-
 Gather  (cost=1000.00..12466.10 rows=1 width=7) (actual
time=0.441..90.515 rows=100 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on ab  (cost=0.00..11466.00 rows=1 width=7)
(actual time=1.081..74.944 rows=33 loops=3)
 Filter: (((a)::text = '1'::text) AND ((b)::text = '1'::text))
 Rows Removed by Filter: 333300
 Planning time: 0.184 ms
 Execution time: 105.878 ms
(8 rows)

-- Patched
explain analyze select * from ab where a = '1' and b = '1';
QUERY PLAN
--
 Gather  (cost=1000.00..12475.80 rows=98 width=7) (actual
time=1.076..92.595 rows=100 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on ab  (cost=0.00..11466.00 rows=41 width=7)
(actual time=0.491..77.833 rows=33 loops=3)
 Filter: (((a)::text = '1'::text) AND ((b)::text = '1'::text))
 Rows Removed by Filter: 333300
 Planning time: 2.175 ms
 Execution time: 106.326 ms
(8 rows)

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


allow_relabelled_vars_in_dependency_stats.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partition-wise aggregation/grouping

2017-10-09 Thread David Rowley
On 10 October 2017 at 01:10, Jeevan Chalke
<jeevan.cha...@enterprisedb.com> wrote:
> Attached new patch set having HEAD at 84ad4b0 with all these review points
> fixed. Let me know if I missed any thanks.

I've only really skimmed over this thread and only opened the code
enough to extract the following:

+ /* Multiply the costs by partition_wise_agg_cost_factor. */
+ apath->startup_cost *= partition_wise_agg_cost_factor;
+ apath->total_cost *= partition_wise_agg_cost_factor;

I've not studied how all the path plumbing is done, but I think
instead of doing this costing magic we should really stop pretending
that Append/MergeAppend nodes are cost-free. I think something like
charging cpu_tuple_cost per row expected through Append/MergeAppend
would be a better approach to this.

If you perform grouping or partial grouping before the Append, then in
most cases the Append will receive fewer rows, so it will come out
cheaper than if you perform the grouping after it. I've not learned the
partition-wise join code enough to know if this is going to affect
that too, but for everything else, there should be no plan change,
since there's normally no alternative paths. I see there's even a
comment in create_append_path() which claims the zero cost is a bit
optimistic.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Discussion on missing optimizations

2017-10-08 Thread David Rowley
On 9 October 2017 at 17:41, David Rowley <david.row...@2ndquadrant.com> wrote:
> Thoughts?

Actually, I was a little inconsistent with my List NULL/NIL checks in
that last one.

I've attached an updated patch.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


remove_left_join_distinct_v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Discussion on missing optimizations

2017-10-08 Thread David Rowley
On 7 October 2017 at 14:48, Andres Freund <and...@anarazel.de> wrote:
> 3. JOIN Elimination
>
> There's been a lot of discussion and several patches. There's a bunch of
> problems here, one being that there's cases (during trigger firing,
> before the constraint checks) where foreign keys don't hold true, so we
> can't quite generally make these optimization.  Another concern is
> whether the added plan time is actually worthwhile.

I looked over this and it seems there's some pretty low hanging fruit
in there that we can add with just a handful of lines of new code.

This is the case for LEFT JOINs with a DISTINCT clause. Normally we
can only remove an unused LEFT JOINed relation if there are some
unique properties that ensure that the join does not duplicate any
outer row. We don't need to worry about this when there's a DISTINCT
clause, as the DISTINCT would remove any duplicates anyway. If I'm not
mistaken, we also don't need to bother looking at the actual distinct
clause's exprs since we'll already know that nothing is in there
regarding the relation being removed. The benefit to this could be
two-fold, as 1) we don't need to join to the unused relation and 2) we
don't need to remove any row duplication caused by the join.
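
To sketch the shape of query I mean (hypothetical tables; note that
"detail" deliberately has no unique index on the join column):

create table main (id int primary key, ref_id int);
create table detail (ref_id int, payload text);

-- "detail" contributes no columns to the result. Without DISTINCT the
-- join can't be removed, since detail.ref_id isn't provably unique,
-- but with DISTINCT any duplication it causes is removed anyway.
explain
select distinct m.id
from main m
left join detail d on d.ref_id = m.ref_id;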

While someone might argue that this is not going to be that common a
case to hit, if we consider how cheap this is to implement, it does
seem to be worth doing a couple of NULL checks in the planner for it.

The only slight downside I can see to this is that we're letting a few
more cases through rel_supports_distinctness() which is also used for
unique joins, and these proofs are not valid in those. However, it may
not be worth the trouble doing anything about that as relations
without unique indexes are pretty uncommon (at least in my world).

Thoughts?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


remove_left_join_distinct.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Discussion on missing optimizations

2017-10-07 Thread David Rowley
On 7 October 2017 at 15:19, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> 9. Unneeded Self JOIN
>
>> Can't remember discussions of this.
>
> I can't get very excited about that one either.
>
> In the end, what the article fails to consider is that all of these are
> tradeoffs, not unalloyed goods.  If you spend planner cycles on every
> query to look for cases that only the most unabashedly brain-dead ORMs
> ever generate, you're not really doing your users a favor on balance.

I think that final sentence lacks imagination.

I've seen plenty of views being called where some column is
unavailable, but the caller joins the very same table again on the
primary key to add the column. There was no brain-dead ORM involved,
just perhaps questionable design. This was very common in my last job,
where we had some rat's nest of views several layers deep, the core of
which often had some complex logic that nobody dared to try to
replicate.
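
A contrived sketch of the pattern (hypothetical table and view):

create table account (id int primary key, name text, balance numeric);
create view account_names as select id, name from account;

-- The view lacks "balance", so the caller joins account back in on its
-- own primary key just to fetch it; the two scans of account could, in
-- principle, be collapsed into one.
select v.id, v.name, a.balance
from account_names v
join account a on a.id = v.id;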

It would be fairly cheap to check if any of the rtekind==RTE_RELATION
joinlist items have more than one RangeTblEntry with the same relid.  The
joinlist is never that big anyway, and if it were, the join search would
be slow. The more expensive part would be to build the join clauses,
check if the expressions on either side of all OpExprs match and that
nothing else will cause a non-match, then perform the uniqueness check
on those OpExpr operands. We do have some infrastructure to do the
unique checks.  Likely the slowdown in planning would only hit
cases with a legitimately useful self-join; I doubt checking for a
duplicate RangeTblEntry->relid would cause much of a concern.

Anyway, this is giving me some feeling of déjà vu. Perhaps we need
some pg_stats view that shows us planning time and execution time so
that we can get a better idea of how much these things matter in the
average case. We tend not to fare so well in these sorts of
comparisons with commercial databases. It's not hard to imagine
someone with migration plans loading some rat's nest of views into
Postgres, taking it for a spin, and finding performance is not quite
what they need. It's a bit sad that often the people with the loudest
voices are so fast to stomp on ideas for improvements. It
would be much nicer if you'd at least wait for benchmarks before
shooting.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] JIT compiling - v4.0

2017-10-05 Thread David Rowley
On 5 October 2017 at 19:57, Andres Freund <and...@anarazel.de> wrote:
> Here's some numbers for a a TPC-H scale 5 run. Obviously the Q01 numbers
> are pretty nice in partcular. But it's also visible that the shorter
> query can loose, which is largely due to the JIT overhead - that can be
> ameliorated to some degree, but JITing obviously isn't always going to
> be a win.

It's pretty exciting to see thing being worked on.

I've not looked at the code, but I'm thinking: could you not just JIT
if the total cost of the plan is estimated to be > X, where X is some
JIT threshold GUC?


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] path toward faster partition pruning

2017-09-28 Thread David Rowley
On 27 September 2017 at 14:22, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> - 0001 includes refactoring that Dilip proposed upthread [1] (added him as
>   an author).  I slightly tweaked his patch -- renamed the function
>   get_matching_clause to match_clauses_to_partkey, similar to
>   match_clauses_to_index.

Hi Amit,

I've made a pass over the 0001 patch while trying to get myself up to
speed with the various developments that are going on in partitioning
right now.

These are just my thoughts from reading over the patch. I understand
that there's quite a bit up in the air right now about how all this is
going to work, but I have some thoughts about that too, which I
wouldn't mind some feedback on as my line of thought may be off.

Anyway, I will start with some small typos that I noticed while reading:

partition.c:

+ * telling what kind of NullTest has been applies to the corresponding

"applies" should be "applied"

plancat.c:

* might've occurred to satisfy the query.  Rest of the fields are set in

"Rest of the" should be "The remaining"

And onto more serious stuff:

allpath.c:

get_rel_partitions()

I wonder if this function belongs in partition.c instead. I'd have
expected a function to exist per partition type; RANGE and LIST, I'd
imagine, should each have their own function in partition.c to eliminate
partitions
which cannot possibly match, and return the remainder. I see people
speaking of HASH partitioning, but we might one day also want
something like RANDOM or ROUNDROBIN (I'm not really kidding, imagine
routing records to be processed into foreign tables where you never
need to query them again). It would be good if we could easily expand
this list and not have to touch files all over the optimizer to do
that. Of course, there would be other work to do in the executor to
implement any new partitioning method too.

I know there's mention somewhere of get_rel_partitions() not
being so smart about WHERE partkey > 20 AND partkey > 10; the code
does not understand which of those is more restrictive. I think you could
probably follow the same rules here that are done in
eval_const_expressions(). Over there I see that evaluate_function()
will call anything that's not marked as volatile. I'd imagine, for
each partition key, you'd want to store a Datum with the minimum and
maximum possible value based on the quals processed. If either the
minimum or maximum is still set to NULL, then it's unbounded at that
end. If you encounter partkey = Const, then minimum and maximum can be
set to the value of that Const. From there you can likely ignore any
other quals for that partition key, as if the query did contain
another qual with partkey = SomeOtherConst, then that would have
become a gating qual during the constant folding process. This way if
the user had written WHERE partkey >= 1 AND partkey <= 1 the
evaluation would end up the same as if they'd written WHERE partkey =
1 as the minimum and maximum would be the same value in both cases,
and when those two values are the same then it would mean just one
theoretical binary search on a partition range to find the correct
partition instead of two.
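
As a rough illustration of that equivalence (a hypothetical
range-partitioned table, using the PG10 syntax):

create table measurements (ts int, val numeric) partition by range (ts);
create table measurements_1 partition of measurements
  for values from (1) to (100);
create table measurements_2 partition of measurements
  for values from (100) to (200);

-- Under the min/max scheme above, both of these derive the same bounds
-- (minimum = maximum = 5), so both should prune to measurements_1 with
-- a single binary search over the partition bounds.
explain select * from measurements where ts >= 5 and ts <= 5;
explain select * from measurements where ts = 5;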

I see in get_partitions_for_keys you've crafted the function to return
something to identify which partitions need to be scanned. I think it
would be nice to see a special element in the partition array for the
NULL and DEFAULT partition. I imagine 0 and 1, but obviously, these
would be defined constants somewhere. The signature of that function
is not so pretty and that would likely tidy it up a bit. The matching
partition indexes could be returned as a Bitmapset, yet, I don't
really see any code which handles adding the NULL and DEFAULT
partition in get_rel_partitions() either, maybe I've just not looked
hard enough yet...

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] path toward faster partition pruning

2017-09-26 Thread David Rowley
On 25 September 2017 at 23:04, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> By the way, I'm now rebasing these patches on top of [1] and will try to
> merge your refactoring patch in some appropriate way.  Will post more
> tomorrow.
>
> [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9140cf826

Yeah, I see 0001 conflicts with that. I'm going to set this to waiting
on author while you're busy rebasing this.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] Remove useless joins (VARCHAR vs TEXT)

2017-09-16 Thread David Rowley
On 17 September 2017 at 08:07, Kim Rose Carlsen <k...@hiper.dk> wrote:
> It seems there are some difference in VARCHAR vs TEXT when postgres tries to
> decide if a LEFT JOIN is useful or not. I can't figure out if this is
> intentional because there are some difference between TEXT and VARCHAR that
> I dont know about or if it's a bug.
>
>
> I would expect both examples to produce same query plan
>
>
> a)
>
> create table a (id varchar primary key);
> create table b (id varchar primary key);
>
> explain   select a.*
>  from a
> left join (select distinct id from b) as b
>on a.id = b.id;
>
>
> QUERY PLAN
> --
>  Hash Right Join  (cost=67.60..113.50 rows=1360 width=32)
>Hash Cond: ((b.id)::text = (a.id)::text)
>->  HashAggregate  (cost=27.00..40.60 rows=1360 width=32)
>  Group Key: b.id
>  ->  Seq Scan on b  (cost=0.00..23.60 rows=1360 width=32)
>->  Hash  (cost=23.60..23.60 rows=1360 width=32)
>  ->  Seq Scan on a  (cost=0.00..23.60 rows=1360 width=32)
> (7 rows)
>
> b)
>
> create table a (id text primary key);
>
> create table b (id text primary key);
>
> explain   select a.*
>  from a
> left join (select distinct id from b) as b
>on a.id = b.id;
>
>   QUERY PLAN
> --
>  Seq Scan on a  (cost=0.00..23.60 rows=1360 width=32)

Yeah, it looks like the code to check for distinctness in the subquery
fails to consider that the join condition may contain RelabelTypes
instead of plain Vars.

The join would be removed if you'd written:

explain select a.* from a left join b on a.id = b.id;

so really the subquery version should be too.

I'm undecided if this should be classed as a bug or just a missed
optimisation. Certainly, the original code should have done this, so
I'm leaning slightly towards this being a bug.

The attached fixes.

(CC'd -hackers since we're starting to discuss code changes. Further
discussion which includes -hackers should drop the general list)

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


join_removal_subquery_fix.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] path toward faster partition pruning

2017-09-14 Thread David Rowley
On 21 August 2017 at 18:37, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
> I've been working on implementing a way to perform plan-time
> partition-pruning that is hopefully faster than the current method of
> using constraint exclusion to prune each of the potentially many
> partitions one-by-one.  It's not fully cooked yet though.

I'm interested in seeing improvements in this area, so I've put my
name down to review this.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Making clausesel.c Smarter

2017-09-05 Thread David Rowley
On 6 September 2017 at 00:43, Daniel Gustafsson <dan...@yesql.se> wrote:
> This patch was moved to the currently open Commitfest.  Given the above
> comment, is the last patch in this thread still up for review, or are you
> working on a new version?  Just to avoid anyone reviewing an outdated version.

Hi Daniel,

I've not had time to work on a new version yet and I can't imagine
that I will for quite some time.

The idea I had to remove the quadratic search on the list was to add
to or modify equal() in equalfuncs.c to have it return -1, 0, +1 and
use that as a comparison function in a binary search tree. The Btree
would complement List as a way of storing Nodes in no particular
order, but in a way that a Node can be found quickly. There's likely
more than a handful of places in the planner that would benefit from
this already.

It's not a 5-minute patch and it probably would see some resistance,
so won't have time to look at this soon.

If the possibility of this increasing planning time in corner cases is
going to be a problem, then it might be best to return this with
feedback for now and I'll resubmit if I get time later in the cycle.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CONNECTION LIMIT and Parallel Query don't play well together

2017-08-25 Thread David Rowley
On 24 August 2017 at 11:15, Peter Eisentraut
<peter.eisentr...@2ndquadrant.com> wrote:
> Here is a slightly updated patch for consideration in the upcoming
> commit fest.

Hi Peter,

I just had a quick glance over this and wondered about 2 things.

1. Why a GUC and not a new per-user option so it can be configured
differently for different users? Something like ALTER USER ... WORKER
LIMIT ; perhaps. I mentioned this up-thread a bit.

2.

+ if (count > max_worker_processes_per_user)
+ {
+ ereport(LOG,
+ (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+ errmsg("too many worker processes for role \"%s\"",
+ GetUserNameFromId(GetUserId(), false;
+ LWLockRelease(BackgroundWorkerLock);
+ return false;

Unless I've misunderstood something, it seems that this is going to
give users random errors which might only occur when they run
queries against larger tables. Part of why it made sense not to count
workers towards the CONNECTION LIMIT was that we didn't want to throw
these random errors when workers could not be obtained, given that we
take precautions in other places to just silently run with fewer
workers. There's lots of discussion earlier in this thread about this,
and I don't think anyone was in favour of queries only randomly
working.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Atomics for heap_parallelscan_nextpage()

2017-08-16 Thread David Rowley
On 17 August 2017 at 01:20, Heikki Linnakangas <hlinn...@iki.fi> wrote:
> Looks reasonable. I edited the comments and the variable names a bit, to my
> liking, and committed. Thanks!

Thanks for committing. I've just been catching up on all that went on
while I was sleeping. Thanks for handling the cleanup too.

I'll feel better once pademelon goes green again. From looking at the
latest failure on it, it appears that your swapping of
pg_atomic_write_u64 for pg_atomic_init_u64 should fix this. My
apologies for that mistake.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixup some misusage of appendStringInfo and friends

2017-08-15 Thread David Rowley
On 16 August 2017 at 15:38, Peter Eisentraut
<peter.eisentr...@2ndquadrant.com> wrote:
> committed

Thanks!

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Regression in join selectivity estimations when using foreign keys

2017-06-20 Thread David Rowley
On 20 June 2017 at 07:49, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I'm not totally satisfied that there isn't any case where the smallest
> selectivity hack is appropriate.  In the example you're showing here,
> the FK columns are independent so that we get more or less the right
> answer with or without the FK.  But in some quick testing I could not
> produce a counterexample proving that that heuristic is helpful;
> so for now let's can it.
>
> Thanks, and sorry again for the delay.

Many thanks for taking the time on this.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Perfomance bug in v10

2017-06-01 Thread David Rowley
On 2 June 2017 at 03:46, Teodor Sigaev <teo...@sigaev.ru> wrote:
> I miss here why could the presence of index influence on that? removing
> index causes a good plan although it isn't used in both plans .

Unique indexes are used as proofs when deciding if a join to the
relation is "inner_unique". A nested loop unique join is costed more
cheaply than a non-unique one since we can skip to the next outer
tuple once we've matched the current outer tuple to an inner tuple. In
theory that's half as many comparisons for a non-parameterised nested
loop.
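
A minimal sketch of the proof involved (made-up tables; the primary key
on it.id is what lets the planner treat the join as inner_unique):

create table ot (id int);
create table it (id int primary key);

-- For a nested loop with ot on the outside, once an ot row finds its
-- match in it, no further it rows need to be checked, which is what
-- the cheaper inner_unique costing reflects.
explain select * from ot join it on it.id = ot.id;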


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Perfomance bug in v10

2017-05-31 Thread David Rowley
On 1 June 2017 at 04:16, Teodor Sigaev <teo...@postgrespro.ru> wrote:
> I found an example where v10 chooses extremely non-optimal plan:
> select
> i::int as a,
> i::int + 1 as b,
> 0 as c
> into t
> from
> generate_series(1,32) as i;
>
> create unique index i on t (c, a);
>
> explain analyze
> SELECT
> t1.a, t1.b,
> t2.a, t2.b,
> t3.a, t3.b,
> t4.a, t4.b,
> t5.a, t5.b,
> t6.a, t6.b
> /*
> ,
> t7.a, t7.b,
> t8.a, t8.b,
> t9.a, t9.b,
> t10.a, t10.b
> */
> FROM t T1
> LEFT OUTER JOIN t T2
> ON T1.b = T2.a AND T2.c = 0
> LEFT OUTER JOIN t T3
> ON T2.b = T3.a AND T3.c = 0
> LEFT OUTER JOIN t T4
> ON T3.b = T4.a AND T4.c = 0
> LEFT OUTER JOIN t T5
> ON T4.b = T5.a AND T5.c = 0
> LEFT OUTER JOIN t T6
> ON T5.b = T6.a AND T6.c = 0
> LEFT OUTER JOIN t T7
> ON T6.b = T7.a AND T7.c = 0
> LEFT OUTER JOIN t T8
> ON T7.b = T8.a AND T8.c = 0
> LEFT OUTER JOIN t T9
> ON T8.b = T9.a AND T9.c = 0
> LEFT OUTER JOIN t T10
> ON T9.b = T10.a AND T10.c = 0
> WHERE T1.c = 0 AND T1.a = 5
> ;

That's pretty unfortunate. It only chooses this plan due to lack of
any useful stats on the table. The planner thinks that a seqscan on t6
with Filter (c = 0) will return 1 row, which is not correct. In the
good plan t1 is the outer rel of the inner most loop. Here the filter
is c = 0 and a = 5, which *does* actually return only 1 row, which
means that all of the other nested loops only execute once, as
predicted.

This is all caused by get_variable_numdistinct() deciding that all
values are distinct because ntuples < DEFAULT_NUM_DISTINCT. I see that
if the example is increased to use 300 tuples instead of 32, then
that's enough for the planner to estimate 2 rows instead of clamping
to 1, and the bad plan does not look so good anymore since the planner
predicts that those nested loops need to be executed more than once.
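
As a rough way to see that clamping (a hypothetical table; the index
build populates reltuples, but nothing is ANALYZEd, mirroring the
original example):

create table small_t (a int, b int, c int);
insert into small_t select i, i + 1, 0 from generate_series(1, 32) i;
create unique index small_t_c_a_idx on small_t (c, a);

-- With only 32 tuples and no pg_statistic entries,
-- get_variable_numdistinct() assumes all values of c are distinct, so
-- the filter below is estimated to match about one row even though
-- every row has c = 0. Repeating with 300 rows avoids the clamp.
explain select * from small_t where c = 0;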

I really think the planner is too inclined to take risks by nesting
nested loops like this, but I'm not all that sure what the best solution
to fix this is, and it's certainly not something for beta1.

So, I'm a bit unsure exactly how best to deal with this.  It seems
like we'd better make some effort, as perhaps this could be a case
that might occur when temp tables are used and not ANALYZED, but the
only way I can think to deal with it is not to favour unique inner
nested loops in the costing model.  The unfortunate thing about not
doing this is that the planner will no longer swap the join order of a
2-way join to put the unique rel on the inner side. This is evident from
the regression test failures caused by patching with the attached,
which changes the cost model for nested loops back to what it was
before unique joins.

My other line of thought is just not to bother doing anything about
this. There are plenty more queries you could handcraft to trick the
planner into generating a plan that'll blow up like this. Is this a
realistic enough one to bother accounting for? Did it come from a
real-world case? If not, how did you stumble upon it?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


dont_reduce_cost_of_inner_unique_nested_loops.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Regression in join selectivity estimations when using foreign keys

2017-05-23 Thread David Rowley
On 22 May 2017 at 16:10, David Rowley <david.row...@2ndquadrant.com> wrote:
> I also just noticed that I don't think I've got ANTI join cases
> correct in the patch I sent. I'll look at that now.

I've attached an updated patch.

This one is much less invasive than my original attempt.

There are two fundamental changes here:

1) OUTER joins now use the foreign key as proof that the join
condition must match.
2) The nullfrac selectivity for NULL-valued columns is no longer taken
into account for OUTER joins. This is now consistent with INNER joins,
which might not be perfect, but it's less surprising. If this is a
problem then we can consider applying something like my 0002 patch
above, however that can mean that nulls are double counted if there
are any other strict clauses which are not part of the foreign key
constraint, so that idea is not perfect either.

In addition to those two things, the poor selectivity estimation in my
original email is also fixed.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


fk_join_est_fix_2017-05-23.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Regression in join selectivity estimations when using foreign keys

2017-05-21 Thread David Rowley
On 21 May 2017 at 07:56, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I'm entirely unconvinced by this patch --- it seems to simply be throwing
> away a lot of logic.  Notably it lobotomizes the FK code altogether for
> semi/antijoin cases, but you've not shown any example that even involves
> such joins, so what's the argument for doing that?  Also, the reason
> we had the use_smallest_selectivity code in the first place was that we
> didn't believe the FK-based selectivity could be applied safely to
> outer-join cases, so simply deciding that it's OK to apply it after all
> seems insufficiently justified.

Originally I looked at just multiplying the selectivities in
use_smallest_selectivity, but on further thought, I couldn't see why we
shouldn't use foreign keys to assist with selectivity estimations when
ref_is_outer is true. We still have a very high probability that
the outer rel contains a tuple to match each inner rel's tuples
(because inner references outer). The only difference between OUTER
and INNER typed joins is that OUTER will produce a bunch of NULL rows
in place of non-matched inner rows. That part seems to be handled in
calc_joinrel_size_estimate() by code such as:

 if (nrows < outer_rows)
nrows = outer_rows;

We could do much better than what we have today by also adding
outer_rows - (n_distinct inner rows on referencing Vars), to also
account for those unmatched rows. However, I'd say that's not for
back-patching, although it may be especially good now that we have
multi-variate ndistinct in PG10.

> Or in short, exhibiting one counterexample to the existing code is not
> a sufficient argument for changing things this much.  You need to give
> an argument why this is the right thing to do instead.
>
> Stepping back a bit, it seems like the core thing the FK selectivity code
> was meant to do was to prevent us from underestimating selectivity of
> multiple-clause join conditions through a false assumption of clause
> independence.  The problem with the use_smallest_selectivity code is that
> it's overestimating selectivity, but that doesn't mean that we want to go
> all the way back to the old way of doing things.  I wonder if we could get
> any traction in these dubious cases by computing the product, instead of
> minimum, of the clause selectivities (thus matching the old estimate) and
> then taking the greater of that and the FK-based selectivity.

My concern with going back to selectivity multiplication was exactly
because I didn't want to go back to the original undesired behaviour
of underestimation when conditions are correlated. I'm unsure why
taking the greater is any better than just using the foreign key
selectivity. It seems senseless to use clause based selectivities at
all when we've proved the foreign key exists -- there will be no
unmatched inner tuples.

The reason I dropped the non-singleton join for SEMI and ANTI is that
I couldn't see how we could make any improvements over the original
join estimation code for this case. Perhaps I'm assuming too much
about how get_foreign_key_join_selectivity() is used by doing this?
I'm assuming that leaving the clause list intact and referring the
whole case to calc_joinrel_size_estimate() is okay to do.

The reason I added the nullfrac code in 0002 was because I was
concerned with regressing OUTER join cases where the nullfrac works due
to using clause_selectivity(). Although this case regressed
between 9.5 and 9.6 for INNER joins with a non-zero nullfrac on the
join condition.

In short: if we replace the use_smallest_selectivity code with fkselec
*= clauselist_selectivity(root, removedlist, 0, jointype, sjinfo);
then I'm worried about regressing back to the old underestimations.
The extended stats code in PG10's version of clauselist_selectivity()
will ignore these join conditions, so nothing can be done to assist
the underestimations by adding extended stats yet.

I also just noticed that I don't think I've got ANTI join cases
correct in the patch I sent. I'll look at that now.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Regression in join selectivity estimations when using foreign keys

2017-05-19 Thread David Rowley
On 18 May 2017 at 20:28, David Rowley <david.row...@2ndquadrant.com> wrote:
> A vastly simplified example case is:
>
> create table fkest (a int, b int, c int unique, primary key(a,b));
> create table fkest1 (a int, b int, primary key(a,b));
>
> insert into fkest select x/10,x%10, x from generate_Series(1,400) x;
> insert into fkest1 select x/10,x%10 from generate_Series(1,400) x;
>
> alter table fkest1 add constraint fkest1_a_b_fkey foreign key (a,b)
> references fkest;
>
> analyze fkest;
> analyze fkest1;
>
> explain (costs on) select * from fkest f
> left join fkest1 f1 on f.a = f1.a and f.b = f1.b
> left join fkest1 f2 on f.a = f2.a and f.b = f2.b
> left join fkest1 f3 on f.a = f3.a and f.b = f3.b
> where f.c = 1;

I should have shown the EXPLAIN ANALYZE of this instead.


QUERY PLAN
--
 Hash Left Join  (cost=24.15..41.89 rows=996 width=36) (actual
time=0.430..0.463 rows=1 loops=1)
   Hash Cond: ((f.a = f3.a) AND (f.b = f3.b))
   ->  Hash Left Join  (cost=12.15..28.36 rows=100 width=28) (actual
time=0.255..0.288 rows=1 loops=1)
 Hash Cond: ((f.a = f2.a) AND (f.b = f2.b))
 ->  Nested Loop Left Join  (cost=0.15..16.21 rows=10
width=20) (actual time=0.046..0.079 rows=1 loops=1)
   ->  Seq Scan on fkest f  (cost=0.00..8.00 rows=1
width=12) (actual time=0.013..0.045 rows=1 loops=1)
 Filter: (c = 1)
 Rows Removed by Filter: 399
   ->  Index Only Scan using fkest1_pkey on fkest1 f1
(cost=0.15..8.17 rows=1 width=8) (actual time=0.031..0.031 rows=1
loops=1)
 Index Cond: ((a = f.a) AND (b = f.b))
 Heap Fetches: 1
 ->  Hash  (cost=6.00..6.00 rows=400 width=8) (actual
time=0.180..0.180 rows=400 loops=1)
   Buckets: 1024  Batches: 1  Memory Usage: 24kB
   ->  Seq Scan on fkest1 f2  (cost=0.00..6.00 rows=400
width=8) (actual time=0.006..0.041 rows=400 loops=1)
   ->  Hash  (cost=6.00..6.00 rows=400 width=8) (actual
time=0.162..0.162 rows=400 loops=1)
 Buckets: 1024  Batches: 1  Memory Usage: 24kB
 ->  Seq Scan on fkest1 f3  (cost=0.00..6.00 rows=400 width=8)
(actual time=0.010..0.040 rows=400 loops=1)
 Planning time: 0.409 ms
 Execution time: 0.513 ms
(19 rows)

in which you can clearly see the poor estimate propagating up the plan tree.

If we add another left join the final estimate is even worse:

explain analyze select * from fkest f
left join fkest1 f1 on f.a = f1.a and f.b = f1.b
left join fkest1 f2 on f.a = f2.a and f.b = f2.b
left join fkest1 f3 on f.a = f3.a and f.b = f3.b
left join fkest1 f4 on f.a = f4.a and f.b = f4.b where f.c = 1;

QUERY PLAN

 Hash Left Join  (cost=36.15..69.06 rows=9915 width=44) (actual
time=0.535..0.569 rows=1 loops=1)
   Hash Cond: ((f.a = f4.a) AND (f.b = f4.b))
   ->  Hash Left Join  (cost=24.15..41.89 rows=996 width=36) (actual
time=0.371..0.404 rows=1 loops=1)
 Hash Cond: ((f.a = f3.a) AND (f.b = f3.b))
 ->  Hash Left Join  (cost=12.15..28.36 rows=100 width=28)
(actual time=0.208..0.241 rows=1 loops=1)
   Hash Cond: ((f.a = f2.a) AND (f.b = f2.b))
   ->  Nested Loop Left Join  (cost=0.15..16.21 rows=10
width=20) (actual time=0.029..0.062 rows=1 loops=1)
 ->  Seq Scan on fkest f  (cost=0.00..8.00 rows=1
width=12) (actual time=0.014..0.047 rows=1 loops=1)
   Filter: (c = 1)
   Rows Removed by Filter: 399
 ->  Index Only Scan using fkest1_pkey on fkest1
f1  (cost=0.15..8.17 rows=1 width=8) (actual time=0.012..0.012 rows=1
loops=1)
   Index Cond: ((a = f.a) AND (b = f.b))
   Heap Fetches: 1
   ->  Hash  (cost=6.00..6.00 rows=400 width=8) (actual
time=0.168..0.168 rows=400 loops=1)
 Buckets: 1024  Batches: 1  Memory Usage: 24kB
 ->  Seq Scan on fkest1 f2  (cost=0.00..6.00
rows=400 width=8) (actual time=0.008..0.043 rows=400 loops=1)
 ->  Hash  (cost=6.00..6.00 rows=400 width=8) (actual
time=0.156..0.156 rows=400 loops=1)
   Buckets: 1024  Batches: 1  Memory Usage: 24kB
   ->  Seq Scan on fkest1 f3  (cost=0.00..6.00 rows=400
width=8) (actual time=0.006..0.035 rows=400 loops=1)
   ->  Hash  (cost=6.00..6.00 rows=400 width=8) (actual
time=0.155..0.155 rows=400 loops=1)
 Buckets: 1024  Batches: 1  Memory Usage: 24kB
 ->  Seq Scan on fkest1 f4  (cost=0.00..6.00 rows=400 width=8)
(actual time=0.004..0.034 rows=400 loops=1)
 Planning time: 0.864 ms

[HACKERS] Regression in join selectivity estimations when using foreign keys

2017-05-18 Thread David Rowley
000 width=4) (actual
time=0.336..0.336 rows=1000 loops=1)
 Buckets: 1024  Batches: 1  Memory Usage: 44kB
 ->  Seq Scan on part p  (cost=0.00..15.00 rows=1000 width=4)
(actual time=0.014..0.132 rows=1000 loops=1)
 Planning time: 0.468 ms
 Execution time: 0.787 ms
(8 rows)

alter table sale drop constraint sale_partid_fkey; -- simulate pre-9.6
behaviour by dropping the fkey

explain analyze select * from part p inner join sale s on p.partid =
s.partid; -- normal join estimations.
QUERY PLAN
---
 Hash Join  (cost=27.50..55.12 rows=1 width=12) (actual
time=0.546..0.546 rows=0 loops=1)
   Hash Cond: (s.partid = p.partid)
   ->  Seq Scan on sale s  (cost=0.00..15.00 rows=1000 width=8)
(actual time=0.014..0.102 rows=1000 loops=1)
   ->  Hash  (cost=15.00..15.00 rows=1000 width=4) (actual
time=0.387..0.387 rows=1000 loops=1)
 Buckets: 1024  Batches: 1  Memory Usage: 44kB
 ->  Seq Scan on part p  (cost=0.00..15.00 rows=1000 width=4)
(actual time=0.009..0.110 rows=1000 loops=1)
 Planning time: 0.278 ms
 Execution time: 0.572 ms

I've attached fixes, based on master, for both of these cases.

Patches:

0001: Is a minimal fix for the reported regression, but may cause
further regressions as it does nothing to account for NULL-valued
columns in outer joins; at least it is consistent with how INNER
joins are estimated regarding NULLs.

0002: Adds further code to apply the nullfrac from pg_statistic.

I've given this a round of testing and I can't find any case where it
gives worse estimates than the standard join estimation as seen when
you drop the foreign key, but I'd like to challenge someone else to
give it a go and see if they can. Not because I'm super confident,
more just because I want this to be correct.

Thoughts?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


0001-Minimal-fix-for-foreign-key-join-estimations.patch
Description: Binary data


0002-Apply-nullfrac-during-foreign-key-join-estimations.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Pulling up more complicated subqueries

2017-05-17 Thread David Rowley
On 18 May 2017 at 04:30, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, May 17, 2017 at 11:08 AM, Heikki Linnakangas <hlinn...@iki.fi> wrote:
>> That's not a straight semi-join, but we could still turn it into a new kind
>> of LEFT-SEMI join. A left-semi join is like a left join, in that it returns
>> all rows from the left side, and NULLs for any non-matches. And like a
>> semi-join, it returns only one matching row from the right-side, if there
>> are duplicates. In the qual, replace the SubLink with an IS NOT NULL test.
> ...
>> This can be implemented using yet another new join type, a LEFT-UNIQUE join.
>> It's like a LEFT JOIN, but it must check that there are no duplicates in the
>> right-hand-side, and throw an error if there are (ERROR:  more than one row
>> returned by a subquery used as an expression).
>
> It seems like we might want to split what is currently called JoinType
> into two separate things -- one that is INNER/LEFT/RIGHT/FULL and the
> other that says what to do about multiple matches, which could be that
> they are expected, they are to be ignored (as in your LEFT-SEMI case),
> or they should error out (as in your LEFT-UNIQUE case).

I just wanted to mention that I almost got sucked down that hole with
unique joins. Instead, I'd recommend following the pattern of passing
bool flags down to the executor from the planner, just like what is
done for inner_unique in, say, make_hashjoin().

I did change unique joins at one point to overload the JoinType, but
it was a much more scary thing to do as there's lots of code in the
planner that does special things based on the join type, and the
footprint of your patch and risk factor start to grow pretty rapidly
once you do that.
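
To sketch what that flag-passing looks like (simplified stand-ins only, not
the real PostgreSQL struct or function definitions):

#include <stdbool.h>

typedef enum JoinType { JOIN_INNER, JOIN_LEFT, JOIN_FULL, JOIN_SEMI, JOIN_ANTI } JoinType;

typedef struct Join
{
    JoinType jointype;      /* the existing set of join types, unchanged */
    bool     inner_unique;  /* planner proved at most one inner match per outer row */
} Join;

/*
 * Executor side: once a matching inner tuple has passed the join quals, a
 * proven-unique inner side lets us skip straight to the next outer tuple,
 * much as a semi-join already does.
 */
static bool
can_skip_to_next_outer(const Join *join)
{
    return join->inner_unique || join->jointype == JOIN_SEMI;
}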

I mention something around this in [1].

[1] 
https://www.postgresql.org/message-id/CAKJS1f_jRki1PQ4X-9UGKa-wnBhECQLnrxCX5haQzu4SDR_r2Q%40mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] CTE inlining

2017-05-15 Thread David Rowley
On 13 May 2017 at 08:39, Bruce Momjian <br...@momjian.us> wrote:
> To summarize, it seems we have two options if we want to add fence
> control to CTEs:
>
> 1.  add INLINE to disable the CTE fence
> 2.  add MATERIALIZE to enable the CTE fence

I think #1 is out of the question.

What would we do in cases like:

WITH INLINE cte AS (SELECT random() a)
SELECT * FROM cte UNION SELECT * FROM cte;

I assume we won't want to inline when the CTE query contains a
volatile function, and we certainly won't in cases like:

WITH INLINE cte AS (DELETE FROM a RETURNING *)
INSERT INTO b SELECT * from cte WHERE cte.value > 5;

We'd be certain to receive complaints from disgruntled users about
"Why is this not inlined when I specified INLINE?"

#2 does not suffer from that.
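
For illustration, inlining the first example would effectively rewrite it
along these lines, which is why the results can change:

-- Hypothetical rewrite showing what inlining would effectively produce:
SELECT * FROM (SELECT random() AS a) cte
UNION
SELECT * FROM (SELECT random() AS a) cte;

-- random() now runs once per reference, so the UNION will almost always
-- return two rows, whereas the fenced (materialized) CTE returns exactly one.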

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Compiler warning in costsize.c

2017-05-07 Thread David Rowley
On 11 April 2017 at 12:53, Michael Paquier <michael.paqu...@gmail.com> wrote:
> On Tue, Apr 11, 2017 at 4:02 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Mon, Apr 10, 2017 at 2:09 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> ilm...@ilmari.org (Dagfinn Ilmari =?utf-8?Q?Manns=C3=A5ker?=) writes:
>>>> Why bother with the 'rte' variable at all if it's only used for the
>>>> Assert()ing the rtekind?
>>>
>>> That was proposed a few messages back.  I don't like it because it makes
>>> these functions look different from the other scan-cost-estimation
>>> functions, and we'd just have to undo the "optimization" if they ever
>>> grow a need to reference the rte for another purpose.
>>
>> I think that's sort of silly, though.  It's a trivial difference,
>> neither likely to confuse anyone nor difficult to undo.
>
> +1. I would just do that and call it a day. There is no point to do a
> mandatory list lookup as that's just for an assertion, and fixing this
> warning does not seem worth the addition of fancier facilities. If the
> function declarations were doubly-nested in the code, I would
> personally consider the use of a variable, but not here.

Any more thoughts on what is acceptable for fixing this? beta1 is
looming and it seems a bit messy to be shipping that with these
warnings, however harmless they are.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)

2017-05-07 Thread David Rowley
On 6 May 2017 at 13:44, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
> In Linux, each process that opens a file gets its own 'file'
> object[1][5].  Each of those has it's own 'file_ra_state'
> object[2][3], used by ondemand_readahead[4] for sequential read
> detection.  So I speculate that page-at-a-time parallel seq scan must
> look like random access to Linux.
>
> In FreeBSD the situation looks similar.  Each process that opens a
> file gets a 'file' object[8] which has members 'f_seqcount' and
> 'f_nextoff'[6].  These are used by the 'sequential_heuristics'
> function[7] which affects the ioflag which UFS/FFS uses to control
> read ahead (see ffs_read).  So I speculate that page-at-a-time
> parallel seq scan must look like random access to FreeBSD too.
>
> In both cases I suspect that if you'd inherited (or sent the file
> descriptor to the other process via obscure tricks), it would actually
> work because they'd have the same 'file' entry, but that's clearly not
> workable for md.c.
>

Interesting!

> Experimentation required...

Indeed. I do remember long discussions on this before Parallel seq
scan went in, but I don't recall if anyone checked any OS kernels to
see what they did.

We really need a machine with good I/O concurrency, and not too much
RAM, to test these things out. It could well be that for a suitably
large table we'd want to scan a whole 1GB extent per worker.

I did post a patch to have heap_parallelscan_nextpage() use atomics
instead of locking over in [1], but I think doing atomics there does
not rule out also adding batching later. In fact, I think it
structures things so batching would be easier than it is today.

[1] 
https://www.postgresql.org/message-id/CAKJS1f9tgsPhqBcoPjv9_KUPZvTLCZ4jy=B=bhqgakn7cyz...@mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)

2017-05-07 Thread David Rowley
On 5 May 2017 at 14:54, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
> Just for fun, check out pages 42 and 43 of Wei Hong's thesis.  He
> worked on Berkeley POSTGRES parallel query and a spin-off called XPRS,
> and they got linear seq scan scaling up to number of spindles:
>
> http://db.cs.berkeley.edu/papers/ERL-M93-28.pdf

That's interesting. I'd no idea that work was done. Actually, I didn't
really know that anyone had thought to have more than one processor
back then :-)

And I also now know the origins of the tenk1 table in the regression
database. Those 10,000 rows were once used for benchmarking! I'm glad
we're all using a couple more rows these days.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Atomics for heap_parallelscan_nextpage()

2017-05-06 Thread David Rowley
Hi,

A while back I did some benchmarking on a big 4-socket machine to
explore a bit around the outer limits of parallel aggregates.  I
discovered along the way that, given enough workers and a simple
enough task, seq-scan workers were held up waiting for the lock
to be released in heap_parallelscan_nextpage().

I've since done a little work in this area to try to improve things. I
ended up posting about it yesterday in [1]. My original patch used
batching to solve the issue; instead of allocating 1 block at a time,
the batching patch allocated a range of 10 blocks for the worker to
process. However, the implementation still needed a bit of work around
reporting sync-scan locations.

Andres mentioned in [2] that it might be worth exploring using atomics
to do the same job. So I went ahead and did that, and came up with the
attached, which is a slight variation on what he mentioned in the
thread.

To keep things a bit simpler and more streamlined, I ended up pulling
out the logic for setting the startblock into another function, which
we only call once, before the first call to
heap_parallelscan_nextpage().  I also ended up replacing phs_cblock
with a counter that always starts at zero. The actual block is
calculated from that counter plus the startblock, modulo nblocks.
This makes it a good bit simpler to detect when we've allocated all
the blocks to the workers, and it also works nicely when wrapping back
to the start of the relation after starting somewhere in the middle
due to piggybacking on a synchronous scan.
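
As a minimal sketch of that scheme (illustrative names only; the real patch
works on the shared parallel heap-scan descriptor using PostgreSQL's atomics
rather than C11's):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct ParallelScanSketch
{
    uint32_t startblock;            /* where the scan began (sync-scan aware) */
    uint32_t nblocks;               /* relation size in blocks */
    _Atomic uint64_t nallocated;    /* blocks handed out so far; starts at 0 */
} ParallelScanSketch;

/* Returns false once every block has been handed out exactly once. */
static bool
next_block(ParallelScanSketch *ps, uint32_t *block)
{
    uint64_t pos = atomic_fetch_add(&ps->nallocated, 1);

    if (pos >= ps->nblocks)
        return false;       /* all blocks allocated; scan is done */

    *block = (uint32_t) ((ps->startblock + pos) % ps->nblocks);
    return true;
}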

Performance:

With parallel_workers=71, it looks something like:

Query 1: 881 GB, ~6 billion row TPC-H lineitem table.

tpch=# select count(*) from lineitem;
   count

 589709
(1 row)

-- Master
Time: 123421.283 ms (02:03.421)
Time: 118895.846 ms (01:58.896)
Time: 118632.546 ms (01:58.633)

-- Atomics patch
Time: 74038.813 ms (01:14.039)
Time: 73166.200 ms (01:13.166)
Time: 72492.338 ms (01:12.492)

-- Batching Patch: Batching 10 pages at a time in heap_parallelscan_nextpage()
Time: 76364.215 ms (01:16.364)
Time: 75808.900 ms (01:15.809)
Time: 74927.756 ms (01:14.928)

Query 2: Single int column table with 2 billion rows.

tpch=# select count(*) from a;
   count

 20
(1 row)

-- Master
Time: 5853.918 ms (00:05.854)
Time: 5925.633 ms (00:05.926)
Time: 5859.223 ms (00:05.859)

-- Atomics patch
Time: 5825.745 ms (00:05.826)
Time: 5849.139 ms (00:05.849)
Time: 5815.818 ms (00:05.816)

-- Batching Patch: Batching 10 pages at a time in heap_parallelscan_nextpage()
Time: 5789.237 ms (00:05.789)
Time: 5837.395 ms (00:05.837)
Time: 5821.492 ms (00:05.821)

I've also attached a text file with the perf report for the lineitem
query. You'll notice that heap_parallelscan_nextpage() is very
visible on master, but not with either of the two patches.

With the 2nd query, heap_parallelscan_nextpage() is fairly insignificant
in master's profile; it only shows up at 0.48%. This is likely because
more tuples are read from each page, and more aggregation work gets
done before the next page is needed. I'm uncertain why I previously
saw a speed-up in this case in [1].

I've also noticed that both the atomics patch and unpatched master do
something that looks a bit weird with synchronous seq-scans. If the
parallel seq-scan piggybacked on another scan, then subsequent
parallel scans will start at the same non-zero block location, even
when no other concurrent scans exist. I'd have expected this to go
back to block 0 again, but maybe I'm just failing to understand the
reason for reporting the startblock to ss_report_location() at the end
of the scan.

I'll now add this to the first commitfest of pg11. I just wanted to
note that I've done this, so that it's less likely someone else goes
and repeats the same work.

[1] 
https://www.postgresql.org/message-id/CAKJS1f-XhfQ2-%3D85wgYo5b3WtEs%3Dys%3D2Rsq%3DNuvnmaV4ZsM1XQ%40mail.gmail.com

[2] 
https://www.postgresql.org/message-id/20170505023646.3uhnmf2hbwtm63lc%40alap3.anarazel.de

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
-- Unpatched select count(*) from lineitem; 71 workers
  Children  Self  Command   Shared Object Symbol
+   99.83% 0.00%  postgres  libpthread-2.17.so[.] __restore_rt
+   99.83% 0.00%  postgres  postgres  [.] sigusr1_handler
+   99.83% 0.00%  postgres  postgres  [.] maybe_start_bgworkers
+   99.83% 0.00%  postgres  postgres  [.] do_start_bgworker
+   99.83% 0.93%  postgres  postgres  [.] ExecProcNode
+   99.83% 0.00%  postgres  postgres  [.] StartBackgroundWorker
+   99.83% 0.00%  postgres  postgres  [.] ParallelWorkerMain
+   99.83% 0.00%  postgres  postgres  [.] ParallelQueryMain
+   99.83% 0.00%  postgres  postgres  [.] ExecutorRun
+   99.83% 

Re: [HACKERS] CTE inlining

2017-05-05 Thread David Rowley
On 5 May 2017 at 14:04, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Craig Ringer <craig.rin...@2ndquadrant.com> writes:
>> We're carefully maintaining this bizarre cognitive dissonance where we
>> justify the need for using this as a planner hint at the same time as
>> denying that we have a hint. That makes it hard to make progress here.
>> I think there's fear that we're setting some kind of precedent by
>> admitting what we already have.
>
> I think you're overstating the case.  It's clear that there's a
> significant subset of CTE functionality where there has to be an
> optimization fence.  The initial implementation basically took the
> easy way out by deeming *all* CTEs to be optimization fences.  Maybe
> we shouldn't have documented that behavior, but we did.  Now we're
> arguing about how much of a compatibility break it'd be to change that
> planner behavior.  I don't see any particular cognitive dissonance here,
> just disagreements about the extent to which backwards compatibility is
> more important than better query optimization.

How about we get the ball rolling on this in v10 and pull that part
out of the docs? If anything, that'll buy us a bit more wiggle room to
change this in v11.

I've attached a proposed patch.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


doc_caution_about_cte_changes_in_the_future.patch
Description: Binary data



Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)

2017-05-04 Thread David Rowley
On 5 May 2017 at 14:36, Andres Freund <and...@anarazel.de> wrote:
> I wonder how much doing the atomic ops approach alone can help, that
> doesn't have the issue that the work might be unevenly distributed
> between pages.

I wondered that too, since I thought the barrier for making this change
would be lower by doing it that way.

I didn't manage to think of a way to get around the wrapping the
position back to 0 when synch-scans are involved.

i.e.

    /* advance to the next block, wrapping back to 0 for synchronized scans */
    parallel_scan->phs_cblock++;
    if (parallel_scan->phs_cblock >= scan->rs_nblocks)
        parallel_scan->phs_cblock = 0;

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)

2017-05-04 Thread David Rowley
On 3 May 2017 at 07:13, Robert Haas <robertmh...@gmail.com> wrote:
> Multiple people (including David Rowley
> as well as folks here at EnterpriseDB) have demonstrated that for
> certain queries, we can actually use a lot more workers and everything
> works great.  The problem is that for other queries, using a lot of
> workers works terribly.  The planner doesn't know how to figure out
> which it'll be - and honestly, I don't either.

For me, it seems pretty much related to the number of tuples processed
on a worker, vs how many they return. As a general rule, I'd say the
higher this ratio, the higher the efficiency ratio will be for the
worker. Although that's not taking into account contention points
where workers must wait for fellow workers to complete some operation.
I think parallel_tuple_cost is a good GUC to have, perhaps we can be
smarter about the use of it when deciding on how many workers should
be used.

By efficiency, I mean that if a query takes 10 seconds in a normal
serial plan, and 5 seconds after adding 1 worker, then using that
worker was 100% efficient. I charted this in [1]. It would have
been interesting to chart the same for a query that returned a larger
number of groups, but I ran out of time. I think it pretty much
goes, without testing, that more groups == less efficiency, which will
be due to more overhead in parallel tuple communication, and more work
to do in the serial portion of the plan.
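
To make the 10-second example concrete, here is a tiny sketch of that
efficiency measure (my own framing of it, assuming the leader also executes
the plan, so "1 worker" means two processes in total):

/* parallel_efficiency(10.0, 5.0, 1) == 1.0, i.e. 100% efficient */
double
parallel_efficiency(double serial_secs, double parallel_secs, int nworkers)
{
    int nprocesses = nworkers + 1;  /* the leader participates too */

    return serial_secs / (parallel_secs * nprocesses);
}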

[1] https://blog.2ndquadrant.com/parallel-monster-benchmark
-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)

2017-05-04 Thread David Rowley
On 5 May 2017 at 13:37, Andres Freund <and...@anarazel.de> wrote:
> On 2017-05-02 15:13:58 -0400, Robert Haas wrote:
>> Multiple people (including David Rowley
>> as well as folks here at EnterpriseDB) have demonstrated that for
>> certain queries, we can actually use a lot more workers and everything
>> works great.  The problem is that for other queries, using a lot of
>> workers works terribly.  The planner doesn't know how to figure out
>> which it'll be - and honestly, I don't either.
>
> Have those benchmarks, even in a very informal form, been shared /
> collected / referenced centrally?  I'd be very interested to know where
> the different contention points are. Possibilities:

I posted mine on [1], although the post does not go into much detail
about the contention points. I only really briefly mention it at the
end.

[1] https://blog.2ndquadrant.com/parallel-monster-benchmark/#comment-248273

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] modeling parallel contention (was: Parallel Append implementation)

2017-05-04 Thread David Rowley
 On 3 May 2017 at 07:13, Robert Haas <robertmh...@gmail.com> wrote:
> It is of course possible that the Parallel Seq Scan could run into
> contention problems if the number of workers is large, but in my
> experience there are bigger problems here.  The non-parallel Seq Scan
> can also contend -- not of course over the shared mutex because there
> isn't one, but over access to the blocks themselves.  Every one of
> those blocks has a content lock and a buffer header and so on, and
> having multiple processes accessing those things at the same time
> scales well, but not perfectly.  The Hash node can also contend: if
> the hash join spills to disk, you've got multiple processes reading
> and writing to the temp directory at the same time and, of course,
> that can be worse than just one process doing it -- sometimes much
> worse.  It can also be better, depending on how much I/O gets
> generated and how much I/O bandwidth you have.

Yeah, I did get some time to look over the contention in Parallel Seq
Scan a while back, and I discovered that, on the machine that I was
testing on, the lock obtained in heap_parallelscan_nextpage() was
causing workers to have to wait for other workers to fetch their next
task to work on.

I ended up writing the attached (which I'd not intended to post until
some time closer to when the doors open for PG11). At the moment it's
basically just a test patch to see how it affects things when we give
workers a bit more to do before they come back to look for more work.
In this case, I've just given them 10 pages to work on, instead of the
1 that's allocated in 9.6 and v10.

A quick test on a pretty large table on a large machine shows:

Unpatched:

postgres=# select count(*) from a;
   count

 187400
(1 row)

Time: 5211.485 ms (00:05.211)

Patched:

postgres=# select count(*) from a;
   count

 187400
(1 row)

Time: 2523.983 ms (00:02.524)

So it seems worth looking into. "a" was just a table with a single int
column. I'm unsure as yet if there are more gains to be had for tables
with wider tuples. I do suspect the patch will be a bigger win in
those cases, since there's less work to do for each page, e.g. fewer
advance-aggregate calls, so they'll likely be looking for their next
page a bit sooner.

Now I'm not going to pretend that this patch is ready for prime
time. I've not yet worked out how to properly report sync-scan
locations without risking reporting later pages after reporting the
end of the scan. What I have at the moment could cause a report to be
missed if SYNC_SCAN_REPORT_INTERVAL is not divisible by the batch
size. I'm also not sure how batching like this affects read-ahead, but
at least the numbers above speak for something, although none of the
pages in this case came from disk.

I'd had thoughts that the 10 pages wouldn't be constant, but that the
batch size would depend on the size of the relation to be scanned.
I'd had rough ideas to just try to make about 1 million batches, something
like batch_pages = Max(parallel_scan->phs_nblocks / 100, 1); so
that we only take more than 1 page if there's some decent amount to
process. We don't want to make the batches too big as we might end up
having to wait on slow workers at the end of a scan.

Anyway. I don't want to hi-jack this thread with discussions on this.
I just wanted to mark that I plan to work on this in order to avoid
any repeat developments or analysis. I'll probably start a new thread
for this sometime nearer PG11's dev cycle.

The patch is attached if in the meantime someone wants to run this on
some big hardware.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


parallel_nextpage_batching_v2.patch
Description: Binary data



Re: [HACKERS] Should pg_current_wal_location() become pg_current_wal_lsn()

2017-05-04 Thread David Rowley
On 2 May 2017 at 00:10, David Rowley <david.row...@2ndquadrant.com> wrote:
> On 20 April 2017 at 07:29, Euler Taveira <eu...@timbira.com.br> wrote:
>> 2017-04-19 1:32 GMT-03:00 Michael Paquier <michael.paqu...@gmail.com>:
>>>
>>> I vote for "location" -> "lsn". I would expect complains about the
>>> current inconsistency at some point, and the function names have been
>>> already changed for this release..
>
> OK, so I've created a draft patch which does this.

I ended up adding this to the open items list.  I feel it's best to be
on there so that we don't forget about this.

If we decide not to do it then we can just remove it from the list,
but it would be a shame to release the beta having forgotten to make
this change.

Do any of the committers who voted for this change feel inclined to
pick this patch up?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Should pg_current_wal_location() become pg_current_wal_lsn()

2017-05-01 Thread David Rowley
On 20 April 2017 at 07:29, Euler Taveira <eu...@timbira.com.br> wrote:
> 2017-04-19 1:32 GMT-03:00 Michael Paquier <michael.paqu...@gmail.com>:
>>
>> I vote for "location" -> "lsn". I would expect complains about the
>> current inconsistency at some point, and the function names have been
>> already changed for this release..

OK, so I've created a draft patch which does this.

In summary of what it does:

1. Renames all wal_location functions to wal_lsn.
2. Renames all system view columns to use "lsn" instead of "location".
3. Renames function parameters to use "lsn" instead of "location".
4. Renames function parameters named "wal_position" to "lsn". (Not mentioned
before, but it seems consistency was high on the list from earlier
comments on the thread.)
5. Changes the documentation to reflect the above changes.
6. Fixes a bug where the docs claimed the return type of
pg_logical_slot_peek_changes.location was text, when it was pg_lsn
(maybe apply separately?).
7. Changes some places in func.sgml where it referred to the
lsn as a "position" rather than "location". (We want consistency here.)

Is this touching all places which were mentioned by everyone?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


location2lsn_rename.patch
Description: Binary data



Re: [HACKERS] convert EXSITS to inner join gotcha and bug

2017-04-29 Thread David Rowley
On 29 April 2017 at 15:39, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I'm kind of strongly tempted to apply the second patch; but it would
> be fair to complain that reduce_unique_semijoins() is new development
> and should wait for v11.  Opinions?

My vote is for the non-minimal patch. Of course, I'd be voting for the
minimal patch if this were a fix for a minor version release, but we're
not even in beta yet for v10. The original patch was intended to fix
cases like this, although I'd failed to consider this particular case.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] convert EXSITS to inner join gotcha and bug

2017-04-28 Thread David Rowley
On 29 April 2017 at 02:26, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Alexander Korotkov <a.korot...@postgrespro.ru> writes:
>> I've reproduced this bug on d981074c.
>> On default config, after loading example.sql.bz2 and VACUUM ANALYZE, query
>> result is OK.
>> But with seqscan and hashjoin disabled, query returns 0 rows.
>
> Ah, thanks for the clue about enable_hashjoin, because it wasn't
> reproducing for me as stated.
>
> It looks like in the case that's giving wrong answers, the mergejoin
> is wrongly getting marked as "Inner Unique".  Something's a bit too
> cheesy about that planner logic --- not sure what, yet.

Seems related to the unconditional setting of extra.inner_unique to
true for JOIN_UNIQUE_INNER jointypes in add_paths_to_joinrel().
Setting this based on the return value of innerrel_is_unique() as done
with the other join types seems to fix the issue.

I don't know yet if that's the correct fix. It's pretty late 'round
this side to be thinking too hard about it.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] convert EXSITS to inner join gotcha and bug

2017-04-28 Thread David Rowley
On 29 April 2017 at 00:45, Alexander Korotkov <a.korot...@postgrespro.ru> wrote:
> On default config, after loading example.sql.bz2 and VACUUM ANALYZE, query
> result is OK.

Hi,

Did you mean to attach this?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Fixup some misusage of appendStringInfo and friends

2017-04-26 Thread David Rowley
On 27 April 2017 at 06:41, Peter Eisentraut
<peter.eisentr...@2ndquadrant.com> wrote:
> On 4/19/17 08:42, Ashutosh Bapat wrote:
>> I reviewed the patch. It compiles clean, make check-world passes. I do
>> not see any issue with it.
>
> Looks reasonable.  Let's keep it for the next commit fest.

Thank you to both of you for looking. I'd thought that maybe the new
stuff in PG10 should be fixed before the release; if we wait and fix it
in PG11, then backpatching is more of a pain.

However, I wasn't careful in the patch to touch only new-to-PG10 code.

I'll defer to your better judgment and add it to the next 'fest.

Thanks

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Foreign Join pushdowns not working properly for outer joins

2017-04-26 Thread David Rowley
On 27 April 2017 at 01:31, Peter Eisentraut
<peter.eisentr...@2ndquadrant.com> wrote:
> committed

Great. Thanks!

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Foreign Join pushdowns not working properly for outer joins

2017-04-26 Thread David Rowley
On 26 April 2017 at 02:12, Peter Eisentraut
<peter.eisentr...@2ndquadrant.com> wrote:
> On 4/24/17 22:50, Peter Eisentraut wrote:
>> On 4/14/17 00:24, Ashutosh Bapat wrote:
>>> This looks better. Here are patches for master and 9.6.
>>> Since join pushdown was supported in 9.6 the patch should be
>>> backported to 9.6 as well. Attached is the patch (_96) for 9.6,
>>> created by rebasing on 9.6 branch and removing conflict. _v6 is
>>> applicable on master.
>>
>> Committed to PG10.  I'll work on backpatching next.
>
> For backpatching to 9.6, I came up with the attached reduced version.
> Since we don't have add_foreign_grouping_paths() in 9.6, we can omit the
> refactoring and keep the changes much simpler.  Does that make sense?

That makes sense to me. It fixes the reported issue and is less
invasive than the pg10 patch.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Incorrect use of ERRCODE_UNDEFINED_COLUMN in extended stats

2017-04-25 Thread David Rowley
On 25 April 2017 at 02:15, Tom Lane <t...@sss.pgh.pa.us> wrote:
> David Rowley <david.row...@2ndquadrant.com> writes:
>> The attached small patched fixes an incorrect usage of an error code
>> in the extended stats code.
>
> Hmm, looks like all of that could do with an editorial pass (e.g.,
> "duplicity" does not mean what the comments author seems to think).

Was about to look at this, but see you've beaten me to it.

Thanks for committing and for fixing the other stuff too.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] PG 10 release notes

2017-04-24 Thread David Rowley
On 25 April 2017 at 13:31, Bruce Momjian <br...@momjian.us> wrote:
> The only unusual thing is that this release has ~180 items while most
> recent release have had ~220.  The pattern I see that there are more
> large features in this release than previous ones.

Thanks for drafting this up.

I understand that it may have been filtered out, but I'd say that
7e534adcdc70 is worth a mention.

Previously, users creating BRIN indexes would have had to know
beforehand that the table was sufficiently correlated on the
indexed columns for the index to be worthwhile, whereas now there's
less need for the user to know this beforehand.

Also:


New commands are CREATE,
ALTER, and
DROP STATISTICS.
This is helpful in
estimating query memory usage and ... HOW IS RATIO USED?


HOW IS RATIO USED?  There are two types of new stats:

ndistinct, which can improve row estimations in the query planner when
estimating the number of distinct value groups over multiple columns.

functionaldeps, which provides the planner with a better understanding
of functional dependencies between columns on which the statistics are
defined. The planner takes the functional dependency degree into
account when multiplying selectivity estimations for multiple columns.

Unsure how best to trim that down to something short enough for the
release notes.
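
For what it's worth, a tiny hypothetical example of the kind of query the
ndistinct stats help with, assuming a table where the two columns are
strongly correlated:

-- Without multi-column ndistinct stats the planner multiplies the per-column
-- distinct estimates and overestimates the number of groups; with them it
-- can use the combined figure, which matters when sizing things like a hash
-- aggregate.
SELECT city, zip, count(*)
FROM addresses
GROUP BY city, zip;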

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Incorrect use of ERRCODE_UNDEFINED_COLUMN in extended stats

2017-04-24 Thread David Rowley
The attached small patched fixes an incorrect usage of an error code
in the extended stats code.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


ext_stats_duplicate_column_errorcode_fix.patch
Description: Binary data



Re: [HACKERS] WITH clause in CREATE STATISTICS

2017-04-22 Thread David Rowley
On 22 April 2017 at 21:30, Simon Riggs <si...@2ndquadrant.com> wrote:
> CREATE STATISTICS s1 USING (dependencies, ndistinct) ON (a, b) FROM t1
> WHERE partial-stuff;

+1

Seems much more CREATE INDEX-like, and that's pretty good given that
most of the complaints so far were about it not bearing enough
resemblance to CREATE INDEX.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WITH clause in CREATE STATISTICS

2017-04-21 Thread David Rowley
On 21 April 2017 at 22:30, Tels <nospam-pg-ab...@bloodgate.com> wrote:
> I'd rather see:
>
>  CREATE STATISTICS stats_name ON table(col);
>
> as this both mirrors CREATE INDEX and foreign keys with REFERENCES. It
> could also be extended to both more columns, expressions or other tables
> like so:
>
>  CREATE STATISTICS stats ON t1(col1, col2 / 2), t2 (a,b);
>
> and even:
>
>  CREATE STATISTICS stats ON t1(col1, col2 / 2), t2 (a,b) WITH (options)
> WHERE t2.a > 4;

How would you express a join condition with that syntax?

> This looks easy to remember, since it compares to:
>
>  CREATE INDEX idx_name ON t2(a,b) WITH (options) WHERE t2.a > 4;
>
> Or am I'm missing something?

Sadly yes, you are, and it's not the first time.

I seem to remember mentioning this to you already in [1].

Please can you read over [2]? It mentions exactly what you're
proposing and why it's not any good.

[1] 
https://www.postgresql.org/message-id/cakjs1f9hmet+7adiceau8heompob5pkkcvyzliezje3dvut...@mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAEZATCUtGR+U5+QTwjHhe9rLG2nguEysHQ5NaqcK=vbj78v...@mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Why is get_cheapest_parallel_safe_total_inner() in pathkeys.c?

2017-04-21 Thread David Rowley
This probably ended up here because there's a bunch of other functions
named get_cheapest* in that file, but all of those relate to getting a
path for specific PathKeys. get_cheapest_parallel_safe_total_inner()
does not do that.

Maybe allpaths.c is a better home for it?

It seems to have been added in [1]

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=a71f10189dc10a2fe422158a2c9409e0f77c6b9e

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Fixup some misusage of appendStringInfo and friends

2017-04-18 Thread David Rowley
The attached cleans up a few small misusages of appendStringInfo and
related functions.

Some similar work was done in:

f92d6a540ac443f85f0929b284edff67da14687a
d02f16470f117db3038dbfd87662d5f0eb5a2a9b
cacbdd78106526d7c4f11f90b538f96ba8696fb0

so the majority of these should all be new misusages.
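
For anyone wondering what sort of thing qualifies, here's a made-up example
of the typical pattern (it assumes PostgreSQL's lib/stringinfo.h and is not
necessarily a hunk from the attached patch):

#include "lib/stringinfo.h"

static void
example(StringInfo buf, const char *value)
{
    /* Needless format-string processing; a plain string append is cheaper: */
    appendStringInfo(buf, "%s", value);
    /* better: */
    appendStringInfoString(buf, value);

    /* Appending a single character has a dedicated function too: */
    appendStringInfo(buf, ";");
    /* better: */
    appendStringInfoChar(buf, ';');
}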

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


appendStringInfo_fixes.patch
Description: Binary data



Re: [HACKERS] Should pg_current_wal_location() become pg_current_wal_lsn()

2017-04-18 Thread David Rowley
On 19 April 2017 at 15:31, Tom Lane <t...@sss.pgh.pa.us> wrote:
> David Rowley <david.row...@2ndquadrant.com> writes:
>> OK, so I've read over this thread again and I think it's time to
>> summarise the votes:
>> ...
>> In favour of "location" -> "lsn": Stephen, David Steel,
>> In favour of "lsn" -> "location": Peter, Tom, Kyotaro
>
> FWIW, I was not voting in favor of "location"; I was just saying that
> I wanted consistency.  If we're voting which way to move, please count
> me as a vote for "lsn".

Updated votes:

In favour of "location" -> "lsn": Tom, Stephen, David Steel
In favour of "lsn" -> "location": Peter, Kyotaro

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Should pg_current_wal_location() become pg_current_wal_lsn()

2017-04-18 Thread David Rowley
On 19 April 2017 at 03:33, David Steele <da...@pgmasters.net> wrote:
> +1, and I'm in favor of using "lsn" wherever applicable.  It seems to me
> that if a user calls a function (or queries a table) that returns an lsn
> then they should be aware of what an lsn is.

OK, so I've read over this thread again and I think it's time to
summarise the votes:

It seem that;

Robert mentions he'd be inclined not to tinker with this, but mentions
nothing of inconsistency.
Bruce mentions he'd like consistency but does not mention which he'd prefer.
Stephen wants consistency and votes to change "location" to "lsn" in
regards to a pg_lsn.
Peter would rather see use of "location"; he mentions changing the
new-in-v10 stuff, but not the existing 9.6 usages of lsn in
column names.
Tom would not object to use of "location" over "lsn".
Kyotaro would rather see the use of "location" for all apart from the
pg_lsn operator functions. Unsure how we handle pg_wal_location_diff()
David Steel would rather see this changed to use "lsn" instead of location.


So in summary, nobody apart from Robert appeared to have any concerns
over breaking backward compatibility.

In favour of "location" -> "lsn": Stephen, David Steel,
In favour of "lsn" -> "location": Peter, Tom, Kyotaro

I left Bruce out since he just voted for consistency.

So "lsn" -> "location" seems to be winning

Is that enough to proceed?

Anyone else?

The patch to do this should likely also include a regression test to
ensure nothing new creeps in which breaks the new standard.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] extended stats not friendly towards ANALYZE with subset of columns

2017-04-17 Thread David Rowley
On 18 April 2017 at 05:12, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
> Pushed, after tweaking so that a WARNING is still emitted, because it's
> likely to be useful in pointing out a procedural mistake while extended
> stats are being tested.

Thanks for pushing.

Seems you maintained most of my original patch and added to it a bit,
but credits don't seem to reflect that. It's not the end of the world
but just wanted to note that.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Allowing extended stats on foreign and partitioned tables

2017-04-17 Thread David Rowley
On 18 April 2017 at 09:01, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
> David Rowley wrote:
>> While reviewing extended stats I noticed that it was possible to
>> create extended stats on many object types, including sequences. I
>> mentioned that this should be disallowed. Statistics were then changed
>> to be only allowed on plain tables and materialized views.
>>
>> This should be relaxed again to allow anything ANALYZE is allowed on.
>>
>> The attached does this.
>
> The error message was inconsistent in the case of indexes, because of
> using heap_open instead of relation_open.  That seemed gratuitous, so I
> flipped it, added test for the whole thing and pushed.

Thanks for changing that and pushing this.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Foreign Join pushdowns not working properly for outer joins

2017-04-12 Thread David Rowley
On 13 April 2017 at 11:22, Peter Eisentraut
<peter.eisentr...@2ndquadrant.com> wrote:
> Is this patch considered ready for review as a backpatch candidate?

Yes; however, the v5 patch is based on master. The v4 patch should
apply to 9.6. Diffing the two patches, I see another tiny change to a
comment, which I think needs rewording anyway.

+ * This function should usually set FDW options in fpinfo after the join is
+ * deemed safe to push down to save some CPU cycles. But We need server
+ * specific options like extensions to decide push-down safety. For
+ * checking extension shippability, we need foreign server as well.
+ */

This might be better written as:

Ordinarily, we might be tempted into delaying the merging of the FDW
options until we've deemed the foreign join to be ok. However, we must
do this before performing this test so that we know which quals can be
evaluated on the foreign server. This may depend on the
shippable_extensions.

Apart from that, it all looks fine to me.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] pg_statistic_ext.staenabled might not be the best column name

2017-04-12 Thread David Rowley
I'd been thinking that staenabled is not the most suitable column name
for storing the types of statistics that are defined for the extended
statistics.  To me, it indicates that something can be disabled,
but there's no syntax for that, and even if there were, if we were to
enable/disable the kinds, we'd likely want to keep tabs on which kinds
were originally defined; otherwise it's more of an ADD and DROP than
an ENABLE/DISABLE.

"stakind" seems like a better name. I'd have personally gone with
"statype" but pg_statistic already thinks stakind is better.

A patch which changes this is attached

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


ext_stats_rename_staenabled.patch
Description: Binary data



Re: [HACKERS] Foreign Join pushdowns not working properly for outer joins

2017-04-12 Thread David Rowley
On 12 April 2017 at 21:45, Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
> On Wed, Apr 12, 2017 at 12:18 PM, David Rowley
> <david.row...@2ndquadrant.com> wrote:
>> On 10 March 2017 at 17:33, Ashutosh Bapat
>> <ashutosh.ba...@enterprisedb.com> wrote:
>>> Yes and I also forgot to update the function prologue to refer to the
>>> fpinfo_o/i instead of inner and outer relations. Attached patch
>>> corrects it.
>>
>> Hi Ashutosh,
>>
>> This seems to have conflicted with 28b04787. Do you want to rebase, or 
>> should I?
>
> Here you go.

Looks like the wrong patch.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Foreign Join pushdowns not working properly for outer joins

2017-04-12 Thread David Rowley
On 10 March 2017 at 17:33, Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
> Yes and I also forgot to update the function prologue to refer to the
> fpinfo_o/i instead of inner and outer relations. Attached patch
> corrects it.

Hi Ashutosh,

This seems to have conflicted with 28b04787. Do you want to rebase, or should I?


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] pg_stats_ext view does not seem all that useful

2017-04-10 Thread David Rowley
During my review and time spent working on the functional dependencies
part of extended statistics I wondered what was the use for the
pg_stats_ext view. I was unsure why the length of the serialised
dependencies was useful.

Perhaps we could improve the view, but I'm not all that sure what
value it would add. Maybe we need to discuss this, but in the
meantime, I've attached a patch which just removes the view completely.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


drop_pg_stats_ext_view.patch
Description: Binary data



[HACKERS] Allowing extended stats on foreign and partitioned tables

2017-04-10 Thread David Rowley
While reviewing extended stats I noticed that it was possible to
create extended stats on many object types, including sequences. I
mentioned that this should be disallowed. Statistics were then changed
to be only allowed on plain tables and materialized views.

This should be relaxed again to allow anything ANALYZE is allowed on.

The attached does this.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


ext_stats_on_analyzable_objects.patch
Description: Binary data



[HACKERS] Should pg_current_wal_location() become pg_current_wal_lsn()

2017-04-10 Thread David Rowley
... and of course the other functions matching *wal*location*

My thoughts here are that we're already breaking backward
compatibility of these functions for PG10, so I thought we might want to
use this as an opportunity to fix the naming a bit more.

I feel that the word "location" is not the best choice.  We also have a
function called pg_tablespace_location() to give us the path that a
tablespace is stored in, so I could understand anyone who's confused
about what pg_current_wal_location() might do if they're looking at
the function name only, and not the docs.

For me, "lsn" suits these function names much better, so I'd like to
see what other's think.

It would be sad to miss this opportunity without at least discussing this first.

The functions in question are:

postgres=# \dfS *wal*location*
                                     List of functions
   Schema   |              Name               | Result data type | Argument data types |  Type
------------+---------------------------------+------------------+---------------------+--------
 pg_catalog | pg_current_wal_flush_location   | pg_lsn           |                     | normal
 pg_catalog | pg_current_wal_insert_location  | pg_lsn           |                     | normal
 pg_catalog | pg_current_wal_location         | pg_lsn           |                     | normal
 pg_catalog | pg_last_wal_receive_location    | pg_lsn           |                     | normal
 pg_catalog | pg_last_wal_replay_location     | pg_lsn           |                     | normal
 pg_catalog | pg_wal_location_diff            | numeric          | pg_lsn, pg_lsn      | normal
(6 rows)

Opinions?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Compiler warning in costsize.c

2017-04-10 Thread David Rowley
On 8 April 2017 at 04:42, Tom Lane <t...@sss.pgh.pa.us> wrote:
> I'd be happier with something along the line of
>
> RangeTblEntry *rte;
> ListCell   *lc;
>
> /* Should only be applied to base relations that are subqueries */
> Assert(rel->relid > 0);
> rte = planner_rt_fetch(rel->relid, root);
> #ifdef USE_ASSERT_CHECKING
> Assert(rte->rtekind == RTE_SUBQUERY);
> #else
> (void) rte;  /* silence unreferenced-variable complaints */
> #endif

the (void) rte; would not be required to silence MSVC here. Of course,
PG_USED_FOR_ASSERTS_ONLY would be required to stop some smarter
compiler from complaining.

> assuming that that actually does silence the warning on MSVC.
>
> BTW, is it really true that only these two places produce such warnings
> on MSVC?  I see about three dozen uses of PG_USED_FOR_ASSERTS_ONLY in our
> tree, and I'd have expected all of those places to be causing warnings on
> a compiler that doesn't have a way to understand that annotation.

Seems that MSVC is happy once the variable is assigned, and does not
bother checking whether the variable is used after being assigned,
whereas some other compilers might see the variable as uselessly
assigned.

At the moment there are no other warnings from MSVC, since in all the
other places the variable gets assigned a value in some code path.
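
For reference, a minimal sketch of the annotation-based variant mentioned
above (illustrative only; it simply combines the code quoted earlier with
PG_USED_FOR_ASSERTS_ONLY, and is not taken from any proposed patch):

    RangeTblEntry *rte PG_USED_FOR_ASSERTS_ONLY;

    /* Should only be applied to base relations that are subqueries */
    Assert(rel->relid > 0);
    rte = planner_rt_fetch(rel->relid, root);
    Assert(rte->rtekind == RTE_SUBQUERY);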

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Making clausesel.c Smarter

2017-04-09 Thread David Rowley
On 8 April 2017 at 09:33, Claudio Freire <klaussfre...@gmail.com> wrote:
> Otherwise, the patch LGTM, but I'd like to solve the quadratic
> behavior too... are you going to try? Otherwise I could take a stab at
> it myself. It doesn't seem very difficult.

I have some ideas in my head in a fairly generic way of solving this
which I'd like to take care of in PG11.

Many thanks for your reviews and suggestions on this patch, it's much
appreciated.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-04-09 Thread David Rowley
On 8 April 2017 at 14:23, Tom Lane <t...@sss.pgh.pa.us> wrote:
> David Rowley <david.row...@2ndquadrant.com> writes:
> [ unique_joins_2017-04-07b.patch ]
>
> It turned out that this patch wasn't as close to committable as I'd
> thought, but after a full day of whacking at it, I got to a place
> where I thought it was OK.  So, pushed.

Many thanks for taking the time to do this, and committing too!

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Making clausesel.c Smarter

2017-04-06 Thread David Rowley
On 4 April 2017 at 13:35, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Mon, Apr 3, 2017 at 9:19 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
> If you ask me, I'd just leave:
>
> + if (rqlist->hibound == DEFAULT_INEQ_SEL || rqlist->lobound ==
> DEFAULT_INEQ_SEL)
> + {
> + /* No point in checking null selectivity, DEFAULT_INEQ_SEL
> implies we have no stats */
> + s2 = DEFAULT_RANGE_INEQ_SEL;
> + }
> + else
> + {
> ...
> +   s2 = rqlist->hibound + rqlist->lobound - 1.0;
> +   nullsel = nulltestsel(root, IS_NULL, rqlist->var, varRelid);
> +   notnullsel = 1.0 - nullsel;
> +
> +   /* Adjust for double-exclusion of NULLs */
> +   s2 += nullsel;
> +   if (s2 <= 0.0) {
> +  if (s2 <= (-1.0e-4 * notnullsel - 1.0e-6))
> +  {
> +   /* Most likely contradictory clauses found */
> +   s2 = (1.0e-10 < notnullsel) ? 1.0e-10 : notnullsel;
> +  }
> +  else
> +  {
> +   /* Could be a rounding error */
> +   s2 = DEFAULT_RANGE_INEQ_SEL * notnullsel;
> +  }
> +   }
> + }
>
> Where (-1.0e-4 * notnullsel - 1.0e-6) is just a very rough (and overly
> cautious) estimation of the amount of rounding error that could be
> there with 32-bit floats.
>
> The above assumes a non-DEFAULT_INEQ_SEL value in lobound/hibound is
> an estimation based on stats, maybe approximate, but one that makes
> sense (ie: preserves the monotonicity of the CPF), and as such
> negative values are either sign of a contradiction or rounding error.
> The previous code left the possibility of negatives coming out of
> default selectivities creeping up on non-DEFAULT_INEQ_SEL estimates,
> but that doesn't seem possible coming out of scalarineqsel.

I notice this does change the estimates for contradictory clauses such as:

create table a (value int);
insert into a select x/100 from generate_Series(1,1) x;
analyze a;
explain analyze select * from a where value between 101 and -1;

We now get:

 QUERY PLAN
-
 Seq Scan on a  (cost=0.00..195.00 rows=1 width=4) (actual
time=1.904..1.904 rows=0 loops=1)
   Filter: ((value >= 101) AND (value <= '-1'::integer))
   Rows Removed by Filter: 1
 Planning time: 0.671 ms
 Execution time: 1.950 ms
(5 rows)

where before we'd get:

  QUERY PLAN
--
 Seq Scan on a  (cost=0.00..195.00 rows=50 width=4) (actual
time=0.903..0.903 rows=0 loops=1)
   Filter: ((value >= 101) AND (value <= '-1'::integer))
   Rows Removed by Filter: 1
 Planning time: 0.162 ms
 Execution time: 0.925 ms
(5 rows)

Which comes from the 1 * 0.005 selectivity estimate from tuples *
DEFAULT_RANGE_INEQ_SEL

I've attached a patch to this effect.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


smarter_clausesel_2017-04-07a.patch
Description: Binary data



Re: [HACKERS] Time to change pg_regress diffs to unified by default?

2017-04-06 Thread David Rowley
On 7 April 2017 at 10:31, Andres Freund <and...@anarazel.de> wrote:
> Hi,
>
> I personally, and I know of a bunch of other regular contributors, find
> context diffs very hard to read.  Besides general dislike, for things
> like regression test output context diffs are just not well suited.
> E.g. in
> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=prairiedog=2017-04-06%2021%3A10%3A56=check
> the salient point (ERROR:  50 is outside the valid range for parameter 
> "effective_io_concurrency" (0 .. 0))
> is 130 lines into the diff, whereas it's right at the start in a unified diff.
> Issues with one error that causes a lot of followup output changes are
> really common in our regression suite.
>
> I personally have PG_REGRESS_DIFF_OPTS set to -dU10, but that doesn't
> help much analyzing buildfarm output.
>
> Therefore I propose changing the defaults in pg_regress.c.

You mean people actually use those diffs? I've never done anything
apart from using . That way I can
reject and accept the changes as I wish, just by kicking the results
over to the expected results, or not, if there's a genuine mistake.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-04-06 Thread David Rowley
On 7 April 2017 at 13:41, Tom Lane <t...@sss.pgh.pa.us> wrote:
> David Rowley <david.row...@2ndquadrant.com> writes:
>> On 7 April 2017 at 11:47, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> What I'm on about is that you can't do the early advance to the
>>> next outer tuple unless you're sure that all the quals that were
>>> relevant to the uniqueness proof have been checked for the current
>>> inner tuple.  That affects all three join types not only merge.
>
>> Well, look at the join code and you'll see this only happens after the
>> joinqual is evaluated. I didn't make a special effort here. I just
>> borrowed the location that JOIN_SEMI was already using.
>
> Right, and that's exactly the point: some of the conditions you're
> depending on might have ended up in the otherqual not the joinqual.
>
> We'd discussed rearranging the executor logic enough to deal with
> such situations and agreed that it seemed too messy; but that means
> that the optimization needs to take care not to use otherqual
> (ie pushed-down) conditions in the uniqueness proofs.
>
>> Oh yeah. I get it, but that's why we ignore !can_join clauses
>
> can_join seems to me to be not particularly relevant ... there's
> nothing that prevents that from getting set for pushed-down clauses.
>
> It's possible that the case I'm worried about is unreachable in
> practice because all the conditions that could be of interest would
> be strict and therefore would have forced join strength reduction.
> But I'm not comfortable with assuming that.

Okay, well how about we protect against that by not using such quals
as unique proofs? We'd just need to ignore anything that's
outerjoin_delayed?

If we're struggling to think of a case that this will affect, then we
shouldn't be too worried about any missed optimisations.

A patch which does this is attached.
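
To illustrate what I mean (a toy standalone mock, not the attached patch, with
a couple of made-up fields standing in for the relevant RestrictInfo ones):

#include <stdbool.h>
#include <stdio.h>

typedef struct MockClause
{
    const char *text;
    bool        can_join;          /* usable as a join clause at all? */
    bool        outerjoin_delayed; /* possibly pushed down from above the join? */
} MockClause;

int
main(void)
{
    MockClause clauses[] = {
        {"t_outer.f1 = t1.f1", true, false},
        {"t_outer.f2 = t1.f2 OR t1.f1 IS NULL", false, true},
    };
    int i;

    for (i = 0; i < 2; i++)
    {
        /* Mirror the existing can_join test, plus the proposed check:
         * outerjoin_delayed clauses are not used in the uniqueness proof */
        if (!clauses[i].can_join || clauses[i].outerjoin_delayed)
            continue;
        printf("usable for uniqueness proof: %s\n", clauses[i].text);
    }
    return 0;
}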

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


unique_joins_2017-04-07b.patch
Description: Binary data



Re: [HACKERS] Performance improvement for joins where outer side is unique

2017-04-06 Thread David Rowley
On 7 April 2017 at 11:47, Tom Lane <t...@sss.pgh.pa.us> wrote:
> David Rowley <david.row...@2ndquadrant.com> writes:
>> On 7 April 2017 at 07:26, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> I'm looking through this, and I'm failing to see where it deals with
>>> the problem we discussed last time, namely that you can't apply the
>>> optimization unless all clauses that were used in the uniqueness
>>> proof are included in the join's merge/hash conditions + joinquals.
>
>> The code in question is:
>> mergestate->mj_SkipMarkRestore = !mergestate->js.joinqual &&
>> mergestate->js.first_inner_tuple_only;
>
> Uh, AFAICS that only protects the skip-mark-and-restore logic.
> What I'm on about is that you can't do the early advance to the
> next outer tuple unless you're sure that all the quals that were
> relevant to the uniqueness proof have been checked for the current
> inner tuple.  That affects all three join types not only merge.

Well, look at the join code and you'll see this only happens after the
joinqual is evaluated. I didn't make a special effort here. I just
borrowed the location that JOIN_SEMI was already using.

For example, from hash join:

if (joinqual == NULL || ExecQual(joinqual, econtext))
{
node->hj_MatchedOuter = true;
HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));

/* In an antijoin, we never return a matched tuple */
if (node->js.jointype == JOIN_ANTI)
{
node->hj_JoinState = HJ_NEED_NEW_OUTER;
continue;
}

/*
* Skip to the next outer tuple if we only need to join to
* the first inner matching tuple.
*/
if (node->js.first_inner_tuple_only)
node->hj_JoinState = HJ_NEED_NEW_OUTER;

Note the first line and the final two lines.

Here's the one from nested loop:

if (ExecQual(joinqual, econtext))
{
node->nl_MatchedOuter = true;

/* In an antijoin, we never return a matched tuple */
if (node->js.jointype == JOIN_ANTI)
{
node->nl_NeedNewOuter = true;
continue; /* return to top of loop */
}

/*
* Skip to the next outer tuple if we only need to join to the
* first inner matching tuple.
*/
if (node->js.first_inner_tuple_only)
node->nl_NeedNewOuter = true;

Again, note the first line and the final two lines.

> The case that would be relevant to this is, eg,
>
> create table t1 (f1 int, f2 int, primary key(f1,f2));
>
> select * from t_outer left join t1 on (t_outer.f1 = t1.f1)
> where t_outer.f2 = t2.f2;

Hmm, that query is not valid unless you have created some table named
t_outer, and I don't know how you've defined that. So I guess you must
have meant:

postgres=# explain verbose select * from t1 t_outer left join t1 on
(t_outer.f1 = t1.f1) where t_outer.f2 = t1.f2;
QUERY PLAN
----------------------------------------------------------------------
 Hash Join  (cost=66.50..133.57 rows=128 width=16)
   Output: t_outer.f1, t_outer.f2, t1.f1, t1.f2
   Inner Unique: Yes
   Hash Cond: ((t_outer.f1 = t1.f1) AND (t_outer.f2 = t1.f2))
   ->  Seq Scan on public.t1 t_outer  (cost=0.00..32.60 rows=2260 width=8)
 Output: t_outer.f1, t_outer.f2
   ->  Hash  (cost=32.60..32.60 rows=2260 width=8)
 Output: t1.f1, t1.f2
 ->  Seq Scan on public.t1  (cost=0.00..32.60 rows=2260 width=8)
   Output: t1.f1, t1.f2
(10 rows)

Which did become an INNER JOIN due to the strict WHERE clause.

If you'd done:

postgres=# explain verbose select * from t1 t_outer left join t1 on
(t_outer.f1 = t1.f1) where t_outer.f2 = t1.f2 or t1.f1 is null;
                              QUERY PLAN
----------------------------------------------------------------------
 Merge Left Join  (cost=0.31..608.67 rows=255 width=16)
   Output: t_outer.f1, t_outer.f2, t1.f1, t1.f2
   Inner Unique: No
   Merge Cond: (t_outer.f1 = t1.f1)
   Filter: ((t_outer.f2 = t1.f2) OR (t1.f1 IS NULL))
   ->  Index Only Scan using t1_pkey on public.t1 t_outer
(cost=0.16..78.06 rows=2260 width=8)
 Output: t_outer.f1, t_outer.f2
   ->  Materialize  (cost=0.16..83.71 rows=2260 width=8)
 Output: t1.f1, t1.f2
 ->  Index Only Scan using t1_pkey on public.t1
(cost=0.16..78.06 rows=2260 width=8)
   Output: t1.f1, t1.f2
(11 rows)

You'll notice that this plan shows "Inner Unique: No".

> Your existing patch would think t1 is unique-inner, but the qual pushed
> down from WHERE would not be a joinqual so the wrong thing would happen
> at runtime.
>
> (Hm ... actually, this example wouldn't fail as written because
> the WHERE qual is probably strict, so the left join would get
> reduced to an inner join and then pushed-down-ness no longer
> matters.  But hopefully you get my drift.)

Oh yeah. I get it, but that's why we ignore !can_join clauses:

/* Ignore if it's not a mergejoinable clause */
if (!restrictinfo->can_join
