subject:"\[HACKERS\] Improve OR conditions on joined columns."

Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

2017-04-03 Thread Tom Lane

Andres Freund  writes:
> On 2017-02-14 14:18:54 -0500, Tom Lane wrote:
>> One point that could use further review is whether the de-duplication
>> algorithm is actually correct.  I'm only about 95% convinced by the
>> argument I wrote in planunionor.c's header comment.

> Are you planning to push this into v10? Given the looming freeze I'm
> inclined to bump this to the next CF.

I'm still not fully convinced the patch is correct, so I have no problem
with bumping it out.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

2017-04-03 Thread Andres Freund

Hi,

On 2017-02-14 14:18:54 -0500, Tom Lane wrote:
> I think this might be code-complete, modulo the question of whether we
> want an enabling GUC for it.  I'm still concerned about the question
> of whether it adds more planning time than it's worth for most users.
> (Obviously it needs some regression test cases too, and a lot more
> real-world testing than I've given it.)

> One point that could use further review is whether the de-duplication
> algorithm is actually correct.  I'm only about 95% convinced by the
> argument I wrote in planunionor.c's header comment.
> 
> Potential future work includes finding join ORs in upper-level INNER JOIN
> ON clauses, and improving matters so we can do the unique-ification with
> hashing as well as sorting.  I don't think either of those things has to
> be in the first commit, though.

Are you planning to push this into v10? Given the looming freeze I'm
inclined to bump this to the next CF.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

2017-02-14 Thread Tom Lane

Jim Nasby  writes:
> On 2/14/17 1:18 PM, Tom Lane wrote:
>> One point that could use further review is whether the de-duplication
>> algorithm is actually correct.  I'm only about 95% convinced by the
>> argument I wrote in planunionor.c's header comment.

> I'll put some thought into it and see if I can find any holes. Are you 
> only worried about the removal of "useless" rels or is there more?

Well, the key point is whether it's really OK to de-dup on the basis
of only the CTIDs that are not eliminated in any UNION arm.  I was
feeling fairly good about that until I thought of the full-join-to-
left-join-to-no-join conversion issue mentioned in the comment.
Now I'm wondering if there are other holes; or maybe I'm wrong about
that one and it's not necessary to be afraid of full joins.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

2017-02-14 Thread Jim Nasby


On 2/14/17 1:18 PM, Tom Lane wrote:

I think this might be code-complete, modulo the question of whether we
want an enabling GUC for it.  I'm still concerned about the question
of whether it adds more planning time than it's worth for most users.
(Obviously it needs some regression test cases too, and a lot more
real-world testing than I've given it.)


Don't we normally provide a GUC for this level of query manipulation? We 
can always remove it later if evidence shows there's not sufficient 
downside (perhaps after gaining the ability for the planner to do extra 
work on queries that appear to be large enough to warrant it).



One point that could use further review is whether the de-duplication
algorithm is actually correct.  I'm only about 95% convinced by the
argument I wrote in planunionor.c's header comment.


Another reason for a GUC...

I'll put some thought into it and see if I can find any holes. Are you 
only worried about the removal of "useless" rels or is there more?

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

2017-02-14 Thread Tom Lane

I wrote:
> The main remaining piece of work here is that, as you can see from the
> above, it fails to eliminate joins to tables that we don't actually need
> in a particular UNION arm.  This is because the references to those
> tables' ctid columns prevent analyzejoins.c from removing the joins.
> I've thought about ways to deal with that but haven't come up with
> anything that wasn't pretty ugly and/or invasive.

I thought of a way that wasn't too awful, which was to inject the requests
for CTID columns only after we've done join removal.  The attached v2
patch produces this for your test case:

  QUERY 
PLAN   
---
 Aggregate  (cost=36.84..36.85 rows=1 width=8) (actual time=0.075..0.075 rows=1 
loops=1)
   ->  Result  (cost=36.77..36.83 rows=4 width=0) (actual time=0.069..0.073 
rows=3 loops=1)
 ->  Unique  (cost=36.77..36.79 rows=4 width=6) (actual 
time=0.069..0.072 rows=3 loops=1)
   ->  Sort  (cost=36.77..36.78 rows=4 width=6) (actual 
time=0.068..0.069 rows=4 loops=1)
 Sort Key: f.ctid
 Sort Method: quicksort  Memory: 25kB
 ->  Append  (cost=4.57..36.73 rows=4 width=6) (actual 
time=0.018..0.033 rows=4 loops=1)
   ->  Nested Loop  (cost=4.57..18.37 rows=2 width=6) 
(actual time=0.018..0.020 rows=2 loops=1)
 ->  Index Scan using dim_t_key on dim d1  
(cost=0.28..8.29 rows=1 width=10) (actual time=0.005..0.005 rows=1 loops=1)
   Index Cond: ('1'::text = t)
 ->  Bitmap Heap Scan on fact f  
(cost=4.30..10.05 rows=2 width=14) (actual time=0.010..0.012 rows=2 loops=1)
   Recheck Cond: (f1 = d1.s)
   Heap Blocks: exact=2
   ->  Bitmap Index Scan on f_f1  
(cost=0.00..4.29 rows=2 width=0) (actual time=0.006..0.006 rows=2 loops=1)
 Index Cond: (f1 = d1.s)
   ->  Nested Loop  (cost=4.57..18.37 rows=2 width=6) 
(actual time=0.009..0.011 rows=2 loops=1)
 ->  Index Scan using dim_t_key on dim d2  
(cost=0.28..8.29 rows=1 width=10) (actual time=0.003..0.003 rows=1 loops=1)
   Index Cond: ('1'::text = t)
 ->  Bitmap Heap Scan on fact f  
(cost=4.30..10.05 rows=2 width=14) (actual time=0.005..0.006 rows=2 loops=1)
   Recheck Cond: (f2 = d2.s)
   Heap Blocks: exact=2
   ->  Bitmap Index Scan on f_f2  
(cost=0.00..4.29 rows=2 width=0) (actual time=0.003..0.003 rows=2 loops=1)
 Index Cond: (f2 = d2.s)
 Planning time: 0.732 ms
 Execution time: 0.156 ms
(25 rows)

I think this might be code-complete, modulo the question of whether we
want an enabling GUC for it.  I'm still concerned about the question
of whether it adds more planning time than it's worth for most users.
(Obviously it needs some regression test cases too, and a lot more
real-world testing than I've given it.)

One point that could use further review is whether the de-duplication
algorithm is actually correct.  I'm only about 95% convinced by the
argument I wrote in planunionor.c's header comment.

Potential future work includes finding join ORs in upper-level INNER JOIN
ON clauses, and improving matters so we can do the unique-ification with
hashing as well as sorting.  I don't think either of those things has to
be in the first commit, though.

regards, tom lane

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 1560ac3..fa34efb 100644
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*** _outPlannerInfo(StringInfo str, const Pl
*** 2078,2083 
--- 2078,2084 
  	WRITE_FLOAT_FIELD(tuple_fraction, "%.4f");
  	WRITE_FLOAT_FIELD(limit_tuples, "%.0f");
  	WRITE_UINT_FIELD(qual_security_level);
+ 	WRITE_BOOL_FIELD(needBaseTids);
  	WRITE_BOOL_FIELD(hasInheritedTarget);
  	WRITE_BOOL_FIELD(hasJoinRTEs);
  	WRITE_BOOL_FIELD(hasLateralRTEs);
diff --git a/src/backend/optimizer/plan/Makefile b/src/backend/optimizer/plan/Makefile
index 88a9f7f..1db6bd5 100644
*** a/src/backend/optimizer/plan/Makefile
--- b/src/backend/optimizer/plan/Makefile
*** top_builddir = ../../../..
*** 13,18 
  include $(top_builddir)/src/Makefile.global
  
  OBJS = analyzejoins.o createplan.o initsplan.o planagg.o planmain.o planner.o \
! 	setrefs.o subselect.o
  
  include

Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

2017-02-12 Thread Jim Nasby


On 2/12/17 5:06 PM, David Rowley wrote:

Yet I've worked with OLTP applications
since 2005, and I struggle to recall any many:many joins at all.


Interesting... I've run across it numerous times. In any case, for OLTP 
there's other things you can do fairly easily.



Perhaps this optimisation is a candidate for only being applied when
some sort of planner_strength GUC (as mentioned in FOSDEM developer
meeting in 2016) reaches some threshold. There's certainly already
some planner smarts that can be skipped when such a GUC is set to a
lower level (e.g join removal). We could likely save many cycles if we
had the ability to re-plan queries where total_cost > X with more
smarts enabled.


Yeah, I strongly suspect some kind of "multi-stage" planner would be a 
big win.


As for the POC, that's the same kind of plan I'm seeing IRL: a nested 
loop gets used essentially to do the lookup of dimension text to 
dimension ID.


I'm wondering if there's any tricks that could be applied on the sort 
since it's dealing with CTIDs.


I do think it'd be even better if we had the ability to do that lookup 
as part of planning, so you could discover the exact stats for the 
relevant ID values, but that's even more involved.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

2017-02-12 Thread David Rowley

On 13 February 2017 at 06:32, Tom Lane  wrote:
> It's not so much poor choices as the cost of the optimization attempt ---
> if there's a K-relation OR clause, this will increase the cost of planning
> by a factor approaching K+1, whether or not you get a better plan out of
> it.  I ran the regression tests with some instrumentation and determined
> that this logic fires a dozen or two times, and fails to produce a plan
> that looks cheaper than the standard plan in any of those cases.  So if we
> go down this road, not only do we need a GUC but I suspect it had better
> default to off; only people using star schemas are really likely to get a
> win out of it.

I always try to shy away from assuming that the regression test suite
is a good reflection of a real world set of queries. It's full of
tests for edge cases that are rarely seen in reality. FWIW I did a
similar experiment with unique joins and was disappointed to see that
it didn't apply in more cases. Yet I've worked with OLTP applications
since 2005, and I struggle to recall any many:many joins at all.

Perhaps this optimisation is a candidate for only being applied when
some sort of planner_strength GUC (as mentioned in FOSDEM developer
meeting in 2016) reaches some threshold. There's certainly already
some planner smarts that can be skipped when such a GUC is set to a
lower level (e.g join removal). We could likely save many cycles if we
had the ability to re-plan queries where total_cost > X with more
smarts enabled.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

2017-02-12 Thread Tom Lane

David Rowley  writes:
> This is very interesting. Couldn't this be even more generic and
> instead of looking at just the jointree quals, also look at the join
> conditions too, as I think you can use this to also transforms queries
> such as:

The hard part of that is to remember exactly where you need to do surgery
on the querytree to replace the OR clause, keeping in mind that a tree
copy step is going to happen in between first searching for the OR and
doing the replacement.  I'm sure it's soluble but it would've involved
more complexity than I felt justified for a POC.

> I imagine you'd also want an enable_unionor GUC which can be used to
> disable this for when the planner makes a poor choice.

It's not so much poor choices as the cost of the optimization attempt ---
if there's a K-relation OR clause, this will increase the cost of planning
by a factor approaching K+1, whether or not you get a better plan out of
it.  I ran the regression tests with some instrumentation and determined
that this logic fires a dozen or two times, and fails to produce a plan
that looks cheaper than the standard plan in any of those cases.  So if we
go down this road, not only do we need a GUC but I suspect it had better
default to off; only people using star schemas are really likely to get a
win out of it.

Maybe we could improve that situation by applying some heuristics that
prevent the optimization from being attempted unless it's more likely to
produce a win.  But I'm not sure what that would look like.  We have to
make the decision pretty early and without a lot of information.  As an
example, Jim's case fails completely if we don't do the transformation
before reduce_outer_joins, because one of the things that *has* to happen
to get a desirable plan is to recognize that the extracted restriction
clause allows the left join to the dimension table to be simplified to an
inner join.

Having written the POC, I find myself not terribly satisfied with it.
It seems brute-force and not at all integrated with the rest of the
planner.  Still, asking for integration with the existing UNION planning
is probably silly, since that really needs to be thrown away and rewritten.
Maybe this is a reasonable way forward until we get a clearer view of
what that area needs to look like.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

2017-02-12 Thread David Rowley

On 12 February 2017 at 13:30, Tom Lane  wrote:
> I wrote a POC patch for this on a long airplane ride.  It's not complete,
> and I'm sure there are bugs as well, but it makes your example case
> better.  What I did about the de-duplication issue is to de-dup using
> the CTIDs of all baserels in the query.  This means the optimization
> is only applicable to cases where all the rels have CTIDs ... but other
> methods such as inspecting unique keys ain't gonna work for non-table
> rels either, so I think this is about the best we can hope for.
> However, I did not understand your point about:

This is very interesting. Couldn't this be even more generic and
instead of looking at just the jointree quals, also look at the join
conditions too, as I think you can use this to also transforms queries
such as:

select * from t1 inner join t2 on t1.a = t2.a or t1.b = t2.b;

I imagine you'd also want an enable_unionor GUC which can be used to
disable this for when the planner makes a poor choice.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

2017-02-11 Thread Tom Lane

Jim Nasby  writes:
> On 2/8/17 5:54 PM, Tom Lane wrote:
>> Maybe it'd be better to imagine this as something closer to planagg.c,
>> that is it knows how to apply a specific high-level optimization that
>> might or might not be a win, so it builds a path describing that and sees
>> if it looks cheaper than a path done the normal way.  The fact that
>> we could even build a path describing a union is something that wasn't
>> there before 9.6, but maybe there's enough infrastructure for that now.

> It's encouraging that there's at least a template to follow there... If 
> there is missing infrastructure, is it likely to be major? My uninformed 
> guess would be that the pathification was the major hurdle.

I wrote a POC patch for this on a long airplane ride.  It's not complete,
and I'm sure there are bugs as well, but it makes your example case
better.  What I did about the de-duplication issue is to de-dup using
the CTIDs of all baserels in the query.  This means the optimization
is only applicable to cases where all the rels have CTIDs ... but other
methods such as inspecting unique keys ain't gonna work for non-table
rels either, so I think this is about the best we can hope for.
However, I did not understand your point about:

> BTW, there's an important caveat here: users generally do NOT want 
> duplicate rows from the fact table if the dimension table results aren't 
> unique.

so maybe we should be thinking about some other way entirely?

Anyway, what's attached below produces this on your example query:

 Aggregate  (cost=38.12..38.13 rows=1 width=8) (actual time=0.158..0.158 rows=1 
loops=1)
   ->  Result  (cost=38.03..38.11 rows=4 width=0) (actual time=0.151..0.155 
rows=3 loops=1)
 ->  Unique  (cost=38.03..38.07 rows=4 width=18) (actual 
time=0.150..0.154 rows=3 loops=1)
   ->  Sort  (cost=38.03..38.04 rows=4 width=18) (actual 
time=0.150..0.151 rows=4 loops=1)
 Sort Key: f.ctid, d1.ctid, d2.ctid
 Sort Method: quicksort  Memory: 25kB
 ->  Append  (cost=4.85..37.99 rows=4 width=18) (actual 
time=0.070..0.106 rows=4 loops=1)
   ->  Nested Loop Left Join  (cost=4.85..19.00 rows=2 
width=18) (actual time=0.069..0.075 rows=2 loops=1)
 ->  Nested Loop  (cost=4.57..18.37 rows=2 
width=16) (actual time=0.035..0.038 rows=2 loops=1)
   ->  Index Scan using dim_t_key on dim d1 
 (cost=0.28..8.29 rows=1 width=10) (actual time=0.009..0.009 rows=1 loops=1)
 Index Cond: ('1'::text = t)
   ->  Bitmap Heap Scan on fact f  
(cost=4.30..10.05 rows=2 width=14) (actual time=0.023..0.025 rows=2 loops=1)
 Recheck Cond: (f1 = d1.s)
 Heap Blocks: exact=2
 ->  Bitmap Index Scan on f_f1  
(cost=0.00..4.29 rows=2 width=0) (actual time=0.016..0.016 rows=2 loops=1)
   Index Cond: (f1 = d1.s)
 ->  Index Scan using dim_pkey on dim d2  
(cost=0.28..0.31 rows=1 width=10) (actual time=0.016..0.017 rows=0 loops=2)
   Index Cond: (f.f2 = s)
   ->  Nested Loop Left Join  (cost=4.85..19.00 rows=2 
width=18) (actual time=0.025..0.029 rows=2 loops=1)
 ->  Nested Loop  (cost=4.57..18.37 rows=2 
width=16) (actual time=0.022..0.025 rows=2 loops=1)
   ->  Index Scan using dim_t_key on dim d2 
 (cost=0.28..8.29 rows=1 width=10) (actual time=0.003..0.003 rows=1 loops=1)
 Index Cond: ('1'::text = t)
   ->  Bitmap Heap Scan on fact f  
(cost=4.30..10.05 rows=2 width=14) (actual time=0.017..0.019 rows=2 loops=1)
 Recheck Cond: (f2 = d2.s)
 Heap Blocks: exact=2
 ->  Bitmap Index Scan on f_f2  
(cost=0.00..4.29 rows=2 width=0) (actual time=0.014..0.014 rows=2 loops=1)
   Index Cond: (f2 = d2.s)
 ->  Index Scan using dim_pkey on dim d1  
(cost=0.28..0.31 rows=1 width=10) (actual time=0.001..0.001 rows=0 loops=2)
   Index Cond: (f.f1 = s)

The main remaining piece of work here is that, as you can see from the
above, it fails to eliminate joins to tables that we don't actually need
in a particular UNION arm.  This is because the references to those
tables' ctid columns prevent analyzejoins.c from removing the joins.
I've thought about ways to deal with that but haven't come up with
anything that wasn't pretty ugly

Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

2017-02-09 Thread Claudio Freire

On Thu, Feb 9, 2017 at 9:50 PM, Jim Nasby  wrote:
> WHERE t1 IN ('a','b') OR t2 IN ('c','d')
>
> into
>
> WHERE f1 IN (1,2) OR f2 IN (3,4)
>
> (assuming a,b,c,d maps to 1,2,3,4)
>
> BTW, there's an important caveat here: users generally do NOT want duplicate
> rows from the fact table if the dimension table results aren't unique. I
> thought my array solution was equivalent to what the JOINs would do in that
> case but that's actually wrong. The array solution does provide the behavior
> users generally want here though. JOIN is the easiest tool to pick up for
> this, so it's what people gravitate to, but I suspect most users would be
> happier with a construct that worked like the array trick does, but was
> easier to accomplish.
>
> I wonder if any other databases have come up with non-standard syntax to do
> this.

What I've been doing is do those transforms (tn -> fn) in application
code. While it's a chore, the improvement in plans is usually well
worth the trouble.

IF there's a FK between fact and dimension tables, you can be certain
the transform will yield equivalent results, becuase you'll be certain
the joins don't duplicate rows.

So the transform should be rather general and useful.

If you have a join of the form:

a join b on a.f1 = b.id

Where a.f1 has an FK referencing b.id, and a filter on b X of any
form, you can turn the plan into:

with b_ids as (select id from b where X)
...
a join b on a.f1 = b.id and a.f1 in (select id from b_ids)

In order to be useful, the expected row count from b_ids should be rather small.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

2017-02-09 Thread Jim Nasby


On 2/8/17 5:54 PM, Tom Lane wrote:

Although ... actually, that may not be the bottleneck for what you're
after.  The issue here is not so much discovering a clever plan for
a union as realizing that the query could be cast as a union in the
first place.


Right; their current workaround is to manually write a UNION. That's 
*significantly* faster than the OR.



Maybe it'd be better to imagine this as something closer to planagg.c,
that is it knows how to apply a specific high-level optimization that
might or might not be a win, so it builds a path describing that and sees
if it looks cheaper than a path done the normal way.  The fact that
we could even build a path describing a union is something that wasn't
there before 9.6, but maybe there's enough infrastructure for that now.


It's encouraging that there's at least a template to follow there... If 
there is missing infrastructure, is it likely to be major? My uninformed 
guess would be that the pathification was the major hurdle.



There's another transform using arrays that's possible as well (see
attached example); I believe that would work regardless of uniqueness.


That doesn't look terribly promising for automated application.
And I think it's really dependent on the exact shape of the OR
clause, which is an unpleasant limitation.  Considering the amount


Not sure what you mean by shape; do you mean whether the OR conditions 
are rels vs Consts or Vars?



of work this will take to do at all, you'd want it to be pretty
general I think.  I'm imagining something that would look for an
OR in which each clause referenced only one rel, then if it can
identify a way to re-unique-ify the result, split into a UNION
with an arm for each rel used in the OR.  The nature of the
conditions in each OR arm don't seem to matter ... though probably
you'd have to insist on there not being volatile conditions anywhere
in the query, because they'd get evaluated too many times.


In my experience, the common use case here is a large fact table that 
joins to a number of dimension tables. If you can filter the large fact 
table down to a fairly small size, those joins don't matter much. But if 
a big chunk of your filter comes from the joins, you're in trouble.


UNION isn't really a great way to solve this problem because you still 
end up doing a full join just to run the filter, but really all that you 
need are the values from the dimension table(s) that you need for the 
filter. IOW, change


WHERE t1 IN ('a','b') OR t2 IN ('c','d')

into

WHERE f1 IN (1,2) OR f2 IN (3,4)

(assuming a,b,c,d maps to 1,2,3,4)

BTW, there's an important caveat here: users generally do NOT want 
duplicate rows from the fact table if the dimension table results aren't 
unique. I thought my array solution was equivalent to what the JOINs 
would do in that case but that's actually wrong. The array solution does 
provide the behavior users generally want here though. JOIN is the 
easiest tool to pick up for this, so it's what people gravitate to, but 
I suspect most users would be happier with a construct that worked like 
the array trick does, but was easier to accomplish.


I wonder if any other databases have come up with non-standard syntax to 
do this.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Improve OR conditions on joined columns.

2017-02-08 Thread Tom Lane

Jim Nasby  writes:
> AFAICT this can be transformed into a UNION (not all) if dim.id is 
> unique. Does the upper planner pathification make this any easier? 

What I did in 9.6 is a first step.  The next step, I think, is to
replace prepunion.c with something that can consider more than one
implementation path for a union.

Although ... actually, that may not be the bottleneck for what you're
after.  The issue here is not so much discovering a clever plan for
a union as realizing that the query could be cast as a union in the
first place.

Maybe it'd be better to imagine this as something closer to planagg.c,
that is it knows how to apply a specific high-level optimization that
might or might not be a win, so it builds a path describing that and sees
if it looks cheaper than a path done the normal way.  The fact that
we could even build a path describing a union is something that wasn't
there before 9.6, but maybe there's enough infrastructure for that now.

> There's another transform using arrays that's possible as well (see 
> attached example); I believe that would work regardless of uniqueness.

That doesn't look terribly promising for automated application.
And I think it's really dependent on the exact shape of the OR
clause, which is an unpleasant limitation.  Considering the amount
of work this will take to do at all, you'd want it to be pretty
general I think.  I'm imagining something that would look for an
OR in which each clause referenced only one rel, then if it can
identify a way to re-unique-ify the result, split into a UNION
with an arm for each rel used in the OR.  The nature of the
conditions in each OR arm don't seem to matter ... though probably
you'd have to insist on there not being volatile conditions anywhere
in the query, because they'd get evaluated too many times.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Improve OR conditions on joined columns.

2017-02-08 Thread Jim Nasby

I've a client interested enough in $SUBJECT that they're willing to 
offer a bounty on it. An example of the pain is (working example attached):


create temp view denorm as
select f.*, d1.t t1, d2.t t2
from fact f
left join dim d1 on f1=d1.id
left join dim d2 on f2=d2.id
;

-- Fast
explain analyze select count(*) from denorm where 1 in (f1,f2);
explain analyze select count(*) from denorm where '1' in (t1);

-- Slow
explain analyze select count(*) from denorm where '1' in (t1,t2);

They currently work around this by doing a union:

select ... from denorm where t1 = '1'
union
select ... from denorm where t2 = '1'

... or depending on needs using IN instead of =.

AFAICT this can be transformed into a UNION (not all) if dim.id is 
unique. Does the upper planner pathification make this any easier? 
There's another transform using arrays that's possible as well (see 
attached example); I believe that would work regardless of uniqueness.


Just to be clear; the OR by itself is not a problem (as shown by the 
first fast query); it's the OR with the JOIN that's a problem.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)
create temp table fact as select generate_series(1,999) f1, 
generate_series(1,999) f2;
insert into fact select s,null from generate_series(1,999) s;
insert into fact select null,s from generate_series(1,999) s;
create index f_f1 on fact(f1);
create index f_f2 on fact(f2);
analyze fact;

create temp table dim as select s, s::text t from generate_series(1,999) s;
alter table dim add primary key(s), add unique (t);
create temp view denorm as
select f.*, d1.t t1, d2.t t2, array[f1,f2] ident_ids
from fact f
left join dim d1 on f1=d1.s
left join dim d2 on f2=d2.s
;

-- Fast
explain analyze select count(*) from denorm where 1 in (f1,f2);
explain analyze select count(*) from denorm where '1' in (t1);

-- Slow
explain analyze select count(*) from denorm where '1' in (t1,t2);

CREATE FUNCTION ident_ids(
idents text[]
) RETURNS int[] STABLE LANGUAGE sql AS $$
SELECT array( SELECT s FROM dim WHERE t = ANY(idents) )
$$;
CREATE FUNCTION ident_ids(
idents text
) RETURNS int[] STABLE LANGUAGE sql AS $$
SELECT ident_ids( ('{' || idents || '}')::text[] )
$$;

explain analyze select count(*) from denorm where ident_ids && 
ident_ids('42,84,128');

\echo ERROR OK here on version >= 10
create index fact__f_array on fact using gin ((array[f1,f2]) _int4_ops); -- pre 
10.0
\echo ERROR OK here on version < 10
create index fact__f_array on fact using gin ((array[f1,f2]) array_ops); -- 10.0

explain analyze select count(*) from denorm where ident_ids && 
ident_ids('42,84,128');

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)

Re: [HACKERS] Improve OR conditions on joined columns.

[HACKERS] Improve OR conditions on joined columns.

14 matches

Site Navigation

Mail list logo

Footer information