Re: UNION ALL

2019-08-16 Thread Mark Pasterkamp
First of all, thank you for the replies.

I am using a base installation of postgres 10.10, with no modifications to
any of the system defaults.

I am trying to speedup a join between two tables: the title table and the
cast_info table.

The title table is a table containing information about different movies.
it contains 4626969 records.
the table also has a foreign key index on the cast_info table, enabling the
planner to use a hash-join.

The cast_info table is a table containing the information of which actor
was casted in which movie and contains 62039343 records.

The database also contains a materialized view ci_t_15, defined as:
select * from cast_info join title on cast_info.movie_id = title.id
where title.production_year < 2015

I am comparing two queries, q1 and q2 respectively.
Query q1 is the original query and q2 is an attempt to reduce the cost of
execution via leveraging the materialized view ci_t_15.

Query q1 is defined as:
select * from cast_info join title on cast_info.movie_id = title.id

Query q2 is defined as
select * from cast_info join title on cast_info.movie_id = title.id
where title.production_year >= 2015
UNION ALL
select * from ci_t_15

Both queries are executed on a Dell xps laptop with an I7-8750H processor
and 16 (2*8) gb ram on an SSD running on ubuntu 18.04.2 LTS.

Running explain analyze on both queries I get the following execution plans.
q1:
"Hash Join  (cost=199773.80..2561662.10 rows=62155656 width=103) (actual
time=855.063..25786.264 rows=62039343 loops=1)"
"  Hash Cond: (cast_info.ci_movie_id = title.t_id)"
"  ->  Seq Scan on cast_info  (cost=0.00..1056445.56 rows=62155656
width=42) (actual time=0.027..3837.722 rows=62039343 loops=1)"
"  ->  Hash  (cost=92232.69..92232.69 rows=4626969 width=61) (actual
time=854.548..854.548 rows=4626969 loops=1)"
"Buckets: 65536  Batches: 128  Memory Usage: 3431kB"
"->  Seq Scan on title  (cost=0.00..92232.69 rows=4626969 width=61)
(actual time=0.005..327.588 rows=4626969 loops=1)"
"Planning time: 5.097 ms"
"Execution time: 27236.088 ms"

q2:
"Append  (cost=123209.65..3713445.65 rows=61473488 width=105) (actual
time=442.207..29713.621 rows=60918189 loops=1)"
"  ->  Gather  (cost=123209.65..2412792.77 rows=10639784 width=103) (actual
time=442.206..14634.427 rows=10046633 loops=1)"
"Workers Planned: 2"
"Workers Launched: 2"
"->  Hash Join  (cost=122209.65..1347814.37 rows=4433243 width=103)
(actual time=471.969..12527.840 rows=3348878 loops=3)"
"  Hash Cond: (cast_info.ci_movie_id = title.t_id)"
"  ->  Parallel Seq Scan on cast_info  (cost=0.00..693870.90
rows=25898190 width=42) (actual time=0.006..7302.679 rows=20679781 loops=3)"
"  ->  Hash  (cost=103800.11..103800.11 rows=792043 width=61)
(actual time=471.351..471.351 rows=775098 loops=3)"
"Buckets: 65536  Batches: 32  Memory Usage: 2515kB"
"->  Seq Scan on title  (cost=0.00..103800.11
rows=792043 width=61) (actual time=0.009..376.127 rows=775098 loops=3)"
"  Filter: (t_production_year >= 2015)"
"  Rows Removed by Filter: 3851871"
"  ->  Seq Scan on ci_t_15  (cost=0.00..1194255.04 rows=50833704 width=105)
(actual time=1.143..11967.391 rows=50871556 loops=1)"
"Planning time: 0.268 ms"
"Execution time: 31379.854 ms"

Due to using the materialized view I can reduce the amount of records going
into the hash join, lowering the time from 25786.264 msec to 12527.840 msec.
However, this is where my question comes in, this reduction is completely
negated by the cost of appending both results in the UNION ALL command.
I was wondering if this is normal behaviour.
In my mind, I wouldn't expect appending 2 resultsets to have such a
relative huge cost associated with it.
This is also why I asked what exactly a UNION ALL does to achieve its
functionality, to perhaps gain some insight in its cost.


With kind regards,

Mark

On Thu, 15 Aug 2019 at 21:22, Ibrar Ahmed  wrote:

>
>
> On Fri, Aug 16, 2019 at 12:16 AM <066ce...@free.fr> wrote:
>
>> Generally speaking, when executing UNION ; a DISTINCT is run afterward on
>> the resultset.
>>
>> So, if you're sure that each part of UNION cannot return a line returned
>> by another one, you may use UNION ALL, you'll cut the cost of the final
>> implicit DISTINCT.
>>
>>
>> - Mail original -
>> De: "Mark Pasterkamp" 
>> À: pgsql-hackers@lists.postgresql.org
>> Envoyé: Jeudi 15 Août 2019 20:37:06
>> Objet: UNION ALL
>>
>>
>> Dear all,
>>
>>
>> I was wondering if someone could hel

UNION ALL

2019-08-15 Thread Mark Pasterkamp
Dear all,

I was wondering if someone could help me understands what a union all
actually does.

For my thesis I am using Apache Calcite to rewrite queries into using
materialized views which I then give to a Postgres database.
For some queries, this means that they will be rewritten in a UNION ALL
style query between an expression and a table scan of a materialized view.
However, contrary to what I expected, the UNION ALL query is actually a lot
slower.

As an example, say I have 2 tables: actor and movie. Furthermore, there is
also a foreign key index on movie to actor.
I also have a materialized view with the join of these 2 tables for all
movies <= 2015 called A.
Now, if I want to query all entries in the join between actor and movie, I
would assume that a UNION ALL between the join of actor and movie for
movies >2015 and A is faster than executing the original query..
If I look at the explain analyze part, I can certainly see a reduction in
cost up until the UNION ALL part, which carries a respective cost more than
negating the cost reduction up to a point where I might as well not use the
existing materialized view.

I have some trouble understanding this phenomenon.
One thought which came to my mind was that perhaps UNION ALL might create a
temporary table containing both result sets, and then do a table scan and
return that result.
this would greatly increase IO cost which could attribute to the problem.
However, I am really not sure what UNION ALL actually does to append both
result sets so I was wondering if someone would be able to help me out with
this.


Mark