[PERFORM] Views With Unions

2003-07-31 Thread Christopher Browne
This is stepping back quite a while; let me point people to the thread
of 2003-02 where Mariusz Czu\x{0142}ada [EMAIL PROTECTED] was
looking for a way of optimizing a VIEW that was a UNION.

http://archives.postgresql.org/pgsql-performance/2003-02/msg00095.php

The subject has come up a few times through PostgreSQL history, and
I'd imagine to think I may have a little something new to offer to it.

Let's consider a table used to store log information:

create table log_table (
   request_time timestamp with time zone,
   object character varying,  -- What they asked for
   request_type character(8), -- What they did to it
   request_size integer,
   requestor inet,
   request_status integer,
   result_size integer,
   request_detail character varying
);
create index log_times on log_table(request_time);
create index log_object on log_table(object);

Every time something happens, an entry goes into this table.
Unfortunately, the table is likely to grow to tremendous size, over
time, and there are all sorts of troublesome things about purging it:

 - Fragmentation may waste space and destroy the usefulness of
indices;

 - Deleting data row by row will cause replication logic to go mad,
as triggers get invoked for every single row modified;

 - The action of deletion will draw the data we just decided was
_useless_ into memory, injuring cache utilization badly as we fill
the cache with trash.

The obvious thought: Create several tables, and join them together
into a view.  So instead of log_table being a table, we have
log_table_1 thru log_table_3, each with the schema describe above, and
define the view:

create view log_table as select * from log_table_1 union all 
 select * from log_table_2 union all 
 select * from log_table_3;

It's easy enough (modulo a little debugging and pl/pgsql work :-)) to
turn this into an updatable view so that inserts into log_table use a
different log table every (day|week|month).  And we can TRUNCATE the
eldest one, which is a cheap operation.

This approach also resembles the way the O guys handle partitioned
tables, so it's not merely about logs.

Unfortunately, selects on the VIEW are, at present, unable to make use
of the indices.  So if we want all log entries for June 11th, the
query:

  select * from log_table where request_time between 'june 11 2003' and
   'june 12 2003';

returns a plan:
Subquery Scan log_table  (cost=0.00..10950.26 rows=177126 width=314)
  -  Append  (cost=0.00..10950.26 rows=177126 width=314)
-  Subquery Scan *SELECT* 1  (cost=0.00..3089.07 rows=50307 width=71)
  -  Seq Scan on log_table_1  (cost=0.00..3089.07 rows=50307 width=71)
-  Subquery Scan *SELECT* 2  (cost=0.00..602.92 rows=9892 width=314)
  -  Seq Scan on log_table_2  (cost=0.00..602.92 rows=9892 width=314)
-  Subquery Scan *SELECT* 3  (cost=0.00..2390.09 rows=39209 width=314)
  -  Seq Scan on log_table_3  (cost=0.00..2390.09 rows=39209 width=314)

In effect, the query is materialized into:

select * from 
   (select * from log_table_1 union all select * from log_table_2
   union all select * from log_table_3) as merger
where [request_time between 'june 11 2003' and 'june 12 2003'];

What would perform better would be to attach the WHERE clause to each
of the union members.  (Everyone stop and sing Solidarity Forever
:-))

E.g.:

select * from
   (
   select * from log_table_1 where request_time between 'june 11 2003' and 'june 12 
2003' union all
   select * from log_table_2 where request_time between 'june 11 2003' and 'june 12 
2003' union all
   select * from log_table_3 where request_time between 'june 11 2003' and 'june 12 
2003' union all
   ) as merged_version;

Subquery Scan merged_version  (cost=0.00..947.04 rows=247 width=314) (actual 
time=55.86..1776.42 rows=20124 loops=1)
  -  Append  (cost=0.00..947.04 rows=247 width=314) (actual time=55.84..1483.60 
rows=20124 loops=1)
-  Subquery Scan *SELECT* 1  (cost=0.00..3.02 rows=1 width=71) (actual 
time=55.83..289.81 rows=3422 loops=1)
  -  Index Scan using log_table_1_trans_on_idx on log_table_1  
(cost=0.00..3.02 rows=1 width=71) (actual time=55.80..239.84 rows=3422 loops=1)
-  Subquery Scan *SELECT* 2  (cost=0.00..191.38 rows=49 width=314) (actual 
time=62.32..1115.15 rows=16702 loops=1)
  -  Index Scan using log_table_2_trans_on_idx on log_table_2  
(cost=0.00..191.38 rows=49 width=314) (actual time=62.29..873.63 rows=16702 loops=1)
-  Subquery Scan *SELECT* 3  (cost=0.00..752.64 rows=196 width=314) (actual 
time=26.69..26.69 rows=0 loops=1)
  -  Index Scan using log_table_3_trans_on_idx on log_table_3  
(cost=0.00..752.64 rows=196 width=314) (actual time=26.69..26.69 rows=0 loops=1)
Total runtime: 1806.39 msec

Which is nice and quick, as it cuts each set down to size _before_
merging them.


Re: [PERFORM] Views With Unions

2003-07-31 Thread Rajesh Kumar Mallah
Stephan Szabo wrote:

On Thu, 31 Jul 2003, Christopher Browne wrote:

 

 select * from log_table where request_time between 'june 11 2003' and
  'june 12 2003';
returns a plan:
Subquery Scan log_table  (cost=0.00..10950.26 rows=177126 width=314)
 -  Append  (cost=0.00..10950.26 rows=177126 width=314)
   -  Subquery Scan *SELECT* 1  (cost=0.00..3089.07 rows=50307 width=71)
 -  Seq Scan on log_table_1  (cost=0.00..3089.07 rows=50307 width=71)
   -  Subquery Scan *SELECT* 2  (cost=0.00..602.92 rows=9892 width=314)
 -  Seq Scan on log_table_2  (cost=0.00..602.92 rows=9892 width=314)
   -  Subquery Scan *SELECT* 3  (cost=0.00..2390.09 rows=39209 width=314)
 -  Seq Scan on log_table_3  (cost=0.00..2390.09 rows=39209 width=314)
   

What version are you using?  In 7.3 and up it should be willing to
consider moving the clause down, unless there's something like a type
mismatch (because in that case it may not be equivalent without a bunch
more work on the clause).
Dear Chris,

I had the same problem(type mismatch)  and it was  solved finally. 
check the list
factoring problem ...   subject only 2 weeks back .

regds
mallah.



---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
 joining column's datatypes do not match
 



---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster