[PERFORM] query produces 1 GB temp file

2006-10-27 Thread Dirk Lutzebaeck

Hi,

here is a query which produces over 1G temp file in pgsql_tmp. This
is on pgsql 7.4.2, RHEL 3.0, XEON MP machine with 32GB RAM, 300MB
sort_mem and 320MB shared_mem.

Below is the query and results for EXPLAIN and EXPLAIN ANALYZE. All
tables have been analyzed before.

Can someone please explain why the temp file is so huge? I understand
there are a lot of rows. All relevant indices seem to be used.

Thanks in advance,

Dirk

EXPLAIN 
SELECT DISTINCT ON (ft.val_9, ft.created, ft.flatid) ft.docstart, ft.flatobj, 
bi.oid, bi.en
FROM bi, en, df AS ft, es
WHERE bi.rc=130170467
AND bi.en=ft.en
AND bi.co=117305223
AND bi.hide=FALSE
AND ft.en=en.oid
AND es.en=bi.en
AND es.co=bi.co
AND es.spec=122293729
AND (ft.val_2='DG' OR ft.val_2='SK')
AND ft.docstart=1
ORDER BY ft.val_9 ASC, ft.created DESC
LIMIT 1000 OFFSET 0;

 Limit  (cost=8346.75..8346.78 rows=3 width=1361)
   ->  Unique  (cost=8346.75..8346.78 rows=3 width=1361)
         ->  Sort  (cost=8346.75..8346.76 rows=3 width=1361)
               Sort Key: ft.val_9, ft.created, ft.flatid
               ->  Nested Loop  (cost=0.00..8346.73 rows=3 width=1361)
                     ->  Nested Loop  (cost=0.00..5757.17 rows=17 width=51)
                           ->  Nested Loop  (cost=0.00..5606.39 rows=30 width=42)
                                 ->  Index Scan using es_sc_index on es  (cost=0.00..847.71 rows=301 width=8)
                                       Index Cond: ((spec = 122293729) AND (co = 117305223::oid))
                                 ->  Index Scan using bi_env_index on bi  (cost=0.00..15.80 rows=1 width=42)
                                       Index Cond: ("outer".en = bi.en)
                                       Filter: ((rc = 130170467::oid) AND (co = 117305223::oid) AND (hide = false))
                           ->  Index Scan using en_oid_index on en  (cost=0.00..5.01 rows=1 width=9)
                                 Index Cond: ("outer".en = en.oid)
                     ->  Index Scan using df_en on df ft  (cost=0.00..151.71 rows=49 width=1322)
                           Index Cond: ("outer".en = ft.en)
                           Filter: (((val_2 = 'DG'::text) OR (val_2 = 'SK'::text)) AND (docstart = 1))
(17 rows)


--

EXPLAIN ANALYZE gives:


 Limit  (cost=8346.75..8346.78 rows=3 width=1361) (actual time=75357.465..75679.964 rows=1000 loops=1)
   ->  Unique  (cost=8346.75..8346.78 rows=3 width=1361) (actual time=75357.459..75675.371 rows=1000 loops=1)
         ->  Sort  (cost=8346.75..8346.76 rows=3 width=1361) (actual time=75357.448..75499.263 rows=22439 loops=1)
               Sort Key: ft.val_9, ft.created, ft.flatid
               ->  Nested Loop  (cost=0.00..8346.73 rows=3 width=1361) (actual time=34.104..18016.005 rows=703677 loops=1)
                     ->  Nested Loop  (cost=0.00..5757.17 rows=17 width=51) (actual time=0.467..3216.342 rows=48563 loops=1)
                           ->  Nested Loop  (cost=0.00..5606.39 rows=30 width=42) (actual time=0.381..1677.014 rows=48563 loops=1)
                                 ->  Index Scan using es_sc_index on es  (cost=0.00..847.71 rows=301 width=8) (actual time=0.184..46.519 rows=5863 loops=1)
                                       Index Cond: ((spec = 122293729) AND (co = 117305223::oid))
                                 ->  Index Scan using bi_env_index on bi  (cost=0.00..15.80 rows=1 width=42) (actual time=0.052..0.218 rows=8 loops=5863)
                                       Index Cond: ("outer".en = bi.en)
                                       Filter: ((rc = 130170467::oid) AND (co = 117305223::oid) AND (hide = false))
                           ->  Index Scan using en_oid_index on en  (cost=0.00..5.01 rows=1 width=9) (actual time=0.015..0.019 rows=1 loops=48563)
                                 Index Cond: ("outer".en = en.oid)
                     ->  Index Scan using df_en on df ft  (cost=0.00..151.71 rows=49 width=1322) (actual time=0.038..0.148 rows=14 loops=48563)
                           Index Cond: ("outer".en = ft.en)
                           Filter: (((val_2 = 'DG'::text) OR (val_2 = 'SK'::text)) AND (docstart = 1))
 Total runtime: 81782.052 ms
(18 rows)




Re: [PERFORM] query produces 1 GB temp file

2006-10-27 Thread Merlin Moncure

On 2/5/05, Dirk Lutzebaeck [EMAIL PROTECTED] wrote:

here is a query which produces over 1G temp file in pgsql_tmp. This
is on pgsql 7.4.2, RHEL 3.0, XEON MP machine with 32GB RAM, 300MB
sort_mem and 320MB shared_mem.

Below is the query and results for EXPLAIN and EXPLAIN ANALYZE. All
tables have been analyzed before.

Can someone please explain why the temp file is so huge? I understand
there are a lot of rows. All relevant indices seem to be used.


How much memory have you set aside for sorting? Also, this query will
likely run better on a more recent version of PostgreSQL, if that's
possible.
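
For example, to check and raise it for a single session (on 7.4 the
setting is still called sort_mem and is measured in KB; the value here is
just an illustration):

SHOW sort_mem;
SET sort_mem = 262144;   -- 256 MB, for this session only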

merlin



Re: [PERFORM] query produces 1 GB temp file

2006-10-27 Thread Bricklen Anderson

Merlin Moncure wrote:

On 2/5/05, Dirk Lutzebaeck [EMAIL PROTECTED] wrote:

snip
Was the original message actually from 2/5/05?



Re: [PERFORM] query produces 1 GB temp file

2006-10-27 Thread Alexander Staubo
While I can't explain why PostgreSQL would use that much memory, I
recommend looking into tweaking the work_mem parameter. This setting
specifies how much memory PostgreSQL may use for certain temporary data
structures (hash tables, sort buffers) before it starts spilling to
temporary files. Quoting the docs:



work_mem (integer)

    Specifies the amount of memory to be used by internal sort operations
    and hash tables before switching to temporary disk files. The value is
    specified in kilobytes, and defaults to 1024 kilobytes (1 MB). Note
    that for a complex query, several sort or hash operations might be
    running in parallel; each one will be allowed to use as much memory as
    this value specifies before it starts to put data into temporary
    files. Also, several running sessions could be doing such operations
    concurrently. So the total memory used could be many times the value
    of work_mem; it is necessary to keep this fact in mind when choosing
    the value. Sort operations are used for ORDER BY, DISTINCT, and merge
    joins. Hash tables are used in hash joins, hash-based aggregation, and
    hash-based processing of IN subqueries.
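
For example, to raise it for a single session on 8.x (the value, in KB, is
just an illustration):

SET work_mem = 262144;   -- 256 MB until the session ends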


Alexander.

On Feb 5, 2005, at 18:25 , Dirk Lutzebaeck wrote:


here is a query which produces over 1G temp file in pgsql_tmp. This
is on pgsql 7.4.2, RHEL 3.0, XEON MP machine with 32GB RAM, 300MB
sort_mem and 320MB shared_mem.

[rest of quoted message and EXPLAIN output snipped]
Re: [PERFORM] query produces 1 GB temp file

2006-10-27 Thread Dirk Lutzebäck




Hi,

I'm sorry, but it looks like my computer has resent older posts from me,
sigh...


Dirk

Alexander Staubo wrote:
 While I can't explain why PostgreSQL would use that much memory, I
 recommend looking into tweaking the work_mem parameter.
[rest of quoted message snipped]

Re: [PERFORM] query produces 1 GB temp file

2006-10-27 Thread Thomas Burdairon
He is probably using IPOT (IP Other Time): http://kadreg.free.fr/ipot/ :-)
(sorry, only a French page)

On Oct 27, 2006, at 16:33, Bricklen Anderson wrote:
 Merlin Moncure wrote:
  On 2/5/05, Dirk Lutzebaeck [EMAIL PROTECTED] wrote:
  [snip]
 Was the original message actually from 2/5/05?

Re: [PERFORM] query produces 1 GB temp file

2006-10-27 Thread Scott Marlowe
On Sat, 2005-02-05 at 11:25, Dirk Lutzebaeck wrote:
 Hi,
 
 here is a query which produces over 1G temp file in pgsql_tmp. This
 is on pgsql 7.4.2, RHEL 3.0, XEON MP machine with 32GB RAM, 300MB
 sort_mem and 320MB shared_mem.

First step, upgrade to the latest 7.4.x version. 7.4.2 is an OLD version
of 7.4; I think the latest version is 7.4.13.

 Below is the query and results for EXPLAIN and EXPLAIN ANALYZE. All
 tables have been analyzed before.

SNIP

 EXPLAIN ANALYZE gives:
 
 
  Limit  (cost=8346.75..8346.78 rows=3 width=1361) (actual time=75357.465..75679.964 rows=1000 loops=1)
    ->  Unique  (cost=8346.75..8346.78 rows=3 width=1361) (actual time=75357.459..75675.371 rows=1000 loops=1)
          ->  Sort  (cost=8346.75..8346.76 rows=3 width=1361) (actual time=75357.448..75499.263 rows=22439 loops=1)
                Sort Key: ft.val_9, ft.created, ft.flatid
                ->  Nested Loop  (cost=0.00..8346.73 rows=3 width=1361) (actual time=34.104..18016.005 rows=703677 loops=1)
                      ->  Nested Loop  (cost=0.00..5757.17 rows=17 width=51) (actual time=0.467..3216.342 rows=48563 loops=1)
                            ->  Nested Loop  (cost=0.00..5606.39 rows=30 width=42) (actual time=0.381..1677.014 rows=48563 loops=1)
                                  ->  Index Scan using es_sc_index on es  (cost=0.00..847.71 rows=301 width=8) (actual time=0.184..46.519 rows=5863 loops=1)
                                        Index Cond: ((spec = 122293729) AND (co = 117305223::oid))
                                  ->  Index Scan using bi_env_index on bi  (cost=0.00..15.80 rows=1 width=42) (actual time=0.052..0.218 rows=8 loops=5863)
                                        Index Cond: ("outer".en = bi.en)
                                        Filter: ((rc = 130170467::oid) AND (co = 117305223::oid) AND (hide = false))
                            ->  Index Scan using en_oid_index on en  (cost=0.00..5.01 rows=1 width=9) (actual time=0.015..0.019 rows=1 loops=48563)
                                  Index Cond: ("outer".en = en.oid)
                      ->  Index Scan using df_en on df ft  (cost=0.00..151.71 rows=49 width=1322) (actual time=0.038..0.148 rows=14 loops=48563)
                            Index Cond: ("outer".en = ft.en)
                            Filter: (((val_2 = 'DG'::text) OR (val_2 = 'SK'::text)) AND (docstart = 1))
  Total runtime: 81782.052 ms
 (18 rows)

Why do you have an index scan on en_oid_index that thinks it will return
1 row when it returns 48563, and one on df_en that thinks it will return
49 and returns 48563 as well?   Is this database analyzed often?  Are
oids even analyzed?  I'd really recommend switching off of them as they
complicate backups and restores.

If analyze doesn't help, you can try brute-forcing off nested loops for
this query and see if that helps. Nested loops are really slow for large
numbers of rows.
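
For example, just to see what plan you get (session-local, not something
to leave on):

SET enable_nestloop = off;
EXPLAIN ANALYZE SELECT ... ;   -- the query above
SET enable_nestloop = on;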



Re: [PERFORM] query produces 1 GB temp file

2005-02-09 Thread Christopher Kings-Lynne
I'm doing VACUUM ANALYZE once a night. Before the tests I did VACUUM and
then ANALYZE.

I'd suggest once an hour on any reasonably active database...
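
For example, a cron entry along these lines (database name is made up):

0 * * * *  vacuumdb --analyze --quiet yourdb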
Chris


Re: [PERFORM] query produces 1 GB temp file

2005-02-08 Thread Dirk Lutzebaeck
Greg Stark wrote:
I gave a bunch of explain analyze select commands to test estimates for
individual columns. What results do they come up with? If those are inaccurate
then raising the statistics target is a good route. If those are accurate
individually but the combination is inaccurate then you have a more difficult
problem.
 

After setting the new statistics target to 200 they did slightly better,
but were still not accurate. The results were attached to my last post.
Here is a copy:


explain analyze select * from bi where rc=130170467;

 Seq Scan on bi  (cost=0.00..41078.76 rows=190960 width=53) (actual time=0.157..3066.028 rows=513724 loops=1)
   Filter: (rc = 130170467::oid)
 Total runtime: 4208.663 ms
(3 rows)

explain analyze select * from bi where co=117305223;

 Seq Scan on bi  (cost=0.00..41078.76 rows=603988 width=53) (actual time=0.021..3692.238 rows=945487 loops=1)
   Filter: (co = 117305223::oid)
 Total runtime: 5786.268 ms
(3 rows)

Here is the distribution of the data in bi:
select count(*) from bi;
1841966
select count(*) from bi where rc=130170467::oid;
513732
select count(*) from bi where co=117305223::oid;
945503




Re: [PERFORM] query produces 1 GB temp file

2005-02-08 Thread Dirk Lutzebaeck
Hi John,
thanks very much for your analysis. I'll probably need to reorganize 
some things.

Regards,
Dirk
John A Meinel wrote:
Dirk Lutzebaeck wrote:
Hi,
here is a query which produces over 1G temp file in pgsql_tmp. This
is on pgsql 7.4.2, RHEL 3.0, XEON MP machine with 32GB RAM, 300MB
sort_mem and 320MB shared_mem.
Below is the query and results for EXPLAIN and EXPLAIN ANALYZE. All
tables have been analyzed before.
Can someone please explain why the temp file is so huge? I understand
there are a lot of rows.
Thanks in advance,
Dirk
 

...
->  Nested Loop  (cost=0.00..8346.73 rows=3 width=1361) (actual time=34.104..18016.005 rows=703677 loops=1)

Well, there is this particular step where it thinks there will only be 3
rows, but in fact there are 703,677 of them. And the previous line:

->  Sort  (cost=8346.75..8346.76 rows=3 width=1361) (actual time=75357.448..75499.263 rows=22439 loops=1)

seems to indicate that after sorting you still have 22,439 rows, which
then get pared down again to 1000.

I'm assuming that the sort you are trying to do is extremely 
expensive. You are sorting 700k rows, which takes up too much memory 
(1GB), which forces it to create a temporary table, and write it out 
to disk.

I didn't analyze it a lot, but you might get a lot better performance 
from doing a subselect, rather than the query you wrote.

You are joining 4 tables (bi, en, df AS ft, es). I don't know which
tables are what size. In the end, though, you don't really care about
the en table or es tables (they aren't in your output).

So maybe one of your subselects could be:

where bi.en in (select en from es where es.co = bi.co and
es.spec=122293729);

I'm pretty sure the reason you need 1GB of temp space is because at
one point you have 700k rows. Is it possible to rewrite the query so
that it does more filtering earlier? Your distinct criteria seem to
filter it down to 20k rows. So maybe it's possible to do some sort of
a distinct in part of the subselect, before you start joining against
other tables.
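
Something along these lines, perhaps (untested, and note it drops the
join to en, which looks like it is only there as an existence check):

SELECT DISTINCT ON (ft.val_9, ft.created, ft.flatid)
       ft.docstart, ft.flatobj, b.oid, b.en
FROM df AS ft,
     (SELECT bi.oid, bi.en
        FROM bi, es
       WHERE bi.rc = 130170467
         AND bi.co = 117305223
         AND bi.hide = FALSE
         AND es.en = bi.en
         AND es.co = bi.co
         AND es.spec = 122293729) AS b
WHERE ft.en = b.en
  AND (ft.val_2 = 'DG' OR ft.val_2 = 'SK')
  AND ft.docstart = 1
ORDER BY ft.val_9 ASC, ft.created DESC
LIMIT 1000 OFFSET 0;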

If you have that much redundancy, you might also need to think of 
doing a different normalization.

Just some thoughts.
Also, I thought using the oid column wasn't really recommended, 
since in *high* volume databases they aren't even guaranteed to be 
unique. (I think it is a 32-bit number that rolls over.) Also on a 
database dump and restore, they don't stay the same, unless you take a 
lot of extra care that they are included in both the dump and the 
restore. I believe it is better to create your own id per table (say 
SERIAL or BIGSERIAL).
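
A sketch (the column list here is pure guesswork, just to show the idea):

CREATE TABLE bi (
    id   BIGSERIAL PRIMARY KEY,   -- your own key instead of the system oid
    en   oid NOT NULL,
    co   oid NOT NULL,
    rc   oid NOT NULL,
    hide boolean NOT NULL DEFAULT false
);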

John
=:-



Re: [PERFORM] query produces 1 GB temp file

2005-02-08 Thread Dirk Lutzebaeck
Tom,
the original query has more output columns; I reduced it for readability.
Specifically, it returns a persistent object (the flatobj column) which
needs to be processed by the application as the returned result. The
problem of the huge sort space usage seems to be that flatobj is part of
each row, so I guess it is always copied around by the sort. When I drop
flatobj from the output columns, the size of the temp file drops
dramatically. So I'll probably need to read flatobj after the sorting,
with a subselect against the limited result.
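
Something along these lines, maybe (untested, and it assumes flatid is
enough to find the row in df again, which may well be wrong):

SELECT slim.docstart, big.flatobj, slim.oid, slim.en
FROM (SELECT DISTINCT ON (ft.val_9, ft.created, ft.flatid)
             ft.val_9, ft.created, ft.flatid, ft.docstart, bi.oid, bi.en
        FROM bi, en, df AS ft, es
       WHERE ...   -- same conditions as in the original query
       ORDER BY ft.val_9 ASC, ft.created DESC
       LIMIT 1000 OFFSET 0) AS slim,
     df AS big
WHERE big.flatid = slim.flatid
ORDER BY slim.val_9 ASC, slim.created DESC;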

Regards,
Dirk
Tom Lane wrote:
[EMAIL PROTECTED] (Dirk Lutzebaeck) writes:
 

SELECT DISTINCT ON (df.val_9, df.created, df.flatid) df.docindex, 
df.flatobj, bi.oid, bi.en
FROM bi,df
WHERE bi.rc=130170467
...
ORDER BY df.val_9 ASC, df.created DESC
LIMIT 1000 OFFSET 0
   

Just out of curiosity, what is this query supposed to *do* exactly?
It looks to me like it will give indeterminate results.  Practical
uses of DISTINCT ON generally specify more ORDER BY columns than
there are DISTINCT ON columns, because the extra columns determine
which rows have priority to survive the DISTINCT filter.  With the
above query, you have absolutely no idea which row will be output
for a given combination of val_9/created/flatid.
regards, tom lane
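
(To make Tom's point concrete, something like

SELECT DISTINCT ON (val_9, created, flatid) docindex, flatobj
FROM df
ORDER BY val_9 ASC, created DESC, flatid, docindex DESC;

would at least pin the surviving row of each val_9/created/flatid group
to the one with the highest docindex; the choice of docindex as the
tie-breaker here is only for illustration.)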
 




Re: [PERFORM] query produces 1 GB temp file

2005-02-08 Thread Dirk Lutzebaeck
John A Meinel wrote:
Dirk Lutzebaeck wrote:
Greg Stark wrote:
I gave a bunch of explain analyze select commands to test estimates for
individual columns. What results do they come up with? If those are
inaccurate then raising the statistics target is a good route. If those
are accurate individually but the combination is inaccurate then you have
a more difficult problem.

 

After setting the new statistics target to 200 they did slightly better,
but were still not accurate. The results were attached to my last post.
Here is a copy:


It does seem that setting the statistics to a higher value would help,
since rc=130170467 seems to account for almost 1/3 of the data. Probably
you have other values that are much less common, so setting a high
statistics target would help the planner realize that this value occurs
at a different frequency from the other ones. Can you try other numbers
and see what the counts are?
There is not much effect when increasing statistics target much higher. 
I guess this is because rc=130170467 takes a large portion of the column 
distribution.

I assume you did do a vacuum analyze after adjusting the statistics 
target.
Yes.
Also interesting that in the time it took you to place these queries, 
you had received 26 new rows.
Yes, it's a live system...
And finally, what is the row count if you do
explain analyze select * from bi where rc=130170467::oid and 
co=117305223::oid;
explain analyze select * from bi where rc=130170467::oid and
co=117305223::oid;

 Seq Scan on bi  (cost=0.00..43866.19 rows=105544 width=51) (actual time=0.402..3724.222 rows=513732 loops=1)
   Filter: ((rc = 130170467::oid) AND (co = 117305223::oid))

Well, each of the two columns' values covers a large part of the whole
table. The data is not well distributed, so it needs to do full scans...

If this is a lot less than say 500k, then probably you aren't going to 
be helped a lot. The postgresql statistics engine doesn't generate 
cross column statistics. It always assumes random distribution of 
data. So if two columns are correlated (or anti-correlated), it won't 
realize that.
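
To put numbers on the independence assumption, using the counts Dirk
posted: rc=130170467 matches 513732 of 1841966 rows (about 28%) and
co=117305223 matches 945503 (about 51%). Assuming the columns are
independent, the planner would expect roughly 0.28 * 0.51 * 1841966,
i.e. ~264k rows for the AND of the two; but since (apparently) every
rc=130170467 row also has co=117305223, the real count stays at 513732.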
105k, that seems to be my problem. Not much randomly distributed data.
Does 8.0 address this problem?

Even so, your original desire was to reduce the size of the 
intermediate step (where you have 700k rows). So you need to try and 
design a subselect on bi which is as restrictive as possible, so that 
you don't get all of these rows. With any luck, the planner will 
realize ahead of time that there won't be that many rows, and can use 
indexes, etc. But even if it doesn't use an index scan, if you have a 
query that doesn't use a lot of rows, then you won't need a lot of 
disk space.
I'll try that. What I have already noticed is that one of my output
columns is quite large, which is why it uses so much temp space. I'll
probably need to sort without that output column and read it in
afterwards using a subselect on the limited result.

Thanks for your help,
Dirk
John
=:-
[attached explain analyze output and counts snipped; same as quoted
earlier in the thread]





Re: [PERFORM] query produces 1 GB temp file

2005-02-06 Thread John A Meinel
Dirk Lutzebaeck wrote:
Greg,
Thanks for your analysis. But I don't get anything better after bumping
the STATISTICS target from 10 to 200.
explain analyze shows that the optimizer is still way off estimating 
the rows. Is this normal? It still produces a 1 GB temp file.
I simplified the query a bit, now only two tables are involved (bi, 
df). I also vacuumed.

Are you just doing VACUUM? Or are you doing VACUUM ANALYZE? You might
also try VACUUM FULL ANALYZE (in case you have too many dead tuples in
the table).

VACUUM cleans up, but doesn't adjust any planner statistics without ANALYZE.
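
For example:

VACUUM ANALYZE bi;        -- cleans up and refreshes planner statistics
VACUUM FULL ANALYZE bi;   -- also compacts the table; slower, takes a lock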
John
=:-




[PERFORM] query produces 1 GB temp file

2005-02-05 Thread Dirk Lutzebaeck

Hi,

here is a query which produces over 1G temp file in pgsql_tmp. This
is on pgsql 7.4.2, RHEL 3.0, XEON MP machine with 32GB RAM, 300MB
sort_mem and 320MB shared_mem.

Below is the query and results for EXPLAIN and EXPLAIN ANALYZE. All
tables have been analyzed before.

Can someone please explain why the temp file is so huge? I understand
there are a lot of rows.

Thanks in advance,

Dirk

EXPLAIN 
SELECT DISTINCT ON (ft.val_9, ft.created, ft.flatid) ft.docstart, ft.docindex, 
ft.flatobj, bi.oid, bi.en
FROM bi, en, df AS ft, es
WHERE bi.rc=130170467
AND bi.en=ft.en
AND bi.co=117305223
AND bi.hide=FALSE
AND ft.en=en.oid
AND es.en=bi.en
AND es.co=bi.co
AND es.spec=122293729
AND (ft.val_2='DG' OR ft.val_2='SK')
AND ft.docstart=1
ORDER BY ft.val_9 ASC, ft.created DESC
LIMIT 1000 OFFSET 0;

[EXPLAIN and EXPLAIN ANALYZE output snipped; identical to the copies
quoted at the top of this thread]




Re: [PERFORM] query produces 1 GB temp file

2005-02-05 Thread Greg Stark

Dirk Lutzebaeck [EMAIL PROTECTED] writes:

 Below is the query and results for EXPLAIN and EXPLAIN ANALYZE. All
 tables have been analyzed before.

Really? A lot of the estimates are very far off. If you really just analyzed
these tables immediately prior to the query then perhaps you should try
raising the statistics target on spec and co. Or is the problem that there's a
correlation between those two columns?

->  Nested Loop  (cost=0.00..8346.73 rows=3 width=1361) (actual time=34.104..18016.005 rows=703677 loops=1)
      ->  Nested Loop  (cost=0.00..5757.17 rows=17 width=51) (actual time=0.467..3216.342 rows=48563 loops=1)
            ->  Nested Loop  (cost=0.00..5606.39 rows=30 width=42) (actual time=0.381..1677.014 rows=48563 loops=1)
                  ->  Index Scan using es_sc_index on es  (cost=0.00..847.71 rows=301 width=8) (actual time=0.184..46.519 rows=5863 loops=1)
                        Index Cond: ((spec = 122293729) AND (co = 117305223::oid))

The root of your problem. The optimizer is off by a factor of 20: it
thinks these two columns are much more selective than they are.

                  ->  Index Scan using bi_env_index on bi  (cost=0.00..15.80 rows=1 width=42) (actual time=0.052..0.218 rows=8 loops=5863)
                        Index Cond: ("outer".en = bi.en)
                        Filter: ((rc = 130170467::oid) AND (co = 117305223::oid) AND (hide = false))

It also thinks these three columns are much more selective than they are.

How accurate are its estimates if you just do these?

explain analyze select * from es where spec = 122293729
explain analyze select * from es where co = 117305223::oid
explain analyze select * from bi where rc = 130170467::oid
explain analyze select * from bi where co = 117305223
explain analyze select * from bi where hide = false

If they're individually accurate then you've run into the familiar problem of
needing cross-column statistics. If they're individually inaccurate then you
should try raising the targets on those columns with:

ALTER TABLE [ ONLY ] name [ * ]
ALTER [ COLUMN ] column SET STATISTICS integer

and reanalyzing.
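
For example (200 is an arbitrary value; the default target in 7.4 is 10):

ALTER TABLE es ALTER COLUMN spec SET STATISTICS 200;
ALTER TABLE es ALTER COLUMN co   SET STATISTICS 200;
ANALYZE es;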


Dirk Lutzebaeck [EMAIL PROTECTED] writes:

 Can someone please explain why the temp file is so huge? I understand
 there are a lot of rows.

Well that I can't explain. 22k rows of width 1361 doesn't sound so big to me
either. The temporary table does need to store three copies of the records at
a given time, but still it sounds like an awful lot.


-- 
greg

