Re: [HACKERS] Planner estimates and cast operations

2006-09-04 Thread Tom Lane
Hans-Juergen Schoenig <[EMAIL PROTECTED]> writes:
 consider the following:

   SELECT some_timestamp::date FROM very_large_table
   GROUP BY some_timestamp::date

 my very_large_table has around 1 billion entries.
 the problem is: the planner has a problem here as it is taking the
 (correct) estimates for timestamp. this prevents a HashAggregate
 because the dataset seems too large for work_mem.
 what the planner cannot know is that the number of days is quite
 limited (in my case around 1000 different values).
 i wonder how to teach the planner to take the cast into consideration.

Create an index on that expression.

regression=# create table foo(x) as select x * '864 sec'::interval +
now()::timestamp from generate_series(1,10000) x;
SELECT
regression=# analyze foo;
ANALYZE
regression=# explain select x::date from foo group by x::date;
                          QUERY PLAN
---------------------------------------------------------------
 HashAggregate  (cost=205.00..330.00 rows=10000 width=8)
   ->  Seq Scan on foo  (cost=0.00..180.00 rows=10000 width=8)
(2 rows)

regression=# create index fooi on foo((x::date));
CREATE INDEX
regression=# analyze foo;
ANALYZE
regression=# explain select x::date from foo group by x::date;
                          QUERY PLAN
---------------------------------------------------------------
 HashAggregate  (cost=205.00..206.26 rows=101 width=8)
   ->  Seq Scan on foo  (cost=0.00..180.00 rows=10000 width=8)
(2 rows)

regression=#

I had to cheat a little bit here: I tried to do this example with a
timestamptz column, and the index creation failed because timestamptz to
date isn't immutable (it depends on TimeZone).  If yours is too, you
could perhaps do something involving AT TIME ZONE to generate an
immutable conversion to date.
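
For instance, something along these lines might work; this is only a
sketch, assuming the column is timestamptz and that pinning the
conversion to one zone (UTC here, purely illustrative, as is the index
name) is acceptable for the application:

-- x AT TIME ZONE 'UTC' yields a plain timestamp, and timestamp -> date
-- is immutable, so this expression is legal in an index definition
create index foo_date_idx on foo (((x at time zone 'UTC')::date));

-- queries must use the identical expression to benefit from the
-- index and its statistics
select (x at time zone 'UTC')::date
from foo
group by (x at time zone 'UTC')::date;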

It would perhaps make sense to provide a way to cue ANALYZE to compute
stats on expressions that aren't actually being indexed, but I see no
good reason to limit our attention to cast expressions.

regards, tom lane



Re: [HACKERS] Planner estimates and cast operations

2006-09-04 Thread Hans-Juergen Schoenig

hi tom ...

i thought about creating an index on the expression but the problem
is that this is hardly feasible.
in 8.0 (what i have here) this would block the table and i would run
out of disk space as well. this is a 600 gb beast :(


what about the planner approach?
this would solve the problem for some other issues as well. an index  
might not be flexible enough :(.


many thanks,

hans


On Sep 4, 2006, at 4:57 PM, Tom Lane wrote:

 Hans-Juergen Schoenig <[EMAIL PROTECTED]> writes:
 consider the following:

   SELECT some_timestamp::date FROM very_large_table
   GROUP BY some_timestamp::date

 my very_large_table has around 1 billion entries.
 the problem is: the planner has a problem here as it is taking the
 (correct) estimates for timestamp. this prevents a HashAggregate
 because the dataset seems too large for work_mem.
 what the planner cannot know is that the number of days is quite
 limited (in my case around 1000 different values).
 i wonder how to teach the planner to take the cast into consideration.

 Create an index on that expression.

 regression=# create table foo(x) as select x * '864 sec'::interval +
 now()::timestamp from generate_series(1,10000) x;
 SELECT
 regression=# analyze foo;
 ANALYZE
 regression=# explain select x::date from foo group by x::date;
                           QUERY PLAN
 ---------------------------------------------------------------
  HashAggregate  (cost=205.00..330.00 rows=10000 width=8)
    ->  Seq Scan on foo  (cost=0.00..180.00 rows=10000 width=8)
 (2 rows)

 regression=# create index fooi on foo((x::date));
 CREATE INDEX
 regression=# analyze foo;
 ANALYZE
 regression=# explain select x::date from foo group by x::date;
                           QUERY PLAN
 ---------------------------------------------------------------
  HashAggregate  (cost=205.00..206.26 rows=101 width=8)
    ->  Seq Scan on foo  (cost=0.00..180.00 rows=10000 width=8)
 (2 rows)

 regression=#

 I had to cheat a little bit here: I tried to do this example with a
 timestamptz column, and the index creation failed because timestamptz to
 date isn't immutable (it depends on TimeZone).  If yours is too, you
 could perhaps do something involving AT TIME ZONE to generate an
 immutable conversion to date.

 It would perhaps make sense to provide a way to cue ANALYZE to compute
 stats on expressions that aren't actually being indexed, but I see no
 good reason to limit our attention to cast expressions.

 regards, tom lane





Re: [HACKERS] Planner estimates and cast operations

2006-09-04 Thread Bruno Wolff III
On Mon, Sep 04, 2006 at 17:19:37 +0200,
  Hans-Juergen Schoenig <[EMAIL PROTECTED]> wrote:
 
 i thought about creating an index on the expression but the problem
 is that this is hardly feasible.
 in 8.0 (what i have here) this would block the table and i would run

That may be hard to deal with.

 out of disk space as well. this is a 600 gb beast :(

I wouldn't expect this to be a problem. If you have 10^9 rows, I would expect
the index to be less than 10% of your current size. If you are so close to
your disk space limit that that is a problem, you have a problem in any case.
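
A rough back-of-envelope check of that figure, assuming a plain btree
over a 4-byte date with the default fill factor:

    8-byte index tuple header + 4-byte date, maxaligned  ->  16 bytes
    + 4-byte line pointer                                ->  ~20 bytes/entry
    10^9 entries * ~20 bytes                             ->  ~20 GB

which is on the order of 3-4% of a 600 GB table, comfortably under 10%.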

 
 what about the planner approach?
 this would solve the problem for some other issues as well. an index  
 might not be flexible enough :(.

If you disable sorting you might be able to get it to switch plans. Lying
about the amount of work memory, so that the planner thinks the hash will
fit in memory despite its overestimate of the number of groups, might also
help.
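
Concretely, the experiment could look like this (a sketch only;
session-local settings, and the work_mem value, which in 8.x is given
in KB, is illustrative):

set enable_sort = off;    -- make the sort-based GroupAggregate plan unattractive
set work_mem = 1048576;   -- claim 1 GB (value in KB) so the hash seems to fit
explain select some_timestamp::date
from very_large_table
group by some_timestamp::date;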



Re: [HACKERS] Planner estimates and cast operations

2006-09-04 Thread Hans-Juergen Schoenig


On Sep 4, 2006, at 7:04 PM, Bruno Wolff III wrote:

 On Mon, Sep 04, 2006 at 17:19:37 +0200,
   Hans-Juergen Schoenig <[EMAIL PROTECTED]> wrote:

 i thought about creating an index on the expression but the problem
 is that this is hardly feasible.
 in 8.0 (what i have here) this would block the table and i would run

 That may be hard to deal with.

it is ...
but the problem is not primarily that i have some problem with a
certain query. this can be solved somehow. i am thinking
about GROUP BY and estimates in general here ...

just wondering if there is a chance to improve ...


 out of disk space as well. this is a 600 gb beast :(

 I wouldn't expect this to be a problem. If you have 10^9 rows, I would
 expect the index to be less than 10% of your current size. If you are so
 close to your disk space limit that that is a problem, you have a problem
 in any case.

the index itself is not too large but when building it up it is
written several times. it is not funny when dealing with so much
data ...





 what about the planner approach?
 this would solve the problem for some other issues as well. an index
 might not be flexible enough :(.

 If you disable sorting you might be able to get it to switch plans. Lying
 about the amount of work memory, so that the planner thinks the hash will
 fit in memory despite its overestimate of the number of groups, might also
 help.



setting work_mem to 2gb does not help here ;)
set it to the max value on 8.0.
this was my first try too.
the problem is - there is no magic switch to mislead the planner a  
little without hacking the system stats (which is not what people  
should do i would say ;) ).
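
For illustration only, the kind of stats hacking alluded to would be an
update of pg_statistic like the sketch below; it is fragile because the
next ANALYZE silently overwrites it, and it skews every other query
touching the column, which is exactly why it is discouraged:

-- discouraged hack, sketched only to show what "hacking the system
-- stats" means: force the distinct estimate of the timestamp column
-- down to ~1000 so a hashed GROUP BY looks cheap
update pg_statistic
set stadistinct = 1000
where starelid = 'very_large_table'::regclass
  and staattnum = (select attnum from pg_attribute
                   where attrelid = 'very_large_table'::regclass
                     and attname = 'some_timestamp');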


my question is: is adding hooks for selectivity a feasible way of
dealing with things like that?


hans







Re: [HACKERS] Planner estimates and cast operations

2006-09-04 Thread Bruno Wolff III
On Mon, Sep 04, 2006 at 19:09:16 +0200,
  Hans-Juergen Schoenig <[EMAIL PROTECTED]> wrote:
 
 setting work_mem to 2gb does not help here ;)
 set it to the max value on 8.0.
 this was my first try too.
 the problem is - there is no magic switch to mislead the planner a  
 little without hacking the system stats (which is not what people  
 should do i would say ;) ).

Did you combine that with telling it not to use sorts? I am not sure that
will really work for GROUP BY, but it is probably an easy test. You can
do an explain to see what it will try without actually running the query
in case it picks the poor plan again.

 my question is: is adding hooks for selectivity a feasible way of
 dealing with things like that?

I think the expectation is that you create a functional index and that's
how you would tell the system to keep stats for particular functions. I
don't think data on the most common values are kept now for functional
indexes, but the index itself will still have clues about the data.
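
One can at least inspect what an expression index contributes; a sketch,
reusing Tom's example index from upthread (if memory serves, since 7.4
ANALYZE gathers statistics for expression indexes, exposed in pg_stats
under the index's name):

analyze foo;
-- stats for the indexed expression x::date appear under the index name
select attname, n_distinct, most_common_vals
from pg_stats
where tablename = 'fooi';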
