Re: [PERFORM] Querying 19million records very slowly

2005-06-23 Thread Tobias Brox
>  I'm running FreeBSD 4.11, and I'm editing the file in
> /usr/local/etc/postgresql.conf, but it doesn't help.

On my system the 'live' config file resides in
/var/lib/postgresql/data/postgresql.conf - maybe you have them in
/usr/local/var/lib ...
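
(If you are on 8.0 or later you can ask the running server directly where its
files live; on 7.3 the postgresql.conf that counts is the one in the data
directory the postmaster was started with, i.e. its -D option or PGDATA.  A
quick check, assuming a new enough server:)

database=> SHOW data_directory ;
database=> SHOW config_file ;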

-- 
Tobias Brox, +86-13521622905
Nordicbet, IT dept



Re: [PERFORM] Querying 19million records very slowly

2005-06-23 Thread Kjell Tore Fossbakk
- I also changed now() to current_time, which improved performance quite a
  bit. I need to run further tests before I'll post any results.
- I tried changing now() - interval 'x hours' to a literal like
  '2005-06-22 16:00:00+02'. This also improved performance.

So either time > '2005-06-22 16:00:00+02' (or whatever date is 24 hours
back) or time > current_time - interval '24 hours' will be used.
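
For reference, a rough sketch of the two forms being compared - the literal
timestamp would normally be computed by the application as "now minus 24
hours":

database=> explain analyze select count(*) from test
database->   where p1 = 53 and time > '2005-06-22 16:00:00+02' ;

database=> explain analyze select count(*) from test
database->   where p1 = 53 and time > now() - interval '24 hours' ;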
 
I'm running FreeBSD 4.11, and I'm editing the file in /usr/local/etc/postgresql.conf, but it doesn't help. When I start up "psql database", none of the options are changed (with a restart of the postmaster). I can't find a '--configuration=path/file' option for the postmaster either...

 
Kjell Tore
 
On 6/22/05, Tobias Brox <[EMAIL PROTECTED]> wrote:
[Kjell Tore Fossbakk - Wed at 07:41:54AM -0700]
> I can't get the config file to load into my postgres. That's the
> problem. I want to set it to 10k, but it is still only at 1000... I
> save the file and restart the service..
>
> yes, I meant 'pg_ctl reload', sorry about that one.

Classical problem; it depends a bit on the distro you are using.

The "master" file usually resides in /etc/postgresql while the actual file
used usually resides in /var/lib/postgres/data ... or something.  Some
distros copy the file over (so you should always edit the file in /etc);
others don't (so you either have to do that yourself, or edit the file in
the database data directory).

--
Tobias Brox, +86-13521622905
Nordicbet, IT dept


Re: [PERFORM] Querying 19million records very slowly

2005-06-22 Thread Tobias Brox
[Kjell Tore Fossbakk - Wed at 07:41:54AM -0700]
> I can't get the config file to load into my postgres. That's the
> problem. I want to set it to 10k, but it is still only at 1000... I
> save the file and restart the service..
> 
> yes, I meant 'pg_ctl reload', sorry about that one.

Classical problem; it depends a bit on the distro you are using.

The "master" file usually resides in /etc/postgresql while the actual file
used usually resides in /var/lib/postgres/data ... or something.  Some
distros copy the file over (so you should always edit the file in /etc);
others don't (so you either have to do that yourself, or edit the file in
the database data directory).
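
Whichever file turns out to be the live one, a quick sanity check after
editing it and running 'pg_ctl reload' is to confirm the value actually
changed, roughly:

database=> SHOW effective_cache_size ;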

-- 
Tobias Brox, +86-13521622905
Nordicbet, IT dept



Re: [PERFORM] Querying 19million records very slowly

2005-06-22 Thread Stefan Weiss
On 2005-06-22 10:55, Michael Glaesemann wrote:
> There has been discussion in the past on storing the time zone name  
> with the timestamptz as well, though no one has implemented this yet.

The reason for this may be that time zone names (abbreviations) are not
unique. For example, "ECT" can mean "Ecuador Time" (offset -05) or
"Eastern Caribbean Time" (offset -04).

http://www.worldtimezone.com/wtz-names/timezonenames.html

cheers,
stefan



Re: [PERFORM] Querying 19million records very slowly

2005-06-22 Thread Kjell Tore Fossbakk
I can't get the config file to load into my postgres. That's the
problem. I want to set it to 10k, but it is still only at 1000... I
save the file and restart the service..

yes, I meant 'pg_ctl reload', sorry about that one.

kjell tore

On 6/22/05, Bricklen Anderson <[EMAIL PROTECTED]> wrote:
> >> enable_seqscan = false
> >> enable_indexscan = true
> >> ..
> >> effective_cache_size = 1
> >> random_page_cost = 2
> >> ..
> >>
> >> I save the file, type 'pg_crl reload' then enter 'psql database'.
> >>
> >> argus=> show effective_cache_size ;
> >>  effective_cache_size
> >> --
> >>  1000
> >> (1 row)
> 
> I assume that 'pg_crl' is a typo? That should read 'pg_ctl reload'.
> Also, you said that your effective_cache_size = 1, yet when you SHOW it,
> it's only 1000. A cut 'n paste error, or maybe your erroneous "pg_crl"
> didn't trigger the reload?
> 
> -- 



Re: [PERFORM] Querying 19million records very slowly

2005-06-22 Thread Paul Ramsey
OK, so the planner is in fact making a mistake (I think).  Try turning
down your random_page_cost a little. It defaults to 4.0; see if 2.0
works "right". (Be careful: if you move these settings around too much for
one query, you will wreck others.)  4.0 is a little large for almost all
modern hardware, so see if moving it down a little makes the planner
somewhat smarter.
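
For a quick, session-only test you can lower it right before re-running the
query, something along these lines (putting it in postgresql.conf and
reloading makes it permanent):

database=> SET random_page_cost = 2.0 ;
database=> explain analyze select count(*) from test
database->   where p1 = 53 and time > now() - interval '24 hours' ;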


P

On Wednesday, June 22, 2005, at 12:45 AM, Kjell Tore Fossbakk wrote:


database=> set enable_seqscan to on;
SET
Time: 0.34 ms



database=> explain analyze select count(*) from test where p1=53;
  QUERY PLAN
-----------------------------------------------------------------------------
 Aggregate  (cost=522824.50..522824.50 rows=1 width=0) (actual
time=56380.72..56380.72 rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..517383.30 rows=2176479 width=0)
(actual time=9.61..47677.48 rows=2220746 loops=1)
 Filter: (p1 = 53)
 Total runtime: 56380.79 msec
(4 rows)

Time: 56381.40 ms



database=> explain analyze select count(*) from test where p1=53 and
time > now() - interval '24 hours' ;
   QUERY PLAN
-----------------------------------------------------------------------------
 Aggregate  (cost=661969.01..661969.01 rows=1 width=0) (actual
time=45787.02..45787.02 rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..660155.28 rows=725493 width=0)
(actual time=37799.32..45613.58 rows=42424 loops=1)
 Filter: ((p1 = 53) AND ("time" > (now() - '1 day'::interval)))
 Total runtime: 45787.09 msec
(4 rows)

Time: 45787.79 ms



database=> explain analyze select date_trunc('hour', time),count(*) as
total from test where p1=53 and time>now()-interval '24 hours' group
by date_trunc order by date_trunc;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Aggregate  (cost=755116.97..760558.17 rows=72549 width=8) (actual
time=46040.63..46717.61 rows=23 loops=1)
   ->  Group  (cost=755116.97..758744.44 rows=725493 width=8) (actual
time=46022.06..46548.84 rows=42407 loops=1)
 ->  Sort  (cost=755116.97..756930.70 rows=725493 width=8)
(actual time=46022.04..46198.94 rows=42407 loops=1)
   Sort Key: date_trunc('hour'::text, "time")
   ->  Seq Scan on test  (cost=0.00..660155.28 rows=725493
width=8) (actual time=37784.91..45690.88 rows=42407 loops=1)
 Filter: ((p1 = 53) AND ("time" > (now() - '1
day'::interval)))
 Total runtime: 46718.43 msec
(7 rows)

Time: 46719.44 ms



database=> create index test_time_p1_idx on test(time,p1) ;
CREATE INDEX
Time: 178926.02 ms

database=> vacuum analyze test ;
VACUUM
Time: 73058.33 ms

database=> \d test
   Table "public.test"
  Column |   Type   | Modifiers
-+--+---
 time| timestamp with time zone |
 source  | inet |
 destination | inet |
 p1  | integer  |
 p2  | integer  |



database=> \di
 public | test_time_idx| index | database | test
 public | test_source_idx  | index | database | test
 public | test_destination_idx | index | database | test
 public | test_p1_idx  | index | database | test
 public | test_p2_idx  | index | database | test
 public | test_time_p1_idx | index | database | test



database=> set enable_seqscan to off ;
SET
Time: 0.28 ms



database=> explain analyze select date_trunc('hour', time),count(*) as
total from test where p1=53 and time>now()-interval '24 hours' group
by date_trunc order by date_trunc;
 Aggregate  (cost=2315252.66..2320767.17 rows=73527 width=8) (actual
time=2081.15..2720.44 rows=23 loops=1)
   ->  Group  (cost=2315252.66..2318929.00 rows=735268 width=8)
(actual time=2079.76..2564.22 rows=41366 loops=1)
 ->  Sort  (cost=2315252.66..2317090.83 rows=735268 width=8)
(actual time=2079.74..2243.32 rows=41366 loops=1)
   Sort Key: date_trunc('hour'::text, "time")
   ->  Index Scan using test_time_p1_idx on test
(cost=0.00..2218878.46 rows=735268 width=8) (actual
time=29.50..1774.52 rows=41366 loops=1)
 Index Cond: (("time" > (now() - '1
day'::interval)) AND (p1 = 53))
 Total runtime: 2735.42 msec

Time: 2736.48 ms



database=> explain analyze select date_trunc('hour', time),count(*) as
total from test where p1=80 and time>now()-interval '24 hours' group
by date_trunc order by date_trunc;
 Aggregate  (cost=1071732.15..1074305.59 rows=34313 width=8) (actual
time=6353.93..7321.99 rows=22 loops=1)
   ->  Group  (cost=1071732.15..1073447.77 rows=343125 width=8)
(actual time=6323.76..7078.10 rows=64267 loops=1)
 ->  Sort

Re: [PERFORM] Querying 19million records very slowly

2005-06-22 Thread Tom Lane
Tobias Brox <[EMAIL PROTECTED]> writes:
>> time (datetime) > '2005-06-21 10:36:22+02'
>> or 
>> time (timestamp) > 'some timestamp pointing to yesterday'

> If I have understood it correctly, the planner will recognize the timestamp
> and compare it with the statistics in the first example but not in the
> second, and thus it will be more likely to use index scan on the first one
> and seqscan on the second.

That statement is true for releases before 8.0.  Kjell has not at any
point told us what PG version he is running, unless I missed it...
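
One way to settle that is simply to check, and on a pre-8.0 server the
workaround is to let the application compute the cutoff and send it as a
literal; a rough sketch:

database=> SELECT version() ;
database=> explain analyze select count(*) from test
database->   where p1 = 53 and time > '2005-06-22 16:00:00+02' ;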

regards, tom lane



Re: [PERFORM] Querying 19million records very slowly

2005-06-22 Thread Tobias Brox
[Kjell Tore Fossbakk - Wed at 11:10:42AM +0200]
> Which one does Pg read fastest? Does it convert the datetime in the table,
> then my where clause, and check, for each row? How does it compare a
> datetime with a datetime? Timestamps are easy: one large number bigger than
> another large number..
> 
> time (datetime) > '2005-06-21 10:36:22+02'
> 
> or 
> 
> time (timestamp) > 'some timestamp pointing to yesterday'

If I have understood it correctly, the planner will recognize the timestamp
and compare it with the statistics in the first example but not in the
second, and thus it will be more likely to use index scan on the first one
and seqscan on the second.

-- 
Tobias Brox, +86-13521622905
Nordicbet, IT dept



Re: [PERFORM] Querying 19million records very slowly

2005-06-22 Thread Kjell Tore Fossbakk
> Try to type '2005-06-21 16:36:22+08' directly in the query, and see if it
> makes a difference.  Or probably '2005-06-21 10:36:22+02' in your case ;-)

Which one does Pg read fastest? Does it convert the datetime in the table,
then my where clause, and check, for each row? How does it compare a
datetime with a datetime? Timestamps are easy: one large number bigger than
another large number..

time (datetime) > '2005-06-21 10:36:22+02'

or 

time (timestamp) > 'some timestamp pointing to yesterday'

Hmm.. I can't find any doc that describes this very well.


On 6/22/05, Michael Glaesemann <[EMAIL PROTECTED]> wrote:
> 
> On Jun 22, 2005, at 5:39 PM, Tobias Brox wrote:
> 
> > (btw, does postgresql really handle timezones?  '+02' is quite different
> > from 'CET', which will be obvious sometime in the late autumn...)
> 
> Yes, it does. It doesn't (currently) record the time zone name, but  
> rather only the offset from UTC. If a time zone name (rather than UTC  
> offset) is given, it is converted to the UTC offset *at that  
> timestamptz* when it is stored. For time zones that take into account  
> DST, their UTC offset changes during the year, and PostgreSQL records  
> the equivalent UTC offset for the appropriate timestamptz values.
> 
> There has been discussion in the past on storing the time zone name  
> with the timestamptz as well, though no one has implemented this yet.
> 
> Michael Glaesemann
> grzm myrealbox com
> 
> 



Re: [PERFORM] Querying 19million records very slowly

2005-06-22 Thread Michael Glaesemann


On Jun 22, 2005, at 5:39 PM, Tobias Brox wrote:

(btw, does postgresql really handle timezones?  '+02' is quite different
from 'CET', which will be obvious sometime in the late autumn...)


Yes, it does. It doesn't (currently) record the time zone name, but  
rather only the offset from UTC. If a time zone name (rather than UTC  
offset) is given, it is converted to the UTC offset *at that  
timestamptz* when it is stored. For time zones that take into account  
DST, their UTC offset changes during the year, and PostgreSQL records  
the equivalent UTC offset for the appropriate timestamptz values.
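
A small illustration (the output depends on the session's TimeZone setting,
and assumes the server recognizes the CET abbreviation): a value entered
with a zone name comes back rendered with a numeric UTC offset; the name
itself is not kept.

database=> SELECT '2005-06-22 16:00:00 CET'::timestamptz ;
database=> SELECT '2005-06-22 16:00:00+02'::timestamptz ;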


There has been discussion in the past on storing the time zone name  
with the timestamptz as well, though no one has implemented this yet.


Michael Glaesemann
grzm myrealbox com




Re: [PERFORM] Querying 19million records very slowly

2005-06-22 Thread Tobias Brox
[Kjell Tore Fossbakk - Wed at 10:18:30AM +0200]
> I'll test the use of current_timestamp, rather than now(). I am not
> sure if Pg can do a match between a fixed timestamp and a datetime?

I have almost all my experience with timestamps without timezones, but ...
isn't that almost the same as the datetime type?

> time > current_timestamp - interval '24 hours',
> when time is yyyy-mm-dd hh:mm:ss+02, like 2005-06-22 16:00:00+02.

Try to type '2005-06-21 16:36:22+08' directly in the query, and see if it
makes a difference.  Or probably '2005-06-21 10:36:22+02' in your case ;-)

(btw, does postgresql really handle timezones?  '+02' is quite different
from 'CET', which will be obvious sometime in the late autumn...)

-- 
Tobias Brox, +86-13521622905
Nordicbet, IT dept



Re: [PERFORM] Querying 19million records very slowly

2005-06-22 Thread Kjell Tore Fossbakk
Appreciate your time, Mr Brox.

I'll test the use of current_timestamp rather than now(). I am not
sure if Pg can do a match between a fixed timestamp and a datetime?

time > current_timestamp - interval '24 hours',
when time is yyyy-mm-dd hh:mm:ss+02, like 2005-06-22 16:00:00+02.

If Pg can't do it, and current_time is faster, I could possibly convert
the time field in my database to timestamp, and insert all rows as
timestamp rather than a datetime. But that is some script to run over
19 mill rows, so I need to know if it will give me any more speed..
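
For what it's worth, the column is already timestamp with time zone (per the
\d test output quoted above in this thread), so it should compare directly
against a literal without rewriting the stored rows; a sketch:

database=> explain analyze select count(*) from test
database->   where p1 = 53 and time > '2005-06-22 16:00:00+02' ;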

Kjell Tore.

On 6/22/05, Tobias Brox <[EMAIL PROTECTED]> wrote:
> [Kjell Tore Fossbakk - Wed at 09:45:22AM +0200]
> > database=> explain analyze select count(*) from test where p1=53 and
> > time > now() - interval '24 hours' ;
> 
> Sorry to say that I have neither followed the entire thread nor read the
> entire email I'm replying to, but I have a quick hint on this one (ref my
> earlier thread about timestamp indices) - the postgresql planner will
> generally behave smarter when using a fixed timestamp (typically generated
> by the app server) than logic based on now().
> 
> One of my colleagues also claimed that he found the usage of
> localtimestamp faster than now().
> 
> -- 
> Tobias Brox, +86-13521622905
> Nordicbet, IT dept
>



Re: [PERFORM] Querying 19million records very slowly

2005-06-22 Thread Tobias Brox
[Kjell Tore Fossbakk - Wed at 09:45:22AM +0200]
> database=> explain analyze select count(*) from test where p1=53 and
> time > now() - interval '24 hours' ;

Sorry to say that I have neither followed the entire thread nor read the
entire email I'm replying to, but I have a quick hint on this one (ref my
earlier thread about timestamp indices) - the postgresql planner will
generally behave smarter when using a fixed timestamp (typically generated
by the app server) than logic based on now().

One of my colleagues also claimed that he found the usage of
localtimestamp faster than now().
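
All three are easy to compare side by side; note that localtimestamp returns
a timestamp without time zone, while now() and current_timestamp return
timestamp with time zone:

database=> SELECT now(), current_timestamp, localtimestamp ;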

-- 
Tobias Brox, +86-13521622905
Nordicbet, IT dept



Re: [PERFORM] Querying 19million records very slowly

2005-06-22 Thread Kjell Tore Fossbakk
database=> set enable_seqscan to on;
SET
Time: 0.34 ms



database=> explain analyze select count(*) from test where p1=53;
  QUERY PLAN
---
 Aggregate  (cost=522824.50..522824.50 rows=1 width=0) (actual
time=56380.72..56380.72 rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..517383.30 rows=2176479 width=0)
(actual time=9.61..47677.48 rows=2220746 loops=1)
 Filter: (p1 = 53)
 Total runtime: 56380.79 msec
(4 rows)

Time: 56381.40 ms



database=> explain analyze select count(*) from test where p1=53 and
time > now() - interval '24 hours' ;
   QUERY PLAN

 Aggregate  (cost=661969.01..661969.01 rows=1 width=0) (actual
time=45787.02..45787.02 rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..660155.28 rows=725493 width=0)
(actual time=37799.32..45613.58 rows=42424 loops=1)
 Filter: ((p1 = 53) AND ("time" > (now() - '1 day'::interval)))
 Total runtime: 45787.09 msec
(4 rows)

Time: 45787.79 ms



database=> explain analyze select date_trunc('hour', time),count(*) as
total from test where p1=53 and time>now()-interval '24 hours' group
by date_trunc order by date_trunc;
                                 QUERY PLAN

 Aggregate  (cost=755116.97..760558.17 rows=72549 width=8) (actual
time=46040.63..46717.61 rows=23 loops=1)
   ->  Group  (cost=755116.97..758744.44 rows=725493 width=8) (actual
time=46022.06..46548.84 rows=42407 loops=1)
 ->  Sort  (cost=755116.97..756930.70 rows=725493 width=8)
(actual time=46022.04..46198.94 rows=42407 loops=1)
   Sort Key: date_trunc('hour'::text, "time")
   ->  Seq Scan on test  (cost=0.00..660155.28 rows=725493
width=8) (actual time=37784.91..45690.88 rows=42407 loops=1)
 Filter: ((p1 = 53) AND ("time" > (now() - '1
day'::interval)))
 Total runtime: 46718.43 msec
(7 rows)

Time: 46719.44 ms



database=> create index test_time_p1_idx on test(time,p1) ;
CREATE INDEX
Time: 178926.02 ms

database=> vacuum analyze test ;
VACUUM
Time: 73058.33 ms

database=> \d test
   Table "public.test"
  Column |   Type   | Modifiers
-+--+---
 time| timestamp with time zone |
 source  | inet |
 destination | inet |
 p1  | integer  |
 p2  | integer  |



database=> \di
 public | test_time_idx| index | database | test
 public | test_source_idx  | index | database | test
 public | test_destination_idx | index | database | test
 public | test_p1_idx  | index | database | test
 public | test_p2_idx  | index | database | test
 public | test_time_p1_idx | index | database | test



database=> set enable_seqscan to off ;
SET
Time: 0.28 ms



database=> explain analyze select date_trunc('hour', time),count(*) as
total from test where p1=53 and time>now()-interval '24 hours' group
by date_trunc order by date_trunc;
 Aggregate  (cost=2315252.66..2320767.17 rows=73527 width=8) (actual
time=2081.15..2720.44 rows=23 loops=1)
   ->  Group  (cost=2315252.66..2318929.00 rows=735268 width=8)
(actual time=2079.76..2564.22 rows=41366 loops=1)
 ->  Sort  (cost=2315252.66..2317090.83 rows=735268 width=8)
(actual time=2079.74..2243.32 rows=41366 loops=1)
   Sort Key: date_trunc('hour'::text, "time")
   ->  Index Scan using test_time_p1_idx on test 
(cost=0.00..2218878.46 rows=735268 width=8) (actual
time=29.50..1774.52 rows=41366 loops=1)
 Index Cond: (("time" > (now() - '1
day'::interval)) AND (p1 = 53))
 Total runtime: 2735.42 msec

Time: 2736.48 ms



database=> explain analyze select date_trunc('hour', time),count(*) as
total from test where p1=80 and time>now()-interval '24 hours' group
by date_trunc order by date_trunc;
 Aggregate  (cost=1071732.15..1074305.59 rows=34313 width=8) (actual
time=6353.93..7321.99 rows=22 loops=1)
   ->  Group  (cost=1071732.15..1073447.77 rows=343125 width=8)
(actual time=6323.76..7078.10 rows=64267 loops=1)
 ->  Sort  (cost=1071732.15..1072589.96 rows=343125 width=8)
(actual time=6323.75..6579.42 rows=64267 loops=1)
   Sort Key: date_trunc('hour'::text, "time")
   ->  Index Scan using test_time_p1_idx on test 
(cost=0.00..1035479.58 rows=343125 width=8) (actual time=0.20..5858.67
rows=64267 loops=1)
 Index Cond: (("time" > (now() - '1
day'::interval)) AND (p1 = 80))
 Total runtime: 7322.82 msec

Time: 7323.90 ms




Re: [PERFORM] Querying 19million records very slowly

2005-06-21 Thread PFC


Use CURRENT_TIME, which is a constant, instead of now(), which is not
considered constant... (I think)




Re: [PERFORM] Querying 19million records very slowly

2005-06-21 Thread Paul Ramsey

Some tips:

- EXPLAIN ANALYZE provides a more useful analysis of a slow query, 
because it gives both the estimate and actual times/rows for each step 
in the plan.


- The documentation is right: columns with little variation are pretty 
useless to index. Indexing is about "selectivity", reducing the amount 
of stuff the database has to read off the disk.


- You only have two things in your WHERE clause, so that is where the 
most important indexes reside. How many of your rows have p1=53? How 
many of your rows have happened in the last day? If your answer is "a 
lot" then the indexes are not going to help: PostgreSQL will be more 
efficient scanning every tuple than it will be jumping around the index 
structure for a large number of tuples.


- If neither time nor p1 are particularly selective individually, but 
they are selective when taken together, try a multi-key index on them both.
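
A sketch of that, using the column names from the original post (this is the
test_time_p1_idx index that shows up in Kjell's follow-up):

database=> create index test_time_p1_idx on test (time, p1) ;
database=> vacuum analyze test ;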


Paul

Kjell Tore Fossbakk wrote:


Hello!

I use FreeBSD 4.11 with PostgreSQL 7.3.8.

I got a huge database with roughly 19 million records. There is just one
table, with a time field, a few ints and a few strings.


table test
fields time (timestamp), source (string), destination (string), p1 (int),
p2 (int)


I have run VACUUM ANALYZE ;

I have created indexes on every field, but for some reason my postgres
server wants to use a seqscan, even though I know an indexed scan would be
much faster.


create index test_time_idx on test (time) ;
create index test_source_idx on test (source) ;
create index test_destination_idx on test (destination) ;
create index test_p1_idx on test (p1) ;
create index test_p2_idx on test (p2) ;



What is really strange is that when I query a count(*) on one of the int
fields (p1), which has a very low count, postgres uses a seqscan. In another
count on the same int field (p1), I know it returns about 2.2 million
hits, but then it suddenly uses a seqscan instead of an indexed one. Isn't
the whole idea of indexing to increase performance in large queries.. To
make sort of a phonebook for the values, to make it faster to look up
whatever you need... This just seems opposite..

Here is an EXPLAIN of my query

database=> explain select date_trunc('hour', time),count(*) as total from
test where p1=53 and time > now() - interval '24 hours' group by
date_trunc order by date_trunc ;
QUERY PLAN
--
Aggregate  (cost=727622.61..733143.23 rows=73608 width=8)
   ->  Group  (cost=727622.61..731303.02 rows=736083 width=8)
 ->  Sort  (cost=727622.61..729462.81 rows=736083 width=8)
   Sort Key: date_trunc('hour'::text, "time")
   ->  Seq Scan on test  (cost=0.00..631133.12 rows=736083
width=8)
 Filter: ((p1 = 53) AND ("time" > (now() - '1
day'::interval)))
(6 rows)




database=> drop INDEX test_
test_source_idx        test_destination_idx   test_p1_idx
test_p2_idx            test_time_idx



After all this, I tried to set enable_seqscan to off and
enable_nestedloops to on. This didn't help much either. The time to run the
query is still in minutes. My results are the number of elements for each
hour, and it gives about 1000-2000 hits per hour. I have read somewhere
about PostgreSQL that it can easily handle 100-200 million records, and
with a right-tuned system have great performance.. I would like to
learn how :)

I also found an article on a page
( http://techdocs.postgresql.org/techdocs/pgsqladventuresep3.php):
Tip #11:  Don't bother indexing columns with huge numbers of records and a
small range of values, such as BOOLEAN columns.

This tip, regretfully, is perhaps the only tip where I cannot provide a
good, real-world example from my work.  So I'll give you a hypothetical
situation instead:

Imagine that you have a database table with a list of every establishment
vending ice cream in the US.  A simple example might look like:

Where there were almost 1 million rows, but due to simplistic data entry,
only three possible values for type (1-SUPERMARKET, 2-BOUTIQUE, and
3-OTHER) which are relatively evenly distributed.  In this hypothetical
situation, you might find (with testing using EXPLAIN) that an index on
type is ignored and the parser uses a "seq scan" (or table scan) instead.
This is because a table scan can actually be faster than an index scan in
this situation.  Thus, any index on type should be dropped.

Certainly, the boolean column (active) requires no indexing as it has only
two possible values and no index will be faster than a table scan.


Then I ask, what use is indexing if I can't use it on a VERY large
database? On my 15 million record database it takes forever to do seqscans
over and over again... This is probably, as I mentioned earlier, the reason
(read the quote) why it chooses a full scan and not an indexed one...

So what do I do? :confused:

I'v used SQL for years, but n

[PERFORM] Querying 19million records very slowly

2005-06-21 Thread Kjell Tore Fossbakk
Hello!

I use FreeBSD 4.11 with PostgreSQL 7.3.8.

I got a huge database with roughly 19 million records. There is just one
table, with a time field, a few ints and a few strings.

table test
fields time (timestamp), source (string), destination (string), p1 (int),
p2 (int)

I have run VACUUM ANALYZE ;

I have created indexes on every field, but for some reason my postgres
server wants to use a seqscan, even though I know an indexed scan would be
much faster.

create index test_time_idx on test (time) ;
create index test_source_idx on test (source) ;
create index test_destination_idx on test (destination) ;
create index test_p1_idx on test (p1) ;
create index test_p2_idx on test (p2) ;

What is really strange is that when I query a count(*) on one of the int
fields (p1), which has a very low count, postgres uses a seqscan. In another
count on the same int field (p1), I know it returns about 2.2 million
hits, but then it suddenly uses a seqscan instead of an indexed one. Isn't
the whole idea of indexing to increase performance in large queries.. To
make sort of a phonebook for the values, to make it faster to look up
whatever you need... This just seems opposite..

Here is an EXPLAIN of my query

database=> explain select date_trunc('hour', time),count(*) as total from
test where p1=53 and time > now() - interval '24 hours' group by
date_trunc order by date_trunc ;
                                 QUERY PLAN
-----------------------------------------------------------------------------
 Aggregate  (cost=727622.61..733143.23 rows=73608 width=8)
   ->  Group  (cost=727622.61..731303.02 rows=736083 width=8)
         ->  Sort  (cost=727622.61..729462.81 rows=736083 width=8)
               Sort Key: date_trunc('hour'::text, "time")
               ->  Seq Scan on test  (cost=0.00..631133.12 rows=736083 width=8)
                     Filter: ((p1 = 53) AND ("time" > (now() - '1 day'::interval)))
(6 rows)

database=> drop INDEX test_
test_source_idx        test_destination_idx   test_p1_idx
test_p2_idx            test_time_idx

After all this, I tried to set enable_seqscan to off and
enable_nestedloops to on. This didn't help much either. The time to run the
query is still in minutes. My results are the number of elements for each
hour, and it gives about 1000-2000 hits per hour. I have read somewhere
about PostgreSQL that it can easily handle 100-200 million records, and
with a right-tuned system have great performance.. I would like to
learn how :)

I also found an article on a page
(http://techdocs.postgresql.org/techdocs/pgsqladventuresep3.php):

Tip #11:  Don't bother indexing columns with huge numbers of records and a
small range of values, such as BOOLEAN columns.

This tip, regretfully, is perhaps the only tip where I cannot provide a
good, real-world example from my work.  So I'll give you a hypothetical
situation instead:

Imagine that you have a database table with a list of every establishment
vending ice cream in the US.  A simple example might look like:

Where there were almost 1 million rows, but due to simplistic data entry,
only three possible values for type (1-SUPERMARKET, 2-BOUTIQUE, and
3-OTHER) which are relatively evenly distributed.  In this hypothetical
situation, you might find (with testing using EXPLAIN) that an index on
type is ignored and the parser uses a "seq scan" (or table scan) instead.
This is because a table scan can actually be faster than an index scan in
this situation.  Thus, any index on type should be dropped.

Certainly, the boolean column (active) requires no indexing as it has only
two possible values and no index will be faster than a table scan.

Then I ask, what use is indexing if I can't use it on a VERY large
database? On my 15 million record database it takes forever to do seqscans
over and over again... This is probably, as I mentioned earlier, the reason
(read the quote) why it chooses a full scan and not an indexed one...

So what do I do? :confused:

I've used SQL for years, but never on such a big scale, so I never had to
learn how to deal with large numbers of records. Usually a maximum of 1000
records. Now, with millions, I need to learn a way to make my sucky
queries better.

I'm trying to learn more about tuning my system, making better queries and
such. I've found some documents on the Internet, but far from the best.

Feedback most appreciated!

Regards,
a learning PostgreSQL user