Re: [PERFORM] Querying 19million records very slowly

2005-06-22 Thread Paul Ramsey
=343125 width=8) (actual time=6323.75..6579.42 rows=64267 loops=1)
   Sort Key: date_trunc('hour'::text, time)
   ->  Index Scan using test_time_p1_idx on test
       (cost=0.00..1035479.58 rows=343125 width=8) (actual time=0.20..5858.67 rows=64267 loops=1)
       Index Cond: ((time > (now() - '1 day'::interval)) AND (p1 = 80))
 Total runtime: 7322.82 msec

Time: 7323.90 ms



database=> explain analyze select date_trunc('hour', time), count(*) as
total from test where p1=139 and time > now() - interval '24 hours' group
by date_trunc order by date_trunc;
 Aggregate  (cost=701562.34..703250.12 rows=22504 width=8) (actual time=2448.41..3033.80 rows=22 loops=1)
   ->  Group  (cost=701562.34..702687.53 rows=225037 width=8) (actual time=2417.39..2884.25 rows=36637 loops=1)
         ->  Sort  (cost=701562.34..702124.94 rows=225037 width=8) (actual time=2417.38..2574.19 rows=36637 loops=1)
               Sort Key: date_trunc('hour'::text, time)
               ->  Index Scan using test_time_p1_idx on test  (cost=0.00..679115.34 rows=225037 width=8) (actual time=8.47..2156.18 rows=36637 loops=1)
                     Index Cond: ((time > (now() - '1 day'::interval)) AND (p1 = 139))
 Total runtime: 3034.57 msec

Time: 3035.70 ms



Now, this query gives me all the hours in a day, with the count of rows
matching p1=53 for each hour. Pg takes 46.7 seconds with a seqscan, but
only 2.7 seconds with the index on (time, p1). I think when I set
enable_seqscan back to on, the planner went back to using the seqscan
instead of the index.
- Why does Pg not see the benefit of using the index?
- And how can I tune the optimiser settings in postgresql.conf to help it?


So now my PG uses a reasonable amount of time on these queries (with
enable_seqscan turned off).

The next thing that seems to slow my queries down is probably my
connection to PHP. I have a bash script running from cron on my server
(FreeBSD 4.11), which runs php on a PHP file. To force PG not to use
seqscans, I have modified postgresql.conf:

..
enable_seqscan = false
enable_indexscan = true
..
effective_cache_size = 1
random_page_cost = 2
..

I save the file, run 'pg_ctl reload' and then enter 'psql database'.

database=> show enable_seqscan ;
 enable_seqscan
----------------
 on
(1 row)


argus=> show effective_cache_size ;
 effective_cache_size
----------------------
 1000
(1 row)

I have read the manual pages for postgresql, postmaster and so on, but I
can't find anywhere to specify which config file Pg should use. I'm not
entirely sure it is actually reading the one I'm editing
(/usr/local/etc/postgresql.conf).
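
For what it's worth, these planner settings are also ordinary per-session
variables, so they can be forced from the client without editing
postgresql.conf at all. A rough sketch (the values are only examples):

SET enable_seqscan = off;
SET random_page_cost = 2;
SET effective_cache_size = 10000;
-- run the slow query here, then optionally:
RESET enable_seqscan;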

Any hints, tips or help is most appreciated!

Kjell Tore.





On 6/21/05, PFC [EMAIL PROTECTED] wrote:


Use CURRENT_TIME, which is a constant, instead of now(), which is not
considered constant... (I think)
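
Applied to the query in question, that idea amounts to handing the planner a
genuine constant for the cutoff, for example a literal timestamp computed by
the client (the value below is only a placeholder):

select date_trunc('hour', time), count(*) as total
from test
where p1 = 53
  and time > '2005-06-22 00:00:00'::timestamp
group by date_trunc
order by date_trunc;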



 Paul Ramsey
 Refractions Research
 Email: [EMAIL PROTECTED]
 Phone: (250) 885-0632




Re: [PERFORM] Querying 19million records very slowly

2005-06-21 Thread Paul Ramsey

Some tips:

- EXPLAIN ANALYZE provides a more useful analysis of a slow query, 
because it gives both the estimated and actual times/rows for each step 
in the plan.
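
For the query quoted further down, that would simply be (a sketch):

explain analyze
select date_trunc('hour', time), count(*) as total
from test
where p1 = 53 and time > now() - interval '24 hours'
group by date_trunc
order by date_trunc;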


- The documentation is right: columns with little variation are pretty 
useless to index. Indexing is about selectivity, reducing the amount 
of stuff the database has to read off the disk.


- You only have two things in your WHERE clause, so that is where the 
most important indexes reside. How many of your rows have p1=53? How 
many of your rows have happened in the last day? If your answer is a 
lot then the indexes are not going to help: PostgreSQL will be more 
efficient scanning every tuple than it will be jumping around the index 
structure for a large number of tuples.
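
A quick way to check those two numbers (just a sketch against the table
described below):

select count(*) from test;
select count(*) from test where p1 = 53;
select count(*) from test where time > now() - interval '1 day';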


- If neither time nor p1 are particularly selective individually, but 
they are selective when taken together, try a multi-key index on them both.
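
For example, something along these lines (the name matches the
test_time_p1_idx index that shows up in the follow-up plans; whether
(time, p1) or (p1, time) is the better column order is worth testing,
since p1 is an equality condition and time a range):

create index test_time_p1_idx on test (time, p1);
analyze test;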


Paul

Kjell Tore Fossbakk wrote:


Hello!

I use FreeBSD 4.11 with PostgreSQL 7.3.8.

I got a huge database with roughly 19 million records. There is just one
table, with a time field, a few ints and a few strings.


table test
fields time (timestamp), source (string), destination (string), p1 (int),
p2 (int)


I have run VACUUM ANALYZE ;

I have created indexes on every field, but for some reason my Postgres
server wants to use a seqscan, even though I know an index scan would be
much faster.


create index test_time_idx on test (time) ;
create index test_source_idx on test (source) ;
create index test_destination_idx on test (destination) ;
create index test_p1_idx on test (p1) ;
create index test_p2_idx on test (p2) ;



What is really strange is that when I run a count(*) filtered on one of the
int fields (p1) with a value that has a very low count, Postgres uses a
seqscan. In another count on the same int field (p1), which I know returns
about 2.2 million hits, it also uses a seqscan instead of an index scan.
Isn't the whole idea of indexing to increase performance on queries against
large tables? To make a sort of phone book of the values, so it is faster
to look up whatever you need... This just seems like the opposite..
Here is an EXPLAIN of my query:

database=> explain select date_trunc('hour', time), count(*) as total from
test where p1=53 and time > now() - interval '24 hours' group by
date_trunc order by date_trunc ;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=727622.61..733143.23 rows=73608 width=8)
   ->  Group  (cost=727622.61..731303.02 rows=736083 width=8)
         ->  Sort  (cost=727622.61..729462.81 rows=736083 width=8)
               Sort Key: date_trunc('hour'::text, time)
               ->  Seq Scan on test  (cost=0.00..631133.12 rows=736083 width=8)
                     Filter: ((p1 = 53) AND (time > (now() - '1 day'::interval)))
(6 rows)




database=> drop INDEX test_<TAB>
test_source_idx    test_destination_idx    test_p1_idx
test_p2_idx        test_time_idx



After all this, I tried to set enable_seqscan to off and enable_nestloop
to on. This didn't help much either; the query still takes minutes to run.
My result is the number of elements for each hour, and it gives about
1000-2000 hits per hour. I have read somewhere that PostgreSQL can easily
handle 100-200 million records, and that with a properly tuned system it
gives great performance. I would like to learn how :)

I also found an article on a page
( http://techdocs.postgresql.org/techdocs/pgsqladventuresep3.php):
Tip #11:  Don't bother indexing columns with huge numbers of records and a
small range of values, such as BOOLEAN columns.

This tip, regretfully, is perhaps the only tip where I cannot provide a
good, real-world example from my work.  So I'll give you a hypothetical
situation instead:

Imagine that you have a database table with a list of every establishment
vending ice cream in the US.  A simple example might look like:

Where there were almost 1 million rows, but due to simplistic data entry,
only three possible values for type (1-SUPERMARKET, 2-BOUTIQUE, and
3-OTHER) which are relatively evenly distributed.  In this hypothetical
situation, you might find (with testing using EXPLAIN) that an index on
type is ignored and the parser uses a seq scan (or table scan) instead.
This is because a table scan can actually be faster than an index scan in
this situation.  Thus, any index on type should be dropped.

Certainly, the boolean column (active) requires no indexing as it has only
two possible values and no index will be faster than a table scan.


Then I ask: what is the use of indexing, when I can't use it on a VERY
large database? On my 15 million record database it takes forever to do
seqscans over and over again... This is probably, as I mentioned earlier
(see the quote above), the reason why it chooses a full scan and not an
index scan...

So what do I do? :confused:

I've used SQL for years, but never in 

Re: [PERFORM] The never ending quest for clarity on shared_buffers

2004-10-06 Thread Paul Ramsey
Doug Y wrote:
  For idle persistent connections, does each of them allocate the memory 
specified by this setting (shared_buffers * 8k), or is it one pool used 
by all the connections (which seems the logical conclusion based on the 
name SHARED_buffers)? Personally I'm more inclined to think the latter, 
but I've seen references that alluded to both cases, and never a 
definitive answer.
The shared_buffers are shared (go figure) :).  It is all one pool shared 
by all connections.  The sort_mem and vacuum_mem are *per-connection*, 
however, so when allocating those sizes you have to take into account your 
expected number of concurrent connections.
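
As a purely illustrative calculation (the figures are invented, not taken
from this thread): with shared_buffers = 10000 there is a single shared pool
of 10000 * 8 kB, roughly 80 MB, no matter how many connections are open,
whereas sort_mem = 8192 (kB) with 50 connections each running a sort can
mean roughly 50 * 8 MB = 400 MB on top of the shared pool.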

Paul