Re: [PERFORM] large tables and simple "= constant" queries using indexes

2008-04-10 Thread Arjen van der Meijden

Hi John,

You don't use the same 'gene_ref ='-value, so it's not a perfect 
comparison. And obviously there is the fact that the data can be in the 
disk cache the second time you run it, which would explain the almost 
instantaneous result for the second query.


If repeating the query a few times with 200 still makes it do its work 
in 15 seconds and with 800 in less than 100ms, then you might have found 
a bug, or it is at least something I don't know how to fix.


I doubt upping the default for all tables to 1000 is a good idea. The 
data collected is used in the query-planning stage, where more data 
means more processing time. Obviously there is a tradeoff somewhere 
between having more statistics and thus being able to plan the query 
better, versus requiring more time to process those statistics.


Best regards,

Arjen

On 10-4-2008 0:24 John Beaver wrote:
Perfect - thanks Arjen. Using your value of 200 decreased the time to 15 
seconds, and using a value of 800 makes it almost instantaneous. I'm 
really not concerned about space usage; if having more statistics 
increases performance this much, maybe I'll just default it to 1000?


Strangely, the steps taken in the explain analyze are all the same. The 
only differences are the predicted costs (and execution times).


explain analyze for a statistics of 200:
"Aggregate  (cost=8831.27..8831.28 rows=1 width=0) (actual 
time=15198.407..15198.408 rows=1 loops=1)"
"  ->  Bitmap Heap Scan on gene_prediction_view  (cost=44.16..8825.29 
rows=2392 width=0) (actual time=19.719..15191.875 rows=2455 loops=1)"

"Recheck Cond: (gene_ref = 500)"
"->  Bitmap Index Scan on ix_gene_prediction_view_gene_ref  
(cost=0.00..43.56 rows=2392 width=0) (actual time=18.871..18.871 
rows=2455 loops=1)"

"  Index Cond: (gene_ref = 500)"
"Total runtime: 15198.651 ms"

explain analyze for a statistics of 800:
"Aggregate  (cost=8873.75..8873.76 rows=1 width=0) (actual 
time=94.473..94.473 rows=1 loops=1)"
"  ->  Bitmap Heap Scan on gene_prediction_view  (cost=44.25..8867.74 
rows=2404 width=0) (actual time=39.358..93.733 rows=2455 loops=1)"

"Recheck Cond: (gene_ref = 301)"
"->  Bitmap Index Scan on ix_gene_prediction_view_gene_ref  
(cost=0.00..43.65 rows=2404 width=0) (actual time=38.472..38.472 
rows=2455 loops=1)"

"  Index Cond: (gene_ref = 301)"
"Total runtime: 94.622 ms"




Arjen van der Meijden wrote:
First of all, there is the 'explain analyze' output, which is pretty 
helpful in postgresql.


My guess is that postgresql decides to do a table scan for some reason. It 
might not have enough statistics for this particular table or column 
to make a sound decision. What you can try is to increase the 
statistics target, which is pretty easy:

ALTER TABLE gene_prediction_view ALTER gene_ref SET STATISTICS 200;

Valid values range from 1 (or possibly 0) to 1000; the default is 10, though 
the default on my systems is usually 100. For such a large table, I'd go with 200.


After that, you'll need to re-analyze your table and you can try again.
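For example (using the table from your post):

ANALYZE gene_prediction_view;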

Perhaps analyze should try to establish its own best guess as to how many 
samples it should take? The default of 10 is rather limited for large 
tables.


Best regards,

Arjen

On 9-4-2008 22:58 John Beaver wrote:
Hi, I've started my first project with Postgres (after several years 
of using Mysql), and I'm having an odd performance problem that I was 
hoping someone might be able to explain the cause of.


My query
   - select count(*) from gene_prediction_view where gene_ref = 523
   - takes 26 seconds to execute, and returns 2400 (out of a total of 
15 million records in the table)


---My problem---
   Using a single-column index to count 2400 records that all have 
the same constant value doesn't sound like something that would 
take 26 seconds. What's the slowdown? Any silver bullets that might 
fix this?


Steps I've taken
   - I ran vacuum and analyze
   - I upped the shared_buffers to 58384, and I upped some of the 
other postgresql.conf values as well. Nothing seemed to help 
significantly, but maybe I missed something that would help 
specifically for this query type?
   - I tried to create a hash index, but gave up after more than 4 
hours of waiting for it to finish indexing


Table stats
   - 15 million rows; I'm expecting to have four or five times this 
number eventually.

   - 1.5 gigs of hard drive usage

My development environment---
   - 2.6ghz dual-core MacBook Pro with 4 gigs of ram and a 7200 rpm 
hard drive

   - OS X 10.5.2
   - Postgres 8.3 (installed via MacPorts)

My table

CREATE TABLE gene_prediction_view
(
 id serial NOT NULL,
 gene_ref integer NOT NULL,
 go_id integer NOT NULL,
 go_description character varying(200) NOT NULL,
 go_category character varying(50) NOT NULL,
 function_verified_exactly boolean NOT NULL,
 function_verified_with_parent_go boolean NOT NULL,
 function_verified_with_child_go boolean NOT NULL,
 score numeric(10,2) 

Re: [PERFORM] large tables and simple "= constant" queries using indexes

2008-04-10 Thread PFC


Perfect - thanks Arjen. Using your value of 200 decreased the time to  
15 seconds, and using a value of 800 makes it almost instantaneous. I'm  
really not concerned about space usage; if having more statistics  
increases performance this much, maybe I'll just default it to 1000?
 Strangely, the steps taken in the explain analyze are all the same.  
The only differences are the predicted costs (and execution times).

 explain analyze for a statistics of 200:



	Actually, since you got the exact same plans and the second one is a lot  
faster, this can mean that the data is in the disk cache, or that the  
second query has all the rows it needs contiguous on disk whereas the  
first one has its rows all over the place. Therefore you are IO-bound.  
Statistics helped, perhaps (impossible to know since you don't provide the  
plan with statistics set to 10), but your main problem is IO.

Usually setting the statistics to 100 is enough...

Now, here are some solutions to your problem in random order :

- Install 64 bit Linux, 64 bit Postgres, and get lots of RAM, lol.
	- Switch to a RAID10 (4 times the IOs per second, however zero gain if  
you're single-threaded, but massive gain when concurrent)


	- If you just need a count by gene_ref, a simple solution is to keep it  
in a separate table and update it via triggers; this is a frequently used  
solution, and it works well unless gene_ref is updated all the time (which is  
probably not your case). Since you will be vacuuming this count-cache  
table often, don't put the count as a field in your sgd_annotations table;  
just create a small table with 2 fields, gene_ref and count (unless you  
want to use the count for other things and you don't like the join). A  
rough sketch follows below.
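	A minimal sketch of that count-cache idea (the table, column and trigger  
names are only illustrative, and it ignores races between concurrent inserts  
of a brand-new gene_ref, which you would need to handle or avoid):

CREATE TABLE gene_ref_counts (
    gene_ref  integer PRIMARY KEY,
    row_count bigint NOT NULL DEFAULT 0
);

CREATE OR REPLACE FUNCTION maintain_gene_ref_counts() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE gene_ref_counts SET row_count = row_count + 1
            WHERE gene_ref = NEW.gene_ref;
        IF NOT FOUND THEN
            INSERT INTO gene_ref_counts (gene_ref, row_count)
                VALUES (NEW.gene_ref, 1);
        END IF;
        RETURN NEW;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE gene_ref_counts SET row_count = row_count - 1
            WHERE gene_ref = OLD.gene_ref;
        RETURN OLD;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_maintain_gene_ref_counts
    AFTER INSERT OR DELETE ON gene_prediction_view
    FOR EACH ROW EXECUTE PROCEDURE maintain_gene_ref_counts();

	Then "SELECT row_count FROM gene_ref_counts WHERE gene_ref = 523"  
replaces the slow count(*).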


	From your table definition gene_ref references another table. It would  
seem that you have many rows in gene_prediction_view with the same  
gene_ref value.


	- If you often query rows with the same gene_ref, consider using CLUSTER  
to physically group those rows on disk. This way you can get all rows with  
the same gene_ref in 1 seek instead of 2000. Clustered tables also make  
Bitmap scan happy.
	This one is good since it can also speed up other queries (not just the  
count).
	You could also cluster on (gene_ref,go_id) perhaps, I don't know what  
your columns mean. Only you can decide that because clustering order has  
to be meaningful (to group rows according to something that makes sense  
and not at random).


* Lose some weight :

CREATE INDEX ix_gene_prediction_view_gene_ref
 ON gene_prediction_view
 USING btree
 (gene_ref);

	- This index is useless since you have a UNIQUE constraint on (gene_ref, go_id),  
which is also an index.
	Remove the index on (gene_ref); it will leave space in the disk cache for  
other things.


	- Since (gene_ref, go_id) is UNIQUE NOT NULL, you might be able to use  
that as your primary key, but only if it is never updated of course. Saves  
another index.
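	Concretely, something like the following (only a sketch: it assumes the  
current primary key is on "id" and that you really can drop it, which is  
for you to verify):

DROP INDEX ix_gene_prediction_view_gene_ref;
-- and, if you decide to make (gene_ref, go_id) the primary key
-- (after dropping the current one on "id"):
-- ALTER TABLE gene_prediction_view ADD PRIMARY KEY (gene_ref, go_id);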


	- If you often do queries that fetch many rows, but seldom fetch the  
description, tell PG to always store the description in offline compressed  
form (read the docs on ALTER TABLE ... SET STORAGE ..., I forgot the  
syntax). Point being to make the main table smaller.
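	The syntax in question is roughly the following (go_description is just an  
example column here; the available modes are PLAIN, MAIN, EXTERNAL and  
EXTENDED, and which one fits your data is your call):

ALTER TABLE gene_prediction_view ALTER COLUMN go_description SET STORAGE EXTENDED;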


	- Also I see a category stored as VARCHAR. If you have a million different  
categories, that's OK, but if you have 100 categories for your 15M rows,  
put them in a separate table and replace the column with a category_id  
(normalize!)
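	A rough sketch of that normalization (all of the new names are  
illustrative only; note the UPDATE rewrites a lot of the table):

CREATE TABLE go_categories (
    category_id serial PRIMARY KEY,
    go_category character varying(50) NOT NULL UNIQUE
);

INSERT INTO go_categories (go_category)
    SELECT DISTINCT go_category FROM gene_prediction_view;

ALTER TABLE gene_prediction_view ADD COLUMN go_category_id integer;

UPDATE gene_prediction_view g
   SET go_category_id = c.category_id
  FROM go_categories c
 WHERE c.go_category = g.go_category;

-- once verified, drop the old varchar column:
-- ALTER TABLE gene_prediction_view DROP COLUMN go_category;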



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] varchar index joins not working?

2008-04-10 Thread Richard Huxton

Adam Gundy wrote:

I'm hitting an unexpected problem with postgres 8.3 - I have some
tables which use varchar(32) for their unique IDs which I'm attempting
to join using some simple SQL:

select *
from group_access, groups
where group_access.groupid = groups.groupid and
 group_access.uid = '7275359408f44591d0717e16890ce335';

there's a unique index on group_access.groupid, and a non-unique index
on groups.groupid. both are non-null.


What about group_access.uid - I'd have thought that + groups pkey is 
probably the sensible combination here.



the problem is: if groupid (in both tables) is varchar, I cannot force
postgres (no matter how hard I try) to do an index scan. it ends up
reading the entire groups table (pretty large!):


OK


QUERY PLAN
--
 Hash Join  (cost=8.89..41329.88 rows=119940 width=287) (actual
time=0.202..935.136 rows=981 loops=1)


That's because it's expecting 119,940 rows to match (rather than the 
actual 981 you do get). If you were getting that many results this is 
probably a sensible plan.



   Hash Cond: ((groups.groupid)::text = (group_access.groupid)::text)
   ->  Seq Scan on groups  (cost=0.00..31696.48 rows=1123348
width=177) (actual time=0.011..446.091 rows=1125239 loops=1)


It's got a good idea of the total number of rows in groups.


   ->  Hash  (cost=8.51..8.51 rows=30 width=110) (actual
time=0.148..0.148 rows=30 loops=1)
 ->  Seq Scan on group_access  (cost=0.00..8.51 rows=30
width=110) (actual time=0.014..0.126 rows=30 loops=1)


And also group_access. Oh, the seq-scan doesn't really matter here. It 
probably *is* faster to read all 30 rows in one burst rather than go to 
the index and then back to the table.



   Filter: ((uid)::text = '7275359408f44591d0717e16890ce335'::text)
 Total runtime: 935.443 ms
(7 rows)

if I disable seq_scan, I get this:

   QUERY PLAN

 Merge Join  (cost=1.47..106189.61 rows=120004 width=287) (actual
time=0.100..1532.353 rows=981 loops=1)


It's still thinking it's going to get 120 thousand rows.


it's running an index scan across the entire table (no condition applied) :-(

so, just for the hell of it, I tried making groupid a char(32),
despite repeated assertions in this group that there's no performance
difference between the two:


There's no performance difference between the two.


 Nested Loop  (cost=4.48..253.85 rows=304 width=291) (actual
time=0.715..22.906 rows=984 loops=1)



(this last plan is actually against a smaller test DB, but I get the
same behavior with it, seq scan for varchar or index scan for char,
and the results returned are identical for this query)


The char(32) thing isn't important here, what is important is that it's 
expecting ~300 rows rather than 120,000. It's still wrong, but it's 
close enough to make sense.


So - the question is - why is PG expecting so many matches to your join? 
How many distinct values do you have in groups.groupid and 
group_access.groupid?
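
You can also look at what the planner has collected directly; pg_stats is a 
standard system view, so something like:

SELECT tablename, attname, n_distinct, most_common_vals
  FROM pg_stats
 WHERE (tablename, attname) IN (('groups', 'groupid'),
                                ('group_access', 'groupid'));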


--
  Richard Huxton
  Archonet Ltd

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] large tables and simple "= constant" queries using indexes

2008-04-10 Thread Matthew

On Thu, 10 Apr 2008, PFC wrote:

... Lots of useful advice ...

	- If you often query rows with the same gene_ref, consider using 
CLUSTER to physically group those rows on disk. This way you can get all rows 
with the same gene_ref in 1 seek instead of 2000. Clustered tables also make 
Bitmap scan happy.


In my opinion this is the one that will make the most difference. You will 
need to run:


CLUSTER gene_prediction_view USING gene_prediction_view_gene_ref_key;

after you insert significant amounts of data into the table. This 
re-orders the table according to the index, but new data is always written 
out of order, so after adding lots more data the table will need to be 
re-clustered again.


- Switch to a RAID10 (4 times the IOs per second, however zero gain if 
you're single-threaded, but massive gain when concurrent)


Greg Stark has a patch in the pipeline that will change this, for bitmap 
index scans, by using fadvise(), so a single thread can utilise multiple 
discs in a RAID array.


Matthew

--
Prolog doesn't have enough parentheses. -- Computer Science Lecturer

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] large tables and simple "= constant" queries using indexes

2008-04-10 Thread John Beaver
Thanks a lot, all of you - this is excellent advice. With the data 
clustered and statistics at a more reasonable value of 100, it now 
reproducibly takes even less time - 20-57 ms per query.


After reading the section on "Statistics Used By the Planner" in the 
manual, I was a little concerned that, while the statistics sped up the 
queries that I tried immeasurably, that the most_common_vals array was 
where the speedup was happening, and that the values which wouldn't fit 
in this array wouldn't be sped up. Though I couldn't offhand find an 
example where this occurred, the clustering approach seems intuitively 
like a much more complete and scalable solution, at least for a 
read-only table like this.


As to whether the entire index/table was getting into ram between my 
statistics calls, I don't think this was the case. Here's the behavior 
that I found:
- With statistics at 10, the query took 25 (or so) seconds no matter how 
many times I tried different values. The query plan was the same as for 
the 200 and 800 statistics below.
- Trying the same constant a second time gave an instantaneous result, 
I'm guessing because of query/result caching.
- Immediately on increasing the statistics to 200, the query took a 
reproducibly smaller amount of time. I tried about 10 different values.
- Immediately on increasing the statistics to 800, the query 
reproducibly took less than a second every time. I tried about 30 
different values.
- Decreasing the statistics to 100 and running the cluster command 
brought it to 57 ms per query.
- The Activity Monitor (OSX) lists the relevant postgres process as 
taking a little less than 500 megs.
- I didn't try decreasing the statistics back to 10 before I ran the 
cluster command, so I can't show the search times going up because of 
that. But I tried killing the 500 meg process. The new process uses less 
than 5 megs of ram, and still reproducibly returns a result in less than 
60 ms. Again, this is with a statistics value of 100 and the data 
clustered by gene_prediction_view_gene_ref_key.


And I'll consider the idea of using triggers with an ancillary table for 
other purposes; seems like it could be a useful solution for something.


Matthew wrote:

On Thu, 10 Apr 2008, PFC wrote:

... Lots of useful advice ...

- If you often query rows with the same gene_ref, consider using 
CLUSTER to physically group those rows on disk. This way you can get 
all rows with the same gene_ref in 1 seek instead of 2000. Clustered 
tables also make Bitmap scan happy.


In my opinion this is the one that will make the most difference. You 
will need to run:


CLUSTER gene_prediction_view USING gene_prediction_view_gene_ref_key;

after you insert significant amounts of data into the table. This 
re-orders the table according to the index, but new data is always 
written out of order, so after adding lots more data the table will 
need to be re-clustered again.


- Switch to a RAID10 (4 times the IOs per second, however zero gain 
if you're single-threaded, but massive gain when concurrent)


Greg Stark has a patch in the pipeline that will change this, for 
bitmap index scans, by using fadvise(), so a single thread can utilise 
multiple discs in a RAID array.


Matthew



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] Creating large database of MD5 hash values

2008-04-10 Thread Jon Stewart
Hello,

I am creating a large database of MD5 hash values. I am a relative
newb with PostgreSQL (or any database for that matter). The schema and
operation will be quite simple -- only a few tables, probably no
stored procedures -- but I may easily end up with several hundred
million rows of hash values, possibly even getting into the billions. The
hash values will be organized into logical sets, with a many-many
relationship. I have some questions before I set out on this endeavor,
however, and would appreciate any and all feedback, including SWAGs,
WAGs, and outright lies. :-) I am trying to batch up operations as
much as possible, so I will largely be doing comparisons of whole
sets, with bulk COPY importing. I hope to avoid single hash value
lookup as much as possible.

1. Which datatype should I use to represent the hash value? UUIDs are
also 16 bytes...
2. Does it make sense to denormalize the hash set relationships?
3. Should I index?
4. What other data structure options would it make sense for me to choose?

Thanks in advance,


Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] varchar index joins not working?

2008-04-10 Thread Adam Gundy

Richard Huxton wrote:

Adam Gundy wrote:

I'm hitting an unexpected problem with postgres 8.3 - I have some
tables which use varchar(32) for their unique IDs which I'm attempting
to join using some simple SQL:

select *
from group_access, groups
where group_access.groupid = groups.groupid and
 group_access.uid = '7275359408f44591d0717e16890ce335';

there's a unique index on group_access.groupid, and a non-unique index
on groups.groupid. both are non-null.


What about group_access.uid - I'd have thought that + groups pkey is 
probably the sensible combination here.


that is an index on group_access:

"group_access_pkey" PRIMARY KEY, btree (groupid, uid)

adding the (uid, groupid) index helps the small database - it will do an 
index join if forced to - but the full database still refuses to do an 
index join: it does a full index scan followed by a merge.



QUERY PLAN
-- 


 Hash Join  (cost=8.89..41329.88 rows=119940 width=287) (actual
time=0.202..935.136 rows=981 loops=1)


That's because it's expecting 119,940 rows to match (rather than the 
actual 981 you do get). If you were getting that many results this is 
probably a sensible plan.


sure. but its estimate is *wildly* off


   Hash Cond: ((groups.groupid)::text = (group_access.groupid)::text)
   ->  Seq Scan on groups  (cost=0.00..31696.48 rows=1123348
width=177) (actual time=0.011..446.091 rows=1125239 loops=1)


It's got a good idea of the total number of rows in groups.


yeah.


   ->  Hash  (cost=8.51..8.51 rows=30 width=110) (actual
time=0.148..0.148 rows=30 loops=1)
 ->  Seq Scan on group_access  (cost=0.00..8.51 rows=30
width=110) (actual time=0.014..0.126 rows=30 loops=1)


And also group_access. Oh, the seq-scan doesn't really matter here. It 
probably *is* faster to read all 30 rows in one burst rather than go to 
the index and then back to the table.


agreed.

it's running an index scan across the entire table (no condition 
applied) :-(


so, just for the hell of it, I tried making groupid a char(32),
despite repeated assertions in this group that there's no performance
difference between the two:


There's no performance difference between the two.


hah. if it makes the join with char (and runs fast), or reads the whole 
table with varchar, then there *is* a performance difference - a big one!


The char(32) thing isn't important here, what is important is that it's 
expecting ~300 rows rather than 120,000. It's still wrong, but it's 
close enough to make sense.


So - the question is - why is PG expecting so many matches to your join. 


more to the point, why does it get the estimate right (or close) with 
char, but massively wrong with varchar? I've been vacuum analyzing after 
each change.


with the smaller database, and char type, it (for certain joins) still 
wants to do a seqscan because the tables are small enough, but if I 
disable seqscan, it does an index join (usually with a small time 
penalty). if I switch the types back to varchar, re-analyze, re-run, it 
*will not* do an index join!


How many distinct values do you have in groups.groupid and 
group_access.group_id?


for the small database (since it shows the same problem):

group_access: 280/268
groups: 2006/139

for the large database:

group_access: same
groups: 1712647/140

the groupid key is an MD5 hash, so it should be uniformly distributed. 
maybe that throws the stats? but, again, char works, varchar doesn't :-(




smime.p7s
Description: S/MIME Cryptographic Signature


Re: [PERFORM] large tables and simple "= constant" queries using indexes

2008-04-10 Thread Erik Jones


On Apr 10, 2008, at 9:44 AM, John Beaver wrote:
Thanks a lot, all of you - this is excellent advice. With the data  
clustered and statistics at a more reasonable value of 100, it now  
reproducibly takes even less time - 20-57 ms per query.


After reading the section on "Statistics Used By the Planner" in the  
manual, I was a little concerned that, while the statistics sped up  
the queries that I tried immeasurably, the most_common_vals  
array was where the speedup was happening, and that the values which  
wouldn't fit in this array wouldn't be sped up. Though I couldn't  
offhand find an example where this occurred, the clustering approach  
seems intuitively like a much more complete and scalable solution,  
at least for a read-only table like this.


As to whether the entire index/table was getting into ram between my  
statistics calls, I don't think this was the case. Here's the  
behavior that I found:
- With statistics at 10, the query took 25 (or so) seconds no matter  
how many times I tried different values. The query plan was the same  
as for the 200 and 800 statistics below.
- Trying the same constant a second time gave an instantaneous  
result, I'm guessing because of query/result caching.
- Immediately on increasing the statistics to 200, the query took a  
reproducibly smaller amount of time. I tried about 10 different values.
- Immediately on increasing the statistics to 800, the query  
reproducibly took less than a second every time. I tried about 30  
different values.
- Decreasing the statistics to 100 and running the cluster command  
brought it to 57 ms per query.
- The Activity Monitor (OSX) lists the relevant postgres process as  
taking a little less than 500 megs.
- I didn't try decreasing the statistics back to 10 before I ran the  
cluster command, so I can't show the search times going up because  
of that. But I tried killing the 500 meg process. The new process  
uses less than 5 megs of ram, and still reproducibly returns a  
result in less than 60 ms. Again, this is with a statistics value of  
100 and the data clustered by gene_prediction_view_gene_ref_key.


And I'll consider the idea of using triggers with an ancillary table  
for other purposes; seems like it could be a useful solution for  
something.


FWIW, killing the backend process responsible for the query won't  
necessarily clear the table's data from memory as that will be in the  
shared_buffers.  If you really want to flush the data from memory you  
need to read in data from other tables with a total size greater  
than your shared_buffers setting.


Erik Jones

DBA | Emma®
[EMAIL PROTECTED]
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com




--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance with temporary table

2008-04-10 Thread valgog

I see, I am having practically the same problem... utilizing the
partitioning idea
(http://www.postgresql.org/docs/8.3/interactive/ddl-partitioning.html)
via table inheritance.

I have prepared a post with some trigger and rule examples for you:
http://valgogtech.blogspot.com/2008/04/table-partitioning-automation-triggers.html
I hope you will find it useful, if you are not already doing this
yourself :-).

About the use of the temporary table, I would say that you could
actually try to add a special row status flag column (I use "char" for
such flags) to your partitioned tables to mark some rows as unused, and
then create some conditional (partial) indexes that consider this flag
for your data operations... This would hopefully make it possible for
you to avoid creating temporary tables.
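
A rough sketch of what I mean, using your lane data table (the flag column,
default value and index name are only examples; the WHERE clause of such a
conditional (partial) index must match what your queries filter on):

ALTER TABLE samantha.lane_data_I_495
    ADD COLUMN row_status "char" NOT NULL DEFAULT 'a';

CREATE INDEX idx_lane_data_i_495_active
    ON samantha.lane_data_I_495 (lane_id, measurement_start)
    WHERE row_status = 'a';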

With best regards,

-- Valentine

On Apr 10, 1:33 am, [EMAIL PROTECTED] ("samantha mahindrakar")
wrote:
> The partitions are used to separate the data according to months. I
> have run a query o find bad data from each such partition. The
> imputation algorithm that i use requires data from 10 previous weeks
> in order to impute the data. This historical data i store in a
> temporary table, the i query this data so that i can take a average of
> all the historical data. Before taking average some computations are
> performed. Since i need the historical data for every minute of data
> that i need to impute i have to store the data in some intermediate
> table. Hence the temporary table.
> Now i changed the code to use a permanent table that is truncated
> after one set of data is imputed.
> I hope this makes sense.
>
> Samantha
>
>
> On Wed, Apr 9, 2008 at 6:44 AM, valgog <[EMAIL PROTECTED]> wrote:
> > On Apr 7, 8:27 pm, [EMAIL PROTECTED] ("samantha mahindrakar")
> >  wrote:
>
> > > Hi
> >  > I have written a program that imputes(or rather corrects data) with in
> >  > my database.
> >  > Iam using a temporary table where in i put data from other partitoined
> >  > table. I then query this table to get the desired data.But the thing
> >  > is this temporary table has to be craeted for every record that i need
> >  > to correct and there are thousands of such records that need to be
> >  > corrected.
> >  > So the program necessarily creates a temporary table evrytime it has
> >  > to correct a record. However this table is dropeed after each record
> >  > is corrected.
> >  > The program works fine, but it runs for a very long time, or it
> >  > runs for days.
> >  > Iam particularly finding that it takes more time during this statement:
>
> >  > NOTICE:  theQuery in createtablevolumelaneshist CREATE TEMPORARY TABLE
> >  > predictiontable(lane_id, measurement_start, speed,volume,occupancy) AS
> >  > SELECT lane_id, measurement_start, speed,volume,occupancy
> >  > FROM samantha.lane_data_I_495 WHERE
> >  > lane_id IN (1317) AND
> >  > measurement_start BETWEEN '2007-11-18 09:25:00' AND '2007-11-19 01:39:06'
>
> >  > Iam not sure if i can use a cursor to replicate the functionality of
> >  > the temp table. Is the performance bad because of the creation and
> >  > deletion of the temp table?
>
> >  > Thanks
> >  > Samantha
>
> >  > --
> >  > Sent via pgsql-performance mailing list ([EMAIL PROTECTED])
> >  > To make changes to your 
> > subscription:http://www.postgresql.org/mailpref/pgsql-performance
>
> >  And why do you copy data from the partition tables? Did you try to
> >  manipulate data directly in the needed tables? Or you are aggregating
> >  some of the data there? How the partitioning is actually designed? Do
> >  you use table inheritance?
>
> >  -- Valentine
>


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] large tables and simple "= constant" queries using indexes

2008-04-10 Thread John Beaver




Thanks Erik and Gaetano - interesting, and both examples of my
naivete. :)

I tried running large select(*) queries on other tables followed by
another try at the offending query, and it was still fast. Just to be
absolutely sure this is a scalable solution, I'll try restarting my
computer in a few hours to see if it affects anything cache-wise.


Gaetano Mendola wrote:

  John Beaver wrote:

- Trying the same constant a second time gave an instantaneous result,
I'm guessing because of query/result caching.

AFAIK no query/result caching is in place in postgres; what you are experiencing
is OS disk/memory caching.


Regards
Gaetano Mendola


  





Re: [PERFORM] large tables and simple "= constant" queries using indexes

2008-04-10 Thread Bill Moran
In response to John Beaver <[EMAIL PROTECTED]>:

> Thanks Erik and Gaetano - interesting, and both examples of my naivete. :)
> 
> I tried running large select(*) queries on other tables followed by another 
> try at the offending query, and it was still fast. Just to be absolutely sure 
> this is a scalable solution, I'll try restarting my computer in a few hours 
> to see if it affects anything cache-wise.

I say this over and over again ... because I think it's really cool and
useful.

If you install the pg_buffercache addon, you can actually look into
PostgreSQL's internals and see what tables are in the buffer in real
time.  If you're having trouble, it can (potentially) be a helpful
tool.
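
For example, the simple query from the pg_buffercache documentation gives a
quick per-relation view of the cache (a sketch; run it in the database you
care about):

SELECT c.relname, count(*) AS buffers
  FROM pg_class c
  JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
  JOIN pg_database d ON b.reldatabase = d.oid
                    AND d.datname = current_database()
 GROUP BY c.relname
 ORDER BY 2 DESC
 LIMIT 10;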

-- 
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

[EMAIL PROTECTED]
Phone: 412-422-3463x4023

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] large tables and simple "= constant" queries using indexes

2008-04-10 Thread Greg Smith

On Thu, 10 Apr 2008, Bill Moran wrote:


If you install the pg_buffercache addon, you can actually look into
PostgreSQL's internals and see what tables are in the buffer in real
time.


The "Inside the PostgreSQL Buffer Cache" talk I did at the recent East 
conference is now on-line at 
http://www.westnet.com/~gsmith/content/postgresql/


The slides explain how that information gets updated and used internally, 
and the separate "sample queries" file there shows some more complicated 
views I've written against pg_buffercache.  Here's a sample one:


 relname       | buffered | buffers % | % of rel
 accounts      | 306 MB   |      65.3 |     24.7
 accounts_pkey | 160 MB   |      34.1 |     93.2

This shows that 65.3% of the buffer cache is filled with the accounts 
table, which is caching 24.7% of the full table.  These are labeled 
"relations" because there's a mix of table and index data there. 
accounts_pkey is an index for example, which is why almost all of it is 
staying inside the buffer cache.


The queries that use usage_count only work against 8.3; the one above 
should work on older versions as well.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] varchar index joins not working?

2008-04-10 Thread Adam Gundy
On Thu, Apr 10, 2008 at 8:52 AM, Adam Gundy <[EMAIL PROTECTED]> wrote:
> Richard Huxton wrote:
> > How many distinct values do you have in groups.groupid and
> group_access.group_id?
> >
>
>  for the small database (since it shows the same problem):
>
>  group_access: 280/268
>  groups: 2006/139
>
>  for the large database:
>
>  group_access: same
>  groups: 1712647/140
>
>  the groupid key is an MD5 hash, so it should be uniformly distributed.
> maybe that throws the stats? but, again, char works, varchar doesn't :-(

OK, I'm thinking the varchar/char part is not the issue.

the database is very unbalanced, most of the groups are 1000 or less
records, with one group occupying 95% of the records.

I *think* that when I analyze using char instead of varchar, it is
recording a stat for the large group, but for some reason with varchar
it doesn't add a stat for that one.

so, the real question is, how do I fix this? I can turn the stats way
up to 1000, but that doesn't guarantee that I'll get a stat for the
large group :-(
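
(for reference, this is the knob I mean - a sketch only:)

ALTER TABLE groups ALTER groupid SET STATISTICS 1000;
ANALYZE groups;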

can I turn the statistics off completely for this column? I'm guessing
that if I can, that will mean it takes a guess based on the number of
distinct values in the groups table, which is still a large number of
records, possibly enough to trigger the seqscan anyway.

does postgres have a way of building a 'counted index' that the
planner can use for its record counts? some way of forcibly
maintaining a stat for every group?

the groups are not related to one another - is it possible to
partition them into their own indexes somehow?

ahh. lots of questions, no (obvious to me) answers from googling around.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] Partitioned tables - planner wont use indexes

2008-04-10 Thread kevin kempter

Hi List;

I'm having some performance issues with a partitioned table. We have a  
VERY large table that we've partitioned by day.


Currently we have 17 partitions - each partition table contains >  
700 million rows.
One of the things we need to query is the min date from the master  
table - we may explore alternatives for this particular query, however  
even if we fix this query I think we have a fundamental issue with the  
use of indexes (actually the non-use) by the planner.


Below is a sample of the DDL used to create our tables and an explain  
showing that the planner wants to do a sequential scan on each  
partition. We do have "constraint_exclusion = on" set in the  
postgresql.conf file.


I tried removing the index from the part_master table and got the same  
result


Likewise the costs associated with the seq scans seem to be way off  
(yes I've run analyze on the master and all partition tables) - I ran  
the actual SQL statement below and killed it after about 15min.


Thanks in advance for any help, advice, etc...




Tables:

--
-- Master Table
--
CREATE TABLE part_master (
filename character varying(100),
logdate date,
... -- about 50 more columns go here
loghour date,
url character varying(500),
customer character varying(500)
);
CREATE INDEX master_logdate ON part_master USING btree (logdate);

--
-- Partitions:
--

--
-- part_20080319
--
CREATE TABLE part_20080319 (CONSTRAINT part_20080319_logdate_check
CHECK ((logdate = '2008-03-19'::date))
)
INHERITS (part_master);


CREATE INDEX idx_part_20080319_customer ON part_20080319 USING btree  
(customer);
CREATE INDEX idx_part_20080319_logdate ON part_20080319 USING btree  
(logdate);
CREATE INDEX idx_part_20080319_loghour ON part_20080319 USING btree  
(loghour);



--
-- part_20080320
--
CREATE TABLE part_20080320 (CONSTRAINT part_20080320_logdate_check
CHECK ((logdate = '2008-03-20'::date))
)
INHERITS (part_master);


CREATE INDEX idx_part_20080320_customer ON part_20080320 USING btree  
(customer);
CREATE INDEX idx_part_20080320_logdate ON part_20080320 USING btree  
(logdate);
CREATE INDEX idx_part_20080320_loghour ON part_20080320 USING btree  
(loghour);



-- And so on, thru part_20080404



--
-- explain plan
--

myDB=# explain SELECT min(logdate) FROM part_master;
   QUERY PLAN
-
 Aggregate  (cost=117070810.10..117070810.11 rows=1 width=4)
   ->  Append  (cost=0.00..114866502.48 rows=881723048 width=4)
 ->  Seq Scan on part_master  (cost=0.00..85596244.18  
rows=679385718 width=4)
 ->  Seq Scan on part_20080319 part  (cost=0.00..212860.86  
rows=1674986 width=4)
 ->  Seq Scan on part_20080320 part  (cost=0.00..1753802.51  
rows=13782951 width=4)
 ->  Seq Scan on part_20080321 part  (cost=0.00..2061636.83  
rows=15881283 width=4)
 ->  Seq Scan on part_20080322 part  (cost=0.00..1965144.71  
rows=14936971 width=4)
 ->  Seq Scan on part_20080323 part  (cost=0.00..1614413.18  
rows=12345618 width=4)
 ->  Seq Scan on part_20080324 part  (cost=0.00..1926520.22  
rows=14741022 width=4)
 ->  Seq Scan on part_20080325 part  (cost=0.00..2356704.22  
rows=18477622 width=4)
 ->  Seq Scan on part_20080326 part  (cost=0.00..1889267.71  
rows=14512171 width=4)
 ->  Seq Scan on part_20080327 part  (cost=0.00..1622100.34  
rows=12445034 width=4)
 ->  Seq Scan on part_20080328 part  (cost=0.00..1711779.49  
rows=12885749 width=4)
 ->  Seq Scan on part_20080329 part  (cost=0.00..1568192.94  
rows=11958394 width=4)
 ->  Seq Scan on part_20080330 part  (cost=0.00..1521204.64  
rows=11676564 width=4)
 ->  Seq Scan on part_20080331 part  (cost=0.00..1587138.77  
rows=12180377 width=4)
 ->  Seq Scan on part_20080401 part  (cost=0.00..2324352.82  
rows=18211382 width=4)
 ->  Seq Scan on part_20080402 part  (cost=0.00..2891295.04  
rows=6693804 width=4)
 ->  Seq Scan on part_20080403 part  (cost=0.00..1707327.48  
rows=5748348 width=4)
 ->  Seq Scan on part_20080404 part  (cost=0.00..556516.54  
rows=4185054 width=4)

(20 rows)



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] recommendations for web/db connection pooling or DBD::Gofer reviews

2008-04-10 Thread Mark Stosberg


Under heavy load, Apache has the usual failure mode of spawning so 
many threads/processes and database connections that it just exhausts 
all the memory on the webserver and also kills the database.
As usual, I would use lighttpd as a frontend (also serving static 
files) to handle the large number of concurrent connections to clients, 
and then have it funnel this to a reasonable number of perl backends, 
something like 10-30. I don't know if fastcgi works with perl, but with 
PHP it certainly works very well. If you can't use fastcgi, use lighttpd 
as a HTTP proxy and apache with mod_perl behind.
Recipe for good handling of heavy load is using an asynchronous 
server (which by design can handle any number of concurrent connections 
up to the OS' limit) in front of a small number of dynamic webpage 
generating threads/processes.


Thanks for the response.

To be clear, it sounds like you are advocating solving the problem of 
scaling the number of connections with a different approach: limiting 
the number of web server processes.


So, the front-end proxy would have a number of max connections, say 200, 
 and it would connect to another httpd/mod_perl server behind with a 
lower number of connections, say 20. If the backend httpd server was 
busy, the proxy connection to it would just wait in a queue until it was 
available.


Is that the kind of design you had in mind?

That seems like a reasonable option as well. We already have some 
lightweight Apache servers in use on the project which currently just 
serve static content.


   Mark


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] large tables and simple "= constant" queries using indexes

2008-04-10 Thread PFC


Thanks a lot, all of you - this is excellent advice. With the data  
clustered and statistics at a more reasonable value of 100, it now  
reproducibly takes even less time - 20-57 ms per query.


1000x speedup with proper tuning - always impressive, lol.
IO seeks are always your worst enemy.

After reading the section on "Statistics Used By the Planner" in the  
manual, I was a little concerned that, while the statistics sped up the  
queries that I tried immeasurably, that the most_common_vals array was  
where the speedup was happening, and that the values which wouldn't fit  
in this array wouldn't be sped up. Though I couldn't offhand find an  
example where this occurred, the clustering approach seems intuitively  
like a much more complete and scalable solution, at least for a  
read-only table like this.


	Actually, with statistics set to 100, up to 100 values will be stored in  
most_common_vals. This would mean that the values not in most_common_vals  
will have less than 1% frequency, and probably much less than that. The  
choice of plan for these rare values is pretty simple.
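	You can check what actually got stored for that column with something  
like:

SELECT most_common_vals, most_common_freqs
  FROM pg_stats
 WHERE tablename = 'gene_prediction_view'
   AND attname = 'gene_ref';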


	With two columns, "interesting" stuff can happen, like if you have col1  
in [1...10] and col2 in [1...10] and use a condition on col1=const and  
col2=const, the selectivity of the result depends not only on the  
distribution of col1 and col2 but also their correlation.


	As for the tests you did, it's hard to say without seeing the explain  
analyze outputs. If you change the stats and the plan choice (EXPLAIN)  
stays the same, and you use the same values in your query, any difference  
in timing comes from caching, since postgres is executing the same plan  
and therefore doing the exact same thing. Caching (from PG and from the  
OS) can make the timings vary a lot.


- Trying the same constant a second time gave an instantaneous result,  
I'm guessing because of query/result caching.


	PG does not cache queries or results. It caches data & index pages in its  
shared buffers, and then the OS adds another layer of the usual disk cache.
	A simple query like selecting one row based on PK takes about 60  
microseconds of CPU time, but if it needs one seek for the index and one  
for the data it may take 20 ms waiting for the moving parts to move...  
Hence, CLUSTER is a very useful tool.


	Bitmap index scans love clustered tables because all the interesting rows  
end up being grouped together, so much less pages need to be visited.


- I didn't try decreasing the statistics back to 10 before I ran the  
cluster command, so I can't show the search times going up because of  
that. But I tried killing the 500 meg process. The new process uses less  
than 5 megs of ram, and still reproducibly returns a result in less than  
60 ms. Again, this is with a statistics value of 100 and the data  
clustered by gene_prediction_view_gene_ref_key.


Killing it or just restarting postgres ?
	If you let postgres run (not idle) for a while, naturally it will fill  
the RAM up to the shared_buffers setting that you specified in the  
configuration file. This is good, since grabbing data from postgres' own  
cache is faster than having to make a syscall to the OS to get it from the  
OS disk cache (or disk). This isn't bloat.
	But what those 500 MB versus 6 MB show is that before, postgres had to  
read a lot of data for your query, so it stayed in the cache ; after  
tuning it needs to read much less data (thanks to CLUSTER) so the cache  
stays empty.



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] large tables and simple "= constant" queries using indexes

2008-04-10 Thread Gaetano Mendola
John Beaver wrote:

> - Trying the same constant a second time gave an instantaneous result,
> I'm guessing because of query/result caching.

AFAIK no query/result caching is in place in postgres; what you are experiencing
is OS disk/memory caching.


Regards
Gaetano Mendola

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] Performance is low Postgres+Solaris

2008-04-10 Thread MUNAGALA REDDY
Hi All,

We are using Solaris 10 x86/AMD Opteron and postgresql
8.2 on a SunFire X2100; however, performance is very
slow compared to Debian Linux on the same platform.
Is this normal?

Thanks & Regards
Mahi


  


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Looking for bottleneck during load test

2008-04-10 Thread Ivan Voras
Hell, Robert wrote:
> I tried different other tools for random IO (including a self written one 
> which does random lseek and read).
> 
> This tool, started during one of our tests, achieves 2 iops (8k each).
> Started alone I get something about 1,500 iops with an avg latency of 100 ms.

1500 iops looks about right for 4x2 RAID 10 volume. What's your worst
latency (as reported by the tool)? iowait is mostly seek time.

> We are using SAN (EMC CLARiiON CX 300) - are those ~7 MB/s really our 
> bottleneck?

Depending on your access pattern to the database, it could be (if you
have lots of random IO, and 180 concurrent database threads can make any
IO random enough). Are your queries read-mostly or a mix?

> Any other tuning ideas?

Only generic ones:

- Are your queries optimized, use indexes, etc.?
- Try PostgreSQL 8.3 - if you have sequential scans it can in theory
make better use of data shared between connections.
- Do you have enough memory dedicated to data caches, both in PostgreSQL
and in the OS? (i.e. what is your shared_buffers setting?)
- If the SAN can configure parameters such as prefetch (pre-read) and
stripe size, try lowering them (should help if you have random IO).



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Views and functions returning sets of records

2008-04-10 Thread Rajashri Tupe
Can we write a query retrieving only 10 records out of 4000 records?
Please tell me as soon as possible.

On Mar 23, 8:28 pm, [EMAIL PROTECTED] (Tom Lane) wrote:
> Giorgio Valoti <[EMAIL PROTECTED]> writes:
> > Are there any way to pass some hints to the planner? For example,
> > could the IMMUTABLE/STABLE/VOLATILE modifiers be of some help?
>
> Those don't really do anything for set-returning functions at the
> moment.
>
> As of 8.3 there is a ROWS attribute for SRFs that can help with one
> of the worst problems, namely that the planner has no idea how many
> rows a SRF might return.  It's simplistic (just an integer constant
> estimate) but better than no control at all.
>
> As of CVS HEAD (8.4 to be) there's a capability in the planner to
> "inline" SRFs that are single SELECTs in SQL language, which should
> pretty much eliminate the performance differential against a comparable
> view.  Unfortunately 8.4 release is at least a year away, but just
> so you know.  (I suppose if you were desperate enough to run a privately
> modified copy, that patch should drop into 8.3 easily enough.)  IIRC
> the restrictions for this to happen are
> * single SELECT
> * function declared to return set
> * function NOT declared strict or volatile
> * function NOT declared SECURITY DEFINER or given any
>   local parameter settings
> The latter restrictions are needed so that inlining doesn't change
> the semantics.
>
> regards, tom lane
>
> -
> Sent via pgsql-performance mailing list ([EMAIL PROTECTED])
> To make changes to your 
> subscription:http://www.postgresql.org/mailpref/pgsql-performance
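
For reference, the 8.3 ROWS attribute mentioned above looks roughly like
this (the function, table and column names are hypothetical):

CREATE FUNCTION recent_items(p_limit integer)
RETURNS SETOF items AS $$
    SELECT * FROM items ORDER BY created_at DESC LIMIT $1
$$ LANGUAGE sql STABLE
ROWS 10;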


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] shared_buffers in 8.2.x

2008-04-10 Thread Gaetano Mendola
Hi all,
if I specify shared_buffers = 26800 in 8.2.x, will this value be interpreted like
in the 8.1.x series, i.e. 26800*8192 bytes = 209 MB, or as 26800 bytes
(since no memory unit is specified)?

Regards
Gaetano Mendola

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] shared_buffers in 8.2.x

2008-04-10 Thread Erik Jones

On Apr 10, 2008, at 7:39 AM, Gaetano Mendola wrote:

Hi all,
if I specify shared_buffers = 26800 in 8.2.x, will this value be  
interpreted like
in the 8.1.x series, i.e. 26800*8192 bytes = 209 MB, or as 26800 bytes
(since no memory unit is specified)?


With no unit specified, it defaults to 8K blocks.
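
A quick way to double-check on the server itself (a sketch):

SHOW block_size;       -- the page size, normally 8192
SHOW shared_buffers;   -- the effective setting, in those pages unless a unit is shown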

Erik Jones

DBA | Emma®
[EMAIL PROTECTED]
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com




--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance is low Postgres+Solaris

2008-04-10 Thread Reid Thompson

Reid Thompson wrote:

MUNAGALA REDDY wrote:

Hi All,

We are using solaris 10 x86/AMD Opteron and postgresql
8.2 on SunFire X2100 , however performance is very
slow in contrast to linux debian in the same platform.
Is this normal?

Thanks & Regards
Mahi


http://www.google.com/search?q=postgresql+solaris+tuning&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a 




http://wikis.sun.com/display/DBonSolaris/PostgreSQL

http://www.sun.com/bigadmin/features/articles/postgresql_opensolaris.jsp

http://www.google.fr/search?q=site%3Asun.com+postgresql+tuning&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Performance is low Postgres+Solaris

2008-04-10 Thread Reid Thompson

MUNAGALA REDDY wrote:

Hi All,

We are using solaris 10 x86/AMD Opteron and postgresql
8.2 on SunFire X2100 , however performance is very
slow in contrast to linux debian in the same platform.
Is this normal?

Thanks & Regards
Mahi


http://www.google.com/search?q=postgresql+solaris+tuning&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance