Re: [PERFORM] experiments in query optimization

2010-04-01 Thread Faheem Mitha



On Wed, 31 Mar 2010, Faheem Mitha wrote:


On Tue, 30 Mar 2010, Robert Haas wrote:


On Tue, Mar 30, 2010 at 12:30 PM, Faheem Mitha fah...@email.unc.edu



You might need to create some indices, too.


Ok. To what purpose? This query picks up everything from the
tables and the planner does table scans, so conventional wisdom
and indeed my experience, says that indexes are not going to be so
useful.


There are situations where scanning the entire table to build up a
hash table is more expensive than using an index.  Why not test it?


Certainly, but I don't know what you and Robert have in mind, and I'm not
experienced enough to make an educated guess. I'm open to specific
suggestions.


Try creating an index on geno on the columns that are being used for the 
join.


Ok, I'll try that. I guess the cols in question on geno are idlink_id and 
anno_id. I thought that I already had indexes on them, but no. Maybe I had 
indexes, but removed them.


Looking at this more closely, idlink_id and anno_id are primary keys, so they 
already have indexes on them, and my understanding (from the docs) is that 
there is no purpose in creating additional ones. That's why I removed the 
indexes that were there (back last August, actually, according to my logs). 
Anyway, it doesn't look like there is anything I can do here. Does anyone have 
additions or corrections to this?


  Regards, Faheem.


Re: [PERFORM] experiments in query optimization

2010-04-01 Thread Faheem Mitha


Hi Eliot,

Thanks for the comment.

On Thu, 1 Apr 2010, Eliot Gable wrote:


On Thu, Apr 1, 2010 at 7:46 AM, Faheem Mitha fah...@email.unc.edu wrote:


Looking at this more closely, idlink_id and anno_id are primary keys, so they 
already have indexes on them, and my understanding (from the docs) is that 
there is no purpose in creating additional ones. That's why I removed the 
indexes that were there (back last August, actually, according to my logs). 
Anyway, it doesn't look like there is anything I can do here. Does anyone have 
additions or corrections to this?


When you do a join, you typically have a foreign key in one table 
referencing a primary key in another table. While designating a foreign 
key does put a constraint on the key to ensure referential integrity, it 
does not put an index on the column that is being designated as a 
foreign key. If I understand correctly, the scan done as the inner loop 
of the nested loop scan for the join is going to be your foreign key 
column, not your primary key column. Thus, if you have no index on the 
foreign key column, you will be forced to do a sequential table scan to 
do the join. In that case the hash-based join will almost certainly be 
faster (especially for such a large number of rows). If you put an index 
on the foreign key, then the inner scan can be an index scan and that 
might turn out to be faster than building the hash indexes on all the 
table rows.
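
Concretely, something along these lines (just a sketch, untested here; the 
index names are made up, and the table and column names are the ones you 
mentioned):

-- Hypothetical single-column indexes on geno's foreign key columns.
-- Only worth keeping if EXPLAIN ANALYZE shows the planner actually uses them.
CREATE INDEX geno_idlink_id_idx ON geno (idlink_id);
CREATE INDEX geno_anno_id_idx   ON geno (anno_id);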



Somebody can correct me if I'm wrong.


I had set the foreign keys in question (on the geno table) to be primary 
keys. This is because this setup is basically a glorified spreadsheet, and 
I don't want more than one cell corresponding to a particular tuple of 
idlink.id and anno.id (the conceptual rows and cols). Since a primary key 
defines an index, I thought putting indexes on idlink_id and anno_id was 
redundant. However, it looks like (unsurprisingly) the index corresponding 
to the primary key is across both columns, which may not be what is wanted 
for the aforesaid join, i.e.


ALTER TABLE ONLY geno ADD CONSTRAINT geno_pkey PRIMARY KEY (idlink_id, anno_id)

(As a side comment, with respect to the indexes on the other side of the 
joins, in one case, we have idlink.id = geno.idlink_id, and idlink.id is a 
primary key too. In the other, namely geno.anno_id = 
dedup_patient_anno.id, dedup_patient_anno is a CTE, so no index on 
dedup_patient_anno.id. But maybe indexes aren't needed there.)
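
If an index on that side ever did turn out to matter, one workaround I can 
think of (purely a sketch, I haven't tried it) would be to materialize the CTE 
as a temporary table and index that, using the dedup_patient_anno body from 
the PED query:

CREATE TEMP TABLE dedup_patient_anno AS
SELECT *
FROM  (SELECT *,
              row_number() OVER (PARTITION BY anno.rsid ORDER BY anno.id)
       FROM  anno
       INNER JOIN geno ON anno.id = geno.anno_id
       WHERE idlink_id = (SELECT MIN(id) FROM idlink)
      ) AS s                       -- assumes anno and geno have no clashing column names
WHERE row_number = 1;

CREATE INDEX dedup_patient_anno_id_idx ON dedup_patient_anno (id);
ANALYZE dedup_patient_anno;        -- give the planner statistics for the new table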


Here is the join

   SELECT decode_genotype(geno.snpval_id, %(allelea)s, %(alleleb)s) AS g,
          geno.idlink_id, geno.anno_id
   FROM   geno
   INNER JOIN dedup_patient_anno
   ON     geno.anno_id = dedup_patient_anno.id
   INNER JOIN idlink
   ON     geno.idlink_id = idlink.id
   ORDER BY idlink_id, anno_id

Here is the table dump.


-- Name: geno; Type: TABLE; Schema: hapmap; Owner: snp; Tablespace:
--
CREATE TABLE geno (
idlink_id integer NOT NULL,
anno_id integer NOT NULL,
snpval_id integer NOT NULL
)
WITH (autovacuum_enabled=true);

ALTER TABLE hapmap.geno OWNER TO snp;
--
-- Name: geno_pkey; Type: CONSTRAINT; Schema: hapmap; Owner: snp; Tablespace:
--
ALTER TABLE ONLY geno
    ADD CONSTRAINT geno_pkey PRIMARY KEY (idlink_id, anno_id);

--
-- Name: geno_anno_id_fkey; Type: FK CONSTRAINT; Schema: hapmap; Owner: snp
--
ALTER TABLE ONLY geno
    ADD CONSTRAINT geno_anno_id_fkey FOREIGN KEY (anno_id) REFERENCES anno(id)
    ON UPDATE CASCADE ON DELETE CASCADE;

--
-- Name: geno_idlink_id_fkey; Type: FK CONSTRAINT; Schema: hapmap; Owner: snp
--
ALTER TABLE ONLY geno
    ADD CONSTRAINT geno_idlink_id_fkey FOREIGN KEY (idlink_id) REFERENCES idlink(id)
    ON UPDATE CASCADE ON DELETE CASCADE;

--
-- Name: geno_snpval_id_fkey; Type: FK CONSTRAINT; Schema: hapmap; Owner: snp
--
ALTER TABLE ONLY geno
    ADD CONSTRAINT geno_snpval_id_fkey FOREIGN KEY (snpval_id) REFERENCES snpval(val)
    ON UPDATE CASCADE ON DELETE CASCADE;

*

So, should I add indexes on the individual foreign key cols idlink_id
and anno_id after all?

  Regards, Faheem.


--
Eliot Gable

We do not inherit the Earth from our ancestors: we borrow it from our 
children. ~David Brower


I decided the words were too conservative for me. We're not borrowing 
from our children, we're stealing from them--and it's not even 
considered to be a crime. ~David Brower


Nice quotes.

Esse oportet ut vivas, non vivere ut edas. (Thou shouldst eat to live; 
not live to eat.) ~Marcus Tullius Cicero





Re: [PERFORM] experiments in query optimization

2010-04-01 Thread Faheem Mitha



On Thu, 1 Apr 2010, Robert Haas wrote:


On Thu, Apr 1, 2010 at 2:15 PM, Faheem Mitha fah...@email.unc.edu wrote:


I had set the foreign keys in question (on the geno table) to be 
primary keys. This is because this setup is basically a glorified 
spreadsheet, and I don't want more than one cell corresponding to a 
particular tuple of idlink.id and anno.id (the conceptual rows and 
cols). Since a primary key defines an index, I thought putting indexes 
on idlink_id and anno_id was redundant. However, it looks like 
(unsurprisingly) the index corresponding to the primary key is across 
both columns, which may not be what is wanted for the aforesaid join



Actually it is what is wanted - that is good.


I see.


So, should I add indexes on the individual foreign key cols idlink_id
and anno_id after all?


I doubt that would help.


You're sure of this?

The bottom line may be that you're dealing with hundreds of millions of 
rows here, so things are going to take a long time.  Of course you can 
always get more/faster memory, a bigger I/O subsystem, faster 
processors...  and it could be that with detailed study there are 
optimizations that could be done even without spending money, but I 
think I'm about tapped out on what I can do over an Internet mailing 
list.


Thanks for your assistance, Robert. It's been educational.

  Regards, Faheem.



Re: [PERFORM] experiments in query optimization

2010-04-01 Thread Faheem Mitha



On Thu, 1 Apr 2010, Eliot Gable wrote:




On Thu, Apr 1, 2010 at 3:01 PM, Faheem Mitha fah...@email.unc.edu wrote:

  So, should I add indexes on the individual foreign key cols 
idlink_id
  and anno_id after all?


I doubt that would help.


You're sure of this?


It is always best to test and be certain.


Fair enough. I may also try disabling hash joins and see what happens...
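
Something like this, I suppose, session-local only and with a cut-down version 
of the join just to see what plan shape comes out:

BEGIN;
SET LOCAL enable_hashjoin = off;   -- per-transaction experiment, not a permanent setting
EXPLAIN ANALYZE                    -- note: EXPLAIN ANALYZE actually runs the query
SELECT geno.idlink_id, geno.anno_id, geno.snpval_id
FROM   geno
INNER JOIN idlink ON geno.idlink_id = idlink.id;
ROLLBACK;                          -- SET LOCAL disappears with the transaction anyway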

  Regards, Faheem.



Re: [PERFORM] experiments in query optimization

2010-03-31 Thread Faheem Mitha



On Wed, 31 Mar 2010, Matthew Wakeling wrote:


On Tue, 30 Mar 2010, Faheem Mitha wrote:

work_mem = 1 GB (see diag.{tex/pdf}).


Sure, but define sane setting, please. I guess part of the point is that 
I'm trying to keep memory low


You're trying to keep memory usage low, but you have work_mem set to 1GB?


I'm trying to keep both runtime and memory usage low. I assume that with 
lower levels of memory, the runtime would be longer, other things being 
equal.

 Regards, Faheem.



Re: [PERFORM] experiments in query optimization

2010-03-30 Thread Faheem Mitha



On Tue, 30 Mar 2010, Kevin Grittner wrote:


Faheem Mitha fah...@email.unc.edu wrote:


If you're concerned about memory usage, try reducing work_mem;
you've probably got it set to something huge.


work_mem = 1 GB (see diag.{tex/pdf}).

The point isn't that I'm using so much memory. Again, my question
is, why are these changes affecting memory usage so drastically?


Because the planner looks at a very wide variety of plans, some of
which may use many allocations of work_mem size, and some of which
don't.  The costs are compared and the lowest cost one is chosen. If
you are close to the tipping point then even a very small change
might affect which is chosen.  It pays to keep the work_mem setting
sane so that unexpected plan changes don't cause problems.


Sure, but define sane setting, please. I guess part of the point is that 
I'm trying to keep memory low, and it seems this is not part of the 
planner's priorities. That is, it does not take memory usage into 
consideration when choosing a plan. If that is wrong, let me know, but 
that is my understanding.



Look at the plans and their costs to get a feel for what's being
chosen and why.  Although it's a very bad idea to use these in
production, you can often shift the plan to something you *think*
would be better using the enable_* settings, to see what the planner
thinks such a plan will cost and where it thinks the cost would be;
that can help in tuning the settings.


Right. You mean to close off certain options to the planner using 'Planner 
Method Configuration'. I suppose one can also use 'Planner Cost Constants' 
to alter plan behaviour. I haven't tried changing these.
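
Presumably something along these lines (session-local, with arbitrary values 
purely for illustration, and a cut-down query standing in for the real one):

SET random_page_cost = 2.0;        -- Planner Cost Constants; the default is 4.0
SET effective_cache_size = '8GB';  -- how much caching the planner is allowed to assume
EXPLAIN
SELECT geno.idlink_id, geno.anno_id
FROM   geno
INNER JOIN idlink ON geno.idlink_id = idlink.id;
RESET random_page_cost;
RESET effective_cache_size;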



You might need to create some indices, too.


Ok. To what purpose? This query picks up everything from the
tables and the planner does table scans, so conventional wisdom
and indeed my experience, says that indexes are not going to be so
useful.


There are situations where scanning the entire table to build up a
hash table is more expensive than using an index.  Why not test it?


Certainly, but I don't know what you and Robert have in mind, and I'm not 
experienced enough to make an educated guess. I'm open to specific 
suggestions.


 Regards, Faheem.



Re: [PERFORM] experiments in query optimization

2010-03-30 Thread Faheem Mitha


One thing which I haven't really mentioned in this thread or in my writeup is 
that the planner's value for the number of rows in geno is way off base some 
of the time. The actual number is around 800 million, but the planner thinks 
it is 100 million. I don't know if this is significant or not, or what to do 
about it.


eg. in the ped_bigjoin EXPLAIN ANALYZE VERBOSE:

  ->  Sort  (cost=56855882.72..57144683.54 rows=115520330 width=42)
            (actual time=23027732.092..37113627.380 rows=823086774 loops=1)
        Output: (CASE WHEN (hapmap.geno.snpval_id = (-1)) THEN '0 0'::text
                      WHEN (hapmap.geno.snpval_id = 0) THEN (((dedup_patient_anno.allelea_id)::text || ' '::text) || (dedup_patient_anno.allelea_id)::text)
                      WHEN (hapmap.geno.snpval_id = 1) THEN (((dedup_patient_anno.allelea_id)::text || ' '::text) || (dedup_patient_anno.alleleb_id)::text)
                      WHEN (hapmap.geno.snpval_id = 2) THEN (((dedup_patient_anno.alleleb_id)::text || ' '::text) || (dedup_patient_anno.alleleb_id)::text)
                      ELSE NULL::text END),
                hapmap.geno.idlink_id, hapmap.geno.anno_id, pheno.patientid, pheno.phenotype, sex.code
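
I suppose the first thing to rule out is stale statistics, something like the 
following (though this may not help if the misestimate comes from join 
selectivity rather than from the per-table statistics):

SELECT relname, reltuples::bigint AS estimated_rows, relpages
FROM   pg_class
WHERE  relname = 'geno';

ANALYZE VERBOSE geno;                                       -- recompute the per-table statistics
ALTER TABLE geno ALTER COLUMN anno_id SET STATISTICS 1000;  -- larger sample for the join column
ANALYZE geno;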


  Faheem.

On Tue, 30 Mar 2010, Kevin Grittner wrote:


Faheem Mitha fah...@email.unc.edu wrote:


If you're concerned about memory usage, try reducing work_mem;
you've probably got it set to something huge.


work_mem = 1 GB (see diag.{tex/pdf}).

The point isn't that I'm using so much memory. Again, my question
is, why are these changes affecting memory usage so drastically?


Because the planner looks at a very wide variety of plans, some of
which may use many allocations of work_mem size, and some of which
don't.  The costs are compared and the lowest cost one is chosen. If
you are close to the tipping point then even a very small change
might affect which is chosen.  It pays to keep the work_mem setting
sane so that unexpected plan changes don't cause problems.

Look at the plans and their costs to get a feel for what's being
chosen and why.  Although it's a very bad idea to use these in
production, you can often shift the plan to something you *think*
would be better using the enable_* settings, to see what the planner
thinks such a plan will cost and where it thinks the cost would be;
that can help in tuning the settings.


You might need to create some indices, too.


Ok. To what purpose? This query picks up everything from the
tables and the planner does table scans, so conventional wisdom
and indeed my experience, says that indexes are not going to be so
useful.


There are situations where scanning the entire table to build up a
hash table is more expensive than using an index.  Why not test it?

-Kevin






Re: [PERFORM] experiments in query optimization

2010-03-29 Thread Faheem Mitha



On Mon, 29 Mar 2010, Robert Haas wrote:


On Thu, Mar 25, 2010 at 3:57 PM, Faheem Mitha fah...@email.unc.edu wrote:


Hi everyone,

I've been trying to reduce both memory usage and runtime for a query.
Comments/suggestions gratefully received. Details are at

http://bulldog.duhs.duke.edu/~faheem/snppy/opt.pdf

See particularly Section 1 - Background and Discussion.

If you want a text version, see

http://bulldog.duhs.duke.edu/~faheem/snppy/opt.tex

For background see

http://bulldog.duhs.duke.edu/~faheem/snppy/diag.pdf (text version
http://bulldog.duhs.duke.edu/~faheem/snppy/diag.tex) and
http://bulldog.duhs.duke.edu/~faheem/snppy/snppy.pdf

Please CC any replies to me at the above email address. Thanks.


Didn't you (or someone) post about these queries before?


I did write to the list about an earlier version of these queries, yes. In 
fact you replied to that message.



It's not really too clear to me from reading this what specific
questions you're trying to answer.


Quote from opt.{tex/pdf}, Section 1:

If I have to I can use Section~\ref{ped_hybrid} and 
Section~\ref{tped_hybrid}, but I am left wondering why I get the 
performance I do out of the earlier versions. Specifically, why is 
Section~\ref{ped_bigjoin} so much slower than Section~\ref{ped_trunc}, and 
why does the memory usage in Section~\ref{ped_phenoout} blow up relative 
to Section~\ref{ped_bigjoin} and Section~\ref{ped_trunc}?



One random thought: WHERE row_number() = 1 is not too efficient.
Try using LIMIT or DISTINCT ON instead.


Possibly. However, the CTE that uses

WHERE row_number() = 1

doesn't dominate the runtime or memory usage, so I'm not too concerned
about it.

If you're concerned about memory usage, try reducing work_mem; you've 
probably got it set to something huge.


work_mem = 1 GB (see diag.{tex/pdf}).

The point isn't that I'm using so much memory. Again, my question is, why 
are these changes affecting memory usage so drastically?



You might need to create some indices, too.


Ok. To what purpose? This query picks up everything from the tables and 
the planner does table scans, so conventional wisdom and indeed my 
experience, says that indexes are not going to be so useful.


Regards, Faheem.



Re: [PERFORM] experiments in query optimization

2010-03-29 Thread Faheem Mitha



On Mon, 29 Mar 2010, Robert Haas wrote:


On Mon, Mar 29, 2010 at 2:31 PM, Faheem Mitha fah...@email.unc.edu wrote:

It's not really too clear to me from reading this what specific
questions you're trying to answer.


Quote from opt.{tex/pdf}, Section 1:

If I have to I can use Section~\ref{ped_hybrid} and
Section~\ref{tped_hybrid}, but I am left wondering why I get the performance
I do out of the earlier versions. Specifically, why is
Section~\ref{ped_bigjoin} so much slower than Section~\ref{ped_trunc}, and
why does the memory usage in Section~\ref{ped_phenoout} blow up relative to
Section~\ref{ped_bigjoin} and Section~\ref{ped_trunc}?


Here and in the document, you refer to section numbers for the
hybrid version but I don't see where you define what the hybrid
version actually is.


It is defined later in the file. I don't know if you are looking at the 
pdf, but if so, it is Section 2.4 (for the hybrid PED query). In the text 
file, I guess the easiest way would be to grep for the label ped_hybrid.


And the differences between your queries are not real clear either - 
first you say you took out pheno and sex because they weren't necessary, 
but then you decide to put them back.  I don't know what that means. 
If they're not necessary, leave them out.


I don't see where I say that pheno and sex weren't necessary. In fact, the 
word 'necessary' does not appear in the opt document. I took them out to 
see how it would affect performance. Which it does, dramatically. I say


So, I decided to remove the joins to tables corresponding to the patient 
data, namely pheno and sex, and the runtime dropped to 150 min, while the 
memory stayed around 5G.


Maybe I wasn't being sufficiently explicit here. Perhaps

So, I decided to remove the joins to tables corresponding to the patient
data, namely pheno and sex, to see how it would affect performance...

would have been better.


One random thought: WHERE row_number() = 1 is not too efficient.
Try using LIMIT or DISTINCT ON instead.


Possibly. However, the CTE that uses

WHERE row_number() = 1

doesn't dominate the runtime or memory usage, so I'm not too concerned
about it.


Hmm, you might be right.


If you're concerned about memory usage, try reducing work_mem; you've
probably got it set to something huge.


work_mem = 1 GB (see diag.{tex/pdf}).

The point isn't that I'm using so much memory. Again, my question is, why
are these changes affecting memory usage so drastically?


Well each sort or hash can use an amount of memory that is limited
from above by work_mem.  So if you write the query in a way that
involves more sorts or hashes, each one can add up to 1GB to your
memory usage, plus overhead.  However, it doesn't look like any of
your queries includes 30 sorts or hashes, so I'm thinking that the
RSS number probably also includes some of the shared memory that has
been mapped into each backend's address space.  RSS is not a terribly
reliable number when dealing with shared memory; it's hard to say what
that really means.
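
One way to see where the per-node memory actually goes is to cap work_mem for 
the session and look at what each node reports, something like:

SET work_mem = '64MB';             -- arbitrary value, just for the experiment
EXPLAIN ANALYZE
SELECT idlink_id, anno_id, snpval_id
FROM   geno
ORDER  BY idlink_id, anno_id;
-- Each Sort node then reports a "Sort Method" line (quicksort in memory versus
-- external merge on disk), and each Hash node reports batches and memory used;
-- that is where the work_mem-sized allocations show up.
RESET work_mem;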



You might need to create some indices, too.



Ok. To what purpose? This query picks up everything from the tables and the
planner does table scans, so conventional wisdom and indeed my experience,
says that indexes are not going to be so useful.


Well, a hash join is not usually the first thing that pops to mind when 
dealing with a table that has 825 million rows (geno).  I don't know if 
a nested loop with inner-indexscan would be faster, but it would almost 
certainly use less memory.


Can you provide an illustration of what you mean? I don't know what a 
nested loop with inner-indexscan is in this context.


   Regards, Faheem.



[PERFORM] experiments in query optimization

2010-03-25 Thread Faheem Mitha


Hi everyone,

I've been trying to reduce both memory usage and runtime for a query. 
Comments/suggestions gratefully received. Details are at


http://bulldog.duhs.duke.edu/~faheem/snppy/opt.pdf

See particularly Section 1 - Background and Discussion.

If you want a text version, see

http://bulldog.duhs.duke.edu/~faheem/snppy/opt.tex

For background see

http://bulldog.duhs.duke.edu/~faheem/snppy/diag.pdf (text version 
http://bulldog.duhs.duke.edu/~faheem/snppy/diag.tex) and 
http://bulldog.duhs.duke.edu/~faheem/snppy/snppy.pdf


Please CC any replies to me at the above email address. Thanks.

 Regards, Faheem.



Re: [PERFORM] query optimization

2009-11-27 Thread Faheem Mitha



On Wed, 25 Nov 2009, Robert Haas wrote:


On Wed, Nov 25, 2009 at 5:54 PM, Faheem Mitha fah...@email.unc.edu wrote:



Well, any method of DISTINCT-ifying is likely to be somewhat slow, but
I've had good luck with SELECT DISTINCT ON (...) in the past, as
compared with other methods.  YMMV - the only way to find out is to
benchmark it.  I don't think it's non-deterministic if you order by
the DISTINCT-ON columns and enough extras to break any ties - you
should get the first one of each set.
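
Using the rsid and id columns from your dedup CTE as an example, the 
deterministic form would be something like:

SELECT DISTINCT ON (anno.rsid) anno.*
FROM   anno
ORDER  BY anno.rsid, anno.id;   -- ties on rsid broken by id, so the first row per rsid is stable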


Right, but adding enough extras to break ties is up to the user, and the 
language doesn't guarantee anything, so it feels more fragile.



If not, you might want to look at some way of pre-marking the
non-duplicate rows so that you don't have to recompute that each time.


What are the options re pre-marking?


Well, what I usually do is - if I'm going to do the same
distinct-ification frequently, I add an extra column (say, a boolean)
and set it to true for all and only those rows which will pass the
distinct-ification filter.  Then I can just say WHERE that column
name.
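
A rough sketch of what I mean (column and index names invented, and the dedup 
rule simplified to "lowest id per rsid" for illustration):

ALTER TABLE anno ADD COLUMN is_dedup boolean NOT NULL DEFAULT false;

UPDATE anno
SET    is_dedup = true
WHERE  id IN (SELECT min(id) FROM anno GROUP BY rsid);

-- A partial index keeps the later "WHERE is_dedup" filter and join cheap.
CREATE INDEX anno_is_dedup_idx ON anno (id) WHERE is_dedup;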


Yes, I see. The problem with premarking is that the selection is 
somewhat dynamic, in the sense that this depends on the idlink table, 
which depends on patient data, which can change.



Then you might be able to use the underlying table directly in the next
CTE, which will usually permit better optimization, more use of indices,
etc.  It seems pretty unfortunate that dedup_patient_anno joins against geno
and then patient_geno does what appears to be the same join again. Is there
some way to eliminate that?  If so it will probably help.


You don't say whether you are looking at the PED or TPED query, so I'll
assume PED. They are similar anyway.

I see your point re the joins. You mean

anno INNER JOIN geno

followed by

geno INNER JOIN dedup_patient_anno

? I think the point of the first join is to reduce the anno table based on
information from the geno table. The result is basically a subset of the
anno table with some potential duplication removed, which is then re-joined
to the geno table. I agree this seems a bit suboptimal, and there might be a
better way to do this.


Yeah, I didn't think about it in detail, but it looks like it should
be possible.  Eliminating joins can sometimes have *dramatic* effects
on query performance, and it never hurts.


Failing all else, couldn't I smoosh together the two queries and do a 
triple join? For reference, the two CTEs in question, from the PED query, 
are as follows.


dedup_patient_anno AS
  ( SELECT *
    FROM  (SELECT *,
                  row_number() OVER (PARTITION BY anno.rsid ORDER BY anno.id)
           FROM   anno
           INNER JOIN geno
           ON     anno.id = geno.anno_id
           WHERE  idlink_id =
                  (SELECT MIN(id)
                   FROM idlink
                  )
          ) AS s
    WHERE row_number = '1'
  ),
patient_geno AS
  ( SELECT geno.idlink_id AS idlink_id,
           geno.anno_id AS anno_id,
           geno.snpval_id AS snpval_id,
           allelea_id, alleleb_id
    FROM   geno
    INNER JOIN dedup_patient_anno
    ON     geno.anno_id = dedup_patient_anno.id
  ),
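
For what it's worth, here is a sketch of what the smooshed version might look 
like: a single CTE with DISTINCT ON replacing the two above, with the column 
list trimmed to what patient_geno actually uses. Untested.

patient_geno AS
  ( SELECT geno.idlink_id, geno.anno_id, geno.snpval_id,
           a.allelea_id, a.alleleb_id
    FROM   geno
    INNER JOIN (SELECT DISTINCT ON (anno.rsid)
                       anno.rsid, anno.id, anno.allelea_id, anno.alleleb_id
                FROM   anno
                INNER JOIN geno ON anno.id = geno.anno_id
                WHERE  geno.idlink_id = (SELECT MIN(id) FROM idlink)
                ORDER  BY anno.rsid, anno.id) AS a
    ON     geno.anno_id = a.id
  ),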

  Regards, Faheem.


Re: [PERFORM] query optimization

2009-11-25 Thread Faheem Mitha


Hi Robert,

Thanks very much for your suggestions.

On Wed, 25 Nov 2009, Robert Haas wrote:


On Mon, Nov 23, 2009 at 5:47 PM, Faheem Mitha fah...@email.unc.edu wrote:


Hi everybody,

I've got two queries that need optimizing. Actually, there are others, 
but these are pretty representative.


You can see the queries and the corresponding plans at

http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.pdf

or

http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.tex

if you prefer text (latex file, effectively text in this case)

The background to this is at
http://bulldog.duhs.duke.edu/~faheem/snpdb/diag.pdf

If more details are required, let me know and I can add them. I'd appreciate
suggestions about how to make these queries go faster.

Please CC this email address on any replies.


I've found that a good way to approach optimizing queries of this type
is to look at the EXPLAIN ANALYZE results and figure out which parts
of the query are slow.  Then simplify the rest of the query as much as
possible without eliminating the slowness.  Then try to figure out how
to optimize the simplified query: rewrite the logic, add indices,
change the schema, etc.  Lastly start adding the other bits back in.


Good strategy. Now I just have to understand EXPLAIN ANALYZE well enough 
to figure out which bits are slow. :-)



It looks like the dedup_patient_anno CTE is part of your problem.  Try
pulling that piece out and optimizing it separately.  I wonder if that
could be rewritten to use SELECT DISTINCT ON (...) and whether that
would be any faster.


Isn't SELECT DISTINCT supposed to be evil, since in general the result is 
not deterministic? I think I had SELECT DISTINCT earlier, and removed it 
because of that, with the help of Andrew (RhodiumToad on #postgresql). I 
didn't compare the corresponding subqueries separately, so I don't know what 
speed difference this made.


If not, you might want to look at some way of pre-marking the 
non-duplicate rows so that you don't have to recompute that each time.


What are the options re pre-marking?

Then you might be able to use the underlying table directly in the next 
CTE, which will usually permit better optimization, more use of indices, 
etc.  It seems pretty unfortunate that dedup_patient_anno joins against 
geno and then patient_geno does what appears to be the same join again. 
Is there some way to eliminate that?  If so it will probably help.


You don't say whether you are looking at the PED or TPED query, so I'll 
assume PED. They are similar anyway.


I see your point re the joins. You mean

anno INNER JOIN geno

followed by

geno INNER JOIN dedup_patient_anno

? I think the point of the first join is to reduce the anno table based on 
information from the geno table. The result is basically a subset of the 
anno table with some potential duplication removed, which is then 
re-joined to the geno table. I agree this seems a bit suboptimal, and 
there might be a better way to do this.



Once you've got those parts of the query as well-optimized as you can,
add the next pieces in and start hacking on those.


  Regards, Faheem.



[PERFORM] query optimization

2009-11-23 Thread Faheem Mitha


Hi everybody,

I've got two queries that need optimizing. Actually, there are others, 
but these are pretty representative.


You can see the queries and the corresponding plans at

http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.pdf

or

http://bulldog.duhs.duke.edu/~faheem/snpdb/opt.tex

if you prefer text (latex file, effectively text in this case)

The background to this is at 
http://bulldog.duhs.duke.edu/~faheem/snpdb/diag.pdf


If more details are required, let me know and I can add them. I'd 
appreciate suggestions about how to make these queries go faster.


Please CC this email address on any replies.

   Regards, Faheem.



Re: [PERFORM] query optimization

2009-11-23 Thread Faheem Mitha



On Mon, 23 Nov 2009, Thom Brown wrote:


Hi Faheem,

There appears to be a discrepancy between the 2 PDFs you provided.  One 
says you're using PostgreSQL 8.3, and the other shows you using common 
table expressions, which are only available in 8.4+.


Yes, sorry. I'm using Postgresql 8.4. I guess I should go through diag.pdf 
and make sure all the information is current. Thanks for pointing out my 
error.


  Regards, Faheem.


Re: [PERFORM] query optimization

2009-11-23 Thread Faheem Mitha



On Tue, 24 Nov 2009, Sebastian Jörgensen wrote:


How often are the tables you query from updated?


Quite rarely. Once in a while. The large tables, eg. geno, are basically 
static.


 Regards, Faheem.


Rgds
Sebastian

On Tue, Nov 24, 2009 at 12:52 AM, marcin mank marcin.m...@gmail.com wrote:
  On Tue, Nov 24, 2009 at 12:49 AM, Faheem Mitha fah...@email.unc.edu 
wrote:
  
   Yes, sorry. I'm using Postgresql 8.4. I guess I should go through 
diag.pdf
   and make sure all the information is current. Thanks for pointing out my
   error.
  

excellent report!

about the copy problem: You seem to have created the primary key
before doing the copy (at least that's what the dump before copy
says). This is bad. Create it after the copy.
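
Roughly like this (the primary key definition is the one from the dump; the
COPY source path is just a placeholder):

-- Table created (or truncated) without the primary key, data loaded first,
-- then the primary key index is built once over the finished table.
COPY geno (idlink_id, anno_id, snpval_id) FROM '/path/to/geno.dat';
ALTER TABLE ONLY geno
    ADD CONSTRAINT geno_pkey PRIMARY KEY (idlink_id, anno_id);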

Greetings
Marcin





