Re: [PERFORM] Why Wal_buffer is 64KB

2010-03-29 Thread Tadipathri Raghu
Hi All,

Thank you for all the support.

I have noticed one more thing here: if you turn off fsync and try to run
a transaction, it breaks the current filenode and generates another
filenode. Is it true that whenever you turn fsync off or on, the filenode
will break and one more will be created for that table?

Regards
Raghavendra

On Fri, Mar 26, 2010 at 7:30 PM, Scott Marlowe scott.marl...@gmail.com wrote:

 On Fri, Mar 26, 2010 at 7:43 AM, Pierre C li...@peufeu.com wrote:
 
  After fsync/syncronous_commit off
 
  Do not use fsync off, it is not safe. Who cares about the performance of
  fsync=off, when in practice you'd never use it with real data.
  synchronous_commit=off is fine for some applications, though.

 There are situations where it's ok, when all the data are
 reproducible from other sources, etc.  For instance I have a
 reporting server that is a slony slave that runs with fsync off.  If
 it does crash, I can recreate the node in an hour or so and be back
 online.  With fsync on, the machine is too slow to do its job, and
 it's not the primary repo of the real data, so it's ok there.



Re: [PERFORM] Why Wal_buffer is 64KB

2010-03-29 Thread Scott Marlowe
On Mon, Mar 29, 2010 at 12:00 AM, Tadipathri Raghu traghu@gmail.com wrote:
 Hi All,

 Thank you for all the support.

 I have noticed one more thing here: if you turn off fsync and try to run
 a transaction, it breaks the current filenode and generates another
 filenode. Is it true that whenever you turn fsync off or on, the filenode
 will break and one more will be created for that table?

From what I understand, with fsync on or off the same stuff gets
written.  It's just not guaranteed to go out in the right order or
right away, but it will go out eventually.
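For anyone who wants to check this, a minimal sketch; the table name
"test" is borrowed from earlier in the thread, and pg_class.relfilenode
is the on-disk filenode being discussed:

  -- Look up the table's current on-disk filenode:
  SELECT relname, relfilenode FROM pg_class WHERE relname = 'test';
  -- Rerun after the operation in question.  Operations that rewrite the
  -- relation (TRUNCATE, CLUSTER, VACUUM FULL, ...) assign a new filenode;
  -- toggling fsync by itself should not rewrite the relation.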



Re: [PERFORM] Why Wal_buffer is 64KB

2010-03-29 Thread Tadipathri Raghu
Hi Scott,

Yes. May I know the particular reason for this behaviour? Is it looking
for consistency? I haven't got a clear picture here.
Could you please explain this?

Thanks & Regards
Raghavendra

On Mon, Mar 29, 2010 at 12:15 PM, Scott Marlowe scott.marl...@gmail.com wrote:

 On Mon, Mar 29, 2010 at 12:00 AM, Tadipathri Raghu traghu@gmail.com
 wrote:
  Hi All,
 
  Thank you for all the support.
 
  I have noticed one more thing here: if you turn off fsync and try to run
  a transaction, it breaks the current filenode and generates another
  filenode. Is it true that whenever you turn fsync off or on, the
  filenode will break and one more will be created for that table?

 From what I understand, with fsync on or off the same stuff gets
 written.  It's just not guaranteed to go out in the right order or
 right away, but it will go out eventually.



Re: [PERFORM] Optimizer showing wrong rows in plan

2010-03-29 Thread Matthew Wakeling

On Mon, 29 Mar 2010, Tadipathri Raghu wrote:

As per the documentation, one page is 8kb, and when I create a table with
one int column, each value is 4 bytes. If I insert 2000 rows, they should
fit in one 8kb page, but the table is extending far beyond that, which is
not what I expected. Example shown below, taking the previous example
table test with one column.


There is more to a row than just the single int column. The space used by 
a row will include a row start marker (data length), transaction 
ids, hint bits, an oid, a description of the types of the columns, and 
finally your data columns. That takes a bit more space.
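One rough way to put numbers on that (a sketch, rebuilding the "test"
table from earlier in the thread; pg_relation_size reports the table's
on-disk size in bytes):

  CREATE TABLE test (a int);
  INSERT INTO test SELECT generate_series(1, 2000);
  -- Total bytes on disk, and average bytes per row including all overhead:
  SELECT pg_relation_size('test') AS total_bytes,
         pg_relation_size('test') / 2000.0 AS bytes_per_row;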


Matthew

--
If you let your happiness depend upon how somebody else feels about you,
now you have to control how somebody else feels about you. -- Abraham Hicks



Re: [PERFORM] Optimizer showing wrong rows in plan

2010-03-29 Thread raghavendra t
Hi Matthew,

Thank you for the information.

Once again, I'd like to thank each and every one in this thread for their
ultimate support.

Regards
Raghavendra

On Mon, Mar 29, 2010 at 4:47 PM, Matthew Wakeling matt...@flymine.org wrote:

 On Mon, 29 Mar 2010, Tadipathri Raghu wrote:

 As per the documentation, one page is 8kb, and when I create a table with
 one int column, each value is 4 bytes. If I insert 2000 rows, they should
 fit in one 8kb page, but the table is extending far beyond that, which is
 not what I expected. Example shown below, taking the previous example
 table test with one column.


 There is more to a row than just the single int column. The space used by a
 row will include a row start marker (data length), transaction ids,
 hint bits, an oid, a description of the types of the columns, and finally
 your data columns. That takes a bit more space.

 Matthew

 --
 If you let your happiness depend upon how somebody else feels about you,
 now you have to control how somebody else feels about you. -- Abraham Hicks




Re: [PERFORM] Optimizer showing wrong rows in plan

2010-03-29 Thread Nikolas Everett
See http://www.postgresql.org/docs/current/static/storage-page-layout.html for
all of what is taking up the space.  Short version:
 Per-block overhead is 24 bytes
 Per-row overhead is 23 bytes + some alignment loss + the null bitmap if you
have nullable columns
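As a worked example for the one-int-column table discussed above (a
sketch; the figures assume 8KB pages and a 64-bit build, where MAXALIGN
is 8):

  -- 23-byte tuple header, padded to 24 for alignment,
  -- + 4 bytes of int data, tuple rounded up to 32 bytes,
  -- + 4-byte line pointer                          = 36 bytes per row.
  -- Usable space per page: 8192 - 24 (page header) = 8168 bytes,
  -- so floor(8168 / 36) = 226 rows per page, and 2000 rows take
  -- ceil(2000 / 226) = 9 pages rather than 1.
  SELECT relpages FROM pg_class WHERE relname = 'test';  -- after a VACUUM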

On Mon, Mar 29, 2010 at 8:24 AM, raghavendra t raagavendra@gmail.com wrote:

 Hi Matthew,

 Thank you for the information.

 Once again, I'd like to thank each and every one in this thread for their
 ultimate support.

 Regards
 Raghavendra

 On Mon, Mar 29, 2010 at 4:47 PM, Matthew Wakeling matt...@flymine.org wrote:

 On Mon, 29 Mar 2010, Tadipathri Raghu wrote:

 As per the documentation, one page is 8kb, and when I create a table with
 one int column, each value is 4 bytes. If I insert 2000 rows, they should
 fit in one 8kb page, but the table is extending far beyond that, which is
 not what I expected. Example shown below, taking the previous example
 table test with one column.


 There is more to a row than just the single int column. The space used by
 a row will include a row start marker (data length), transaction ids,
 hint bits, an oid, a description of the types of the columns, and finally
 your data columns. That takes a bit more space.

 Matthew

 --
 If you let your happiness depend upon how somebody else feels about you,
 now you have to control how somebody else feels about you. -- Abraham
 Hicks






[PERFORM] How much memory is PostgreSQL using

2010-03-29 Thread Campbell, Lance
PostgreSQL 8.4.3

OS: Linux Red Hat 4.x

 

I changed my strategy with PostgreSQL recently to use a large segment of
memory for shared buffers, with the idea of caching disk blocks.  How can
I see how much memory PostgreSQL is using for this?

I tried:

ps aux | grep post | sort -k4

This lists the processes using memory at the bottom.  Are they sharing
memory, or is each using its own blocks of memory?

When I do top, I see that VIRT is the value of my shared buffers plus a
tiny bit.  I see %MEM is only 2.4%, 2.6%, 1.0%, 1.5%, and 1.1% for all of
the running processes.  Do I add these percentages up to see what amount
of VIRT I am really using?

Or is there some way to ask PostgreSQL "how much memory are you using to
cache disk blocks currently?"

When you do a PG_DUMP, does PostgreSQL put the disk blocks into shared
buffers as it runs?

 

Thanks,

Lance Campbell
Software Architect/DBA/Project Manager
Web Services at Public Affairs
217-333-0382

 



[PERFORM] Performance regarding LIKE searches

2010-03-29 Thread randalls
Hi,

I am querying a Postgresql 8.3 database table that has approximately 22 million 
records.  The (explain analyze) query is listed below:

gdr_gbrowse_live=> explain analyze SELECT f.id,f.object,f.typeid,f.seqid,f.start,f.end,f.strand
FROM feature as f, name as n
WHERE (n.id=f.id AND lower(n.name) LIKE 'Scaffold:scaffold_163:1000..1199%' AND n.display_name > 0);
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.01..5899.93 rows=734 width=884) (actual time=0.033..0.033 rows=0 loops=1)
   ->  Index Scan using name_name_lower_pattern_ops_idx on name n  (cost=0.01..9.53 rows=734 width=4) (actual time=0.032..0.032 rows=0 loops=1)
         Index Cond: ((lower((name)::text) ~>=~ 'Scaffold:scaffold'::text) AND (lower((name)::text) ~<~ 'Scaffold:scaffole'::text))
         Filter: ((display_name > 0) AND (lower((name)::text) ~~ 'Scaffold:scaffold_163:1000..1199%'::text))
   ->  Index Scan using feature_pkey on feature f  (cost=0.00..8.01 rows=1 width=884) (never executed)
         Index Cond: (f.id = n.id)
 Total runtime: 0.119 ms
(7 rows)

I can see I am hitting the index that I created using the 
varchar_pattern_ops setting.  This is very fast and performs like I would 
expect.  However, when my application, GBrowse, accesses the database, I see 
in my slow query log this:

2010-03-29 09:34:38.083 
PDT,gdr_gbrowse_live,gdr_gbrowse_live,11649,10.0.0.235:59043,4bb0399d.2d81,8,SELECT,2010-03-28
 22:24:45 PDT,4/118607,0,LOG,0,duration: 21467.467 ms  execute 
dbdpg_p25965_9: SELECT f.id,f.object,f.typeid,f.seqid,f.start,f.end,f.strand
  FROM feature as f, name as n
  WHERE (n.id=f.id AND lower(n.name) LIKE $1)
  
,parameters: $1 = 'Scaffold:scaffold\_163:1000..1199%',,,

GBrowse is a perl based application.  Looking at the log, the duration for 
this query is around 21 seconds.  That is a bit long.  Does anyone have any 
ideas why the query duration is so different?
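(For reference, the pattern-ops index named in the plan above would have
been built along these lines; a sketch, since the actual DDL wasn't
posted:)

  -- Expression index usable by left-anchored LIKE in a non-C locale:
  CREATE INDEX name_name_lower_pattern_ops_idx
      ON name (lower(name) varchar_pattern_ops);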

Randall Svancara
Systems Administrator/DBA/Developer
Main Bioinformatics Laboratory





Re: [PERFORM] How much memory is PostgreSQL using

2010-03-29 Thread Greg Smith

Campbell, Lance wrote:


Or is there some way to ask PostgreSQL how much memory are you using 
to cache disk blocks currently?




You can install contrib/pg_buffercache into each database and count how 
many used blocks are there.  Note that running queries using that 
diagnostic tool is really intensive due to the locks it takes, so be 
careful not to do that often on a production system.
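For example, a minimal sketch (in 8.4 pg_buffercache is installed by
running the contrib SQL script into the target database; the path varies
by packaging):

  -- Count the shared buffers currently holding a page; each buffer is
  -- 8KB by default:
  SELECT count(*) AS buffers_in_use,
         pg_size_pretty(count(*) * 8192) AS memory_used
  FROM pg_buffercache
  WHERE relfilenode IS NOT NULL;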



When you do a PG_DUMP does PostgreSQL put the disk blocks into shared 
buffers as it runs?




To some extent.  Most pg_dump activity involves sequential scans that 
are reading an entire table.  Those are no different from any other 
process that will put disk blocks into shared_buffers.  However, that 
usage pattern makes pg_dump particularly likely to run into an 
optimization in 8.3 and later that limits how much of shared_buffers is 
used when sequentially scanning a large table.  See P10 of 
http://www.westnet.com/~gsmith/content/postgresql/InsideBufferCache.pdf 
for the exact implementation.  Basically, anything bigger than 
shared_buffers / 4 uses a 256K ring to limit its cache use, but it's a 
little more complicated than that.
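A related sketch, if you want to watch that effect: break buffer usage
down by relation (the relfilenode join only resolves relations in the
current database), and compare before and after a pg_dump-style
sequential scan of a large table. With the ring-buffer optimization, the
big table should claim only a small share.

  SELECT c.relname, count(*) AS buffers
  FROM pg_buffercache b
  JOIN pg_class c ON b.relfilenode = c.relfilenode
  GROUP BY c.relname
  ORDER BY buffers DESC
  LIMIT 10;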


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us




Re: [PERFORM] Performance regarding LIKE searches

2010-03-29 Thread Tom Lane
randa...@bioinfo.wsu.edu writes:
 I can see I am hitting the index that I created using the
 varchar_pattern_ops setting.  This is very fast and performs like I would
 expect.  However, when my application, GBrowse, accesses the database, I
 see in my slow query log this:

 2010-03-29 09:34:38.083 
 PDT,gdr_gbrowse_live,gdr_gbrowse_live,11649,10.0.0.235:59043,4bb0399d.2d81,8,SELECT,2010-03-28
  22:24:45 PDT,4/118607,0,LOG,0,duration: 21467.467 ms  execute 
 dbdpg_p25965_9: SELECT f.id,f.object,f.typeid,f.seqid,f.start,f.end,f.strand
   FROM feature as f, name as n
   WHERE (n.id=f.id AND lower(n.name) LIKE $1)
  
 ,parameters: $1 = 'Scaffold:scaffold\_163:1000..1199%',,,

 GBrowse is a perl based application.  Looking at the log, the duration for
 this query is around 21 seconds.  That is a bit long.  Does anyone have
 any ideas why the query duration is so different?

You're not going to get an index optimization when the LIKE pattern
isn't a constant (and left-anchored, but this is).

It is possible to get the planner to treat a query parameter as a
constant (implying a re-plan on each execution instead of having a
cached plan).  I believe what you have to do at the moment is use
unnamed rather than named prepared statements.  The practicality of
this would depend a lot on your client-side software stack, which
you didn't mention.
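One way to see this effect from psql (a minimal sketch of the cached-plan
behaviour described above, simplified from the query in this thread):

  -- Prepared with a parameter: the plan is built once, before the pattern
  -- is known, so the pattern-ops index range scan is unavailable:
  PREPARE q(text) AS
    SELECT f.id FROM feature AS f, name AS n
    WHERE n.id = f.id AND lower(n.name) LIKE $1;
  EXPLAIN EXECUTE q('Scaffold:scaffold_163:1000..1199%');

  -- The same query with a literal constant can use the index:
  EXPLAIN SELECT f.id FROM feature AS f, name AS n
  WHERE n.id = f.id AND lower(n.name) LIKE 'Scaffold:scaffold_163:1000..1199%';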

regards, tom lane



Re: [PERFORM] Performance regarding LIKE searches

2010-03-29 Thread randalls
Tom,

We are using perl 5.10 with the PostgreSQL DBD.  Can you point me in the 
right direction in terms of unnamed and named prepared statements?

Thanks,

Randall Svancara
Systems Administrator/DBA/Developer
Main Bioinformatics Laboratory



- Original Message -
From: Tom Lane t...@sss.pgh.pa.us
To: randa...@bioinfo.wsu.edu
Cc: pgsql-performance@postgresql.org
Sent: Monday, March 29, 2010 10:00:03 AM
Subject: Re: [PERFORM] Performance regarding LIKE searches 

randa...@bioinfo.wsu.edu writes:
 I can see I am hitting the index that I created using the
 varchar_pattern_ops setting.  This is very fast and performs like I would
 expect.  However, when my application, GBrowse, accesses the database, I
 see in my slow query log this:

 2010-03-29 09:34:38.083 
 PDT,gdr_gbrowse_live,gdr_gbrowse_live,11649,10.0.0.235:59043,4bb0399d.2d81,8,SELECT,2010-03-28
  22:24:45 PDT,4/118607,0,LOG,0,duration: 21467.467 ms  execute 
 dbdpg_p25965_9: SELECT f.id,f.object,f.typeid,f.seqid,f.start,f.end,f.strand
   FROM feature as f, name as n
   WHERE (n.id=f.id AND lower(n.name) LIKE $1)
  
 ,parameters: $1 = 'Scaffold:scaffold\_163:1000..1199%',,,

 GBrowse is a perl based application.  Looking at the log, the duration for
 this query is around 21 seconds.  That is a bit long.  Does anyone have
 any ideas why the query duration is so different?

You're not going to get an index optimization when the LIKE pattern
isn't a constant (and left-anchored, but this is).

It is possible to get the planner to treat a query parameter as a
constant (implying a re-plan on each execution instead of having a
cached plan).  I believe what you have to do at the moment is use
unnamed rather than named prepared statements.  The practicality of
this would depend a lot on your client-side software stack, which
you didn't mention.

regards, tom lane



Re: [PERFORM] Performance regarding LIKE searches

2010-03-29 Thread Andy Colson

On 3/29/2010 12:23 PM, randa...@bioinfo.wsu.edu wrote:

Tom,

We are using perl 5.10 with the PostgreSQL DBD.  Can you point me in the 
right direction in terms of unnamed and named prepared statements?

Thanks,

Randall Svancara
Systems Administrator/DBA/Developer
Main Bioinformatics Laboratory



- Original Message -
From: Tom Lane t...@sss.pgh.pa.us
To: randa...@bioinfo.wsu.edu
Cc: pgsql-performance@postgresql.org
Sent: Monday, March 29, 2010 10:00:03 AM
Subject: Re: [PERFORM] Performance regarding LIKE searches

randa...@bioinfo.wsu.edu writes:

I can see I am hitting the index that I created using the 
varchar_pattern_ops setting.  This is very fast and performs like I would 
expect.  However, when my application, GBrowse, accesses the database, I see 
in my slow query log this:



2010-03-29 09:34:38.083 
PDT,gdr_gbrowse_live,gdr_gbrowse_live,11649,10.0.0.235:59043,4bb0399d.2d81,8,SELECT,2010-03-28
 22:24:45 PDT,4/118607,0,LOG,0,duration: 21467.467 ms  execute dbdpg_p25965_9: SELECT 
f.id,f.object,f.typeid,f.seqid,f.start,f.end,f.strand
   FROM feature as f, name as n
   WHERE (n.id=f.id AND lower(n.name) LIKE $1)



,parameters: $1 = 'Scaffold:scaffold\_163:1000..1199%',,,



GBrowse is a perl based application.  Looking at the log, the duration for 
this query is around 21 seconds.  That is a bit long.  Does anyone have any 
ideas why the query duration is so different?


You're not going to get an index optimization when the LIKE pattern
isn't a constant (and left-anchored, but this is).

It is possible to get the planner to treat a query parameter as a
constant (implying a re-plan on each execution instead of having a
cached plan).  I believe what you have to do at the moment is use
unnamed rather than named prepared statements.  The practicality of
this would depend a lot on your client-side software stack, which
you didn't mention.

regards, tom lane



I'm just going to guess, but DBD::Pg can do a "real" prepare or a "fake" 
prepare.


It does "real" by default.  Try setting:
$dbh->{pg_server_prepare} = 0;

before you prepare/run that statement and see if it makes a difference.

http://search.cpan.org/dist/DBD-Pg/Pg.pm#prepare


-Andy




Re: [PERFORM] experiments in query optimization

2010-03-29 Thread Robert Haas
On Thu, Mar 25, 2010 at 3:57 PM, Faheem Mitha fah...@email.unc.edu wrote:

 Hi everyone,

 I've been trying to reduce both memory usage and runtime for a query.
 Comments/suggestions gratefully received. Details are at

 http://bulldog.duhs.duke.edu/~faheem/snppy/opt.pdf

 See particularly Section 1 - Background and Discussion.

 If you want a text version, see

 http://bulldog.duhs.duke.edu/~faheem/snppy/opt.tex

 For background see

 http://bulldog.duhs.duke.edu/~faheem/snppy/diag.pdf (text version
 http://bulldog.duhs.duke.edu/~faheem/snppy/diag.tex) and
 http://bulldog.duhs.duke.edu/~faheem/snppy/snppy.pdf

 Please CC any replies to me at the above email address. Thanks.

Didn't you (or someone) post about these queries before?

It's not really too clear to me from reading this what specific
questions you're trying to answer.  One random thought: WHERE
row_number() = 1 is not too efficient.  Try using LIMIT or DISTINCT ON
instead.
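For instance, a sketch with hypothetical table and column names (the real
queries are in the linked PDF):

  -- Instead of keeping the first row per group with a window function:
  --   SELECT * FROM (SELECT t.*, row_number() OVER
  --                    (PARTITION BY grp ORDER BY ts) AS rn FROM t) s
  --   WHERE rn = 1;
  -- the same result can usually be written as:
  SELECT DISTINCT ON (grp) *
  FROM t
  ORDER BY grp, ts;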

If you're concerned about memory usage, try reducing work_mem; you've
probably got it set to something huge.

You might need to create some indices, too.

...Robert



Re: [PERFORM] experiments in query optimization

2010-03-29 Thread Faheem Mitha



On Mon, 29 Mar 2010, Robert Haas wrote:


On Thu, Mar 25, 2010 at 3:57 PM, Faheem Mitha fah...@email.unc.edu wrote:


Hi everyone,

I've been trying to reduce both memory usage and runtime for a query.
Comments/suggestions gratefully received. Details are at

http://bulldog.duhs.duke.edu/~faheem/snppy/opt.pdf

See particularly Section 1 - Background and Discussion.

If you want a text version, see

http://bulldog.duhs.duke.edu/~faheem/snppy/opt.tex

For background see

http://bulldog.duhs.duke.edu/~faheem/snppy/diag.pdf (text version
http://bulldog.duhs.duke.edu/~faheem/snppy/diag.tex) and
http://bulldog.duhs.duke.edu/~faheem/snppy/snppy.pdf

Please CC any replies to me at the above email address. Thanks.


Didn't you (or someone) post about these queries before?


I did write to the list about an earlier version of these queries, yes. In 
fact you replied to that message.



It's not really too clear to me from reading this what specific
questions you're trying to answer.


Quote from opt.{tex/pdf}, Section 1:

If I have to I can use Section~\ref{ped_hybrid} and 
Section~\ref{tped_hybrid}, but I am left wondering why I get the 
performance I do out of the earlier versions. Specifically, why is 
Section~\ref{ped_bigjoin} so much slower than Section~\ref{ped_trunc}, and 
why does the memory usage in Section~\ref{ped_phenoout} blow up relative 
to Section~\ref{ped_bigjoin} and Section~\ref{ped_trunc}?



One random thought: WHERE row_number() = 1 is not too efficient.
Try using LIMIT or DISTINCT ON instead.


Possibly. However, the CTE that uses

WHERE row_number() = 1

doesn't dominate the runtime or memory usage, so I'm not too concerned
about it.

If you're concerned about memory usage, try reducing work_mem; you've 
probably got it set to something huge.


work_mem = 1 GB (see diag.{tex/pdf}).

The point isn't that I'm using so much memory. Again, my question is, why 
are these changes affecting memory usage so drastically?



You might need to create some indices, too.


Ok. To what purpose? This query picks up everything from the tables and 
the planner does table scans, so conventional wisdom, and indeed my 
experience, says that indexes are not going to be so useful.


Regards, Faheem.



Re: [PERFORM] experiments in query optimization

2010-03-29 Thread Robert Haas
On Mon, Mar 29, 2010 at 2:31 PM, Faheem Mitha fah...@email.unc.edu wrote:
 It's not really too clear to me from reading this what specific
 questions you're trying to answer.

 Quote from opt.{tex/pdf}, Section 1:

 If I have to I can use Section~\ref{ped_hybrid} and
 Section~\ref{tped_hybrid}, but I am left wondering why I get the performance
 I do out of the earlier versions. Specifically, why is
 Section~\ref{ped_bigjoin} so much slower than Section~\ref{ped_trunc}, and
 why does the memory usage in Section~\ref{ped_phenoout} blow up relative to
 Section~\ref{ped_bigjoin} and Section~\ref{ped_trunc}?

Here and in the document, you refer to section numbers for the
"hybrid" version but I don't see where you define what the "hybrid"
version actually is.  And the differences between your queries are not
real clear either - first you say you took out pheno and sex because
they weren't necessary, but then you decide to put them back.  I don't
know what that means.  If they're not necessary, leave them out.

 One random thought: WHERE row_number() = 1 is not too efficient.
 Try using LIMIT or DISTINCT ON instead.

 Possibly. However, the CTE that uses

 WHERE row_number() = 1

 doesn't dominate the runtime or memory usage, so I'm not too concerned
 about it.

Hmm, you might be right.

 If you're concerned about memory usage, try reducing work_mem; you've
 probably got it set to something huge.

 work_mem = 1 GB (see diag.{tex/pdf}).

 The point isn't that I'm using so much memory. Again, my question is, why
 are these changes affecting memory usage so drastically?

Well each sort or hash can use an amount of memory that is limited
from above by work_mem.  So if you write the query in a way that
involves more sorts or hashes, each one can add up to 1GB to your
memory usage, plus overhead.  However, it doesn't look like any of
your queries includes 30 sorts or hashes, so I'm thinking that the
RSS number probably also includes some of the shared memory that has
been mapped into each backend's address space.  RSS is not a terribly
reliable number when dealing with shared memory; it's hard to say what
that really means.

 You might need to create some indices, too.

 Ok. To what purpose? This query picks up everything from the tables and the
 planner does table scans, so conventional wisdom and indeed my experience,
 says that indexes are not going to be so useful.

Well, a hash join is not usually the first thing that pops to mind
when dealing with a table that has 825 million rows (geno).  I don't
know if a nested loop with inner-indexscan would be faster, but it
would almost certainly use less memory.
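A hypothetical illustration of that plan shape (the join column "anno_id"
is assumed here, not taken from the thread):

  -- With an index on the big table's join column, the planner can probe
  -- the index once per outer row instead of hashing all of geno:
  CREATE INDEX geno_anno_id_idx ON geno (anno_id);
  -- A matching EXPLAIN shape would look roughly like:
  --   Nested Loop
  --     ->  Seq Scan on small_table s
  --     ->  Index Scan using geno_anno_id_idx on geno g
  --           Index Cond: (g.anno_id = s.id)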

...Robert



Re: [PERFORM] experiments in query optimization

2010-03-29 Thread Faheem Mitha



On Mon, 29 Mar 2010, Robert Haas wrote:


On Mon, Mar 29, 2010 at 2:31 PM, Faheem Mitha fah...@email.unc.edu wrote:

It's not really too clear to me from reading this what specific
questions you're trying to answer.


Quote from opt.{tex/pdf}, Section 1:

If I have to I can use Section~\ref{ped_hybrid} and
Section~\ref{tped_hybrid}, but I am left wondering why I get the performance
I do out of the earlier versions. Specifically, why is
Section~\ref{ped_bigjoin} so much slower than Section~\ref{ped_trunc}, and
why does the memory usage in Section~\ref{ped_phenoout} blow up relative to
Section~\ref{ped_bigjoin} and Section~\ref{ped_trunc}?


Here and in the document, you refer to section numbers for the
"hybrid" version but I don't see where you define what the "hybrid"
version actually is.


It is defined later in the file. I don't know if you are looking at the 
pdf, but if so, it is Section 2.4 (for the hybrid PED query). In the text 
file, I guess the easiest way would be to grep for the label ped_hybrid.


And the differences between your queries are not real clear either - 
first you say you took out pheno and sex because they weren't necessary, 
but then you decide to put them back.  I don't know what that means. 
If they're not necessary, leave them out.


I don't see where I say that pheno and sex weren't necessary. In fact, the 
word 'necessary' does not appear in the opt document. I took them out to 
see how it would affect performance. Which it does, dramatically. I say


So, I decided to remove the joins to tables corresponding to the patient 
data, namely pheno and sex, and the runtime dropped to 150 min, while the 
memory stayed around 5G.


Maybe I wasn't being sufficiently explicit here. Perhaps

So, I decided to remove the joins to tables corresponding to the patient
data, namely pheno and sex, to see how it would affect performance...

would have been better.


One random thought: WHERE row_number() = 1 is not too efficient.
Try using LIMIT or DISTINCT ON instead.


Possibly. However, the CTE that uses

WHERE row_number() = 1

doesn't dominate the runtime or memory usage, so I'm not too concerned
about it.


Hmm, you might be right.


If you're concerned about memory usage, try reducing work_mem; you've
probably got it set to something huge.


work_mem = 1 GB (see diag.{tex/pdf}).

The point isn't that I'm using so much memory. Again, my question is, why
are these changes affecting memory usage so drastically?


Well each sort or hash can use an amount of memory that is limited
from above by work_mem.  So if you write the query in a way that
involves more sorts or hashes, each one can add up to 1GB to your
memory usage, plus overhead.  However, it doesn't look like any of
your queries includes 30 sorts or hashes, so I'm thinking that the
RSS number probably also includes some of the shared memory that has
been mapped into each backend's address space.  RSS is not a terribly
reliable number when dealing with shared memory; it's hard to say what
that really means.



You might need to create some indices, too.



Ok. To what purpose? This query picks up everything from the tables and the
planner does table scans, so conventional wisdom and indeed my experience,
says that indexes are not going to be so useful.


Well, a hash join is not usually the first thing that pops to mind when 
dealing with a table that has 825 million rows (geno).  I don't know if 
a nested loop with inner-indexscan would be faster, but it would almost 
certainly use less memory.


Can you provide an illustration of what you mean? I don't know what a 
nested loop with inner-indexscan is in this context.


   Regards, Faheem.
