Re: [PERFORM] Does the psql executable support a fetch many approach when dumping large queries to stdout?

2010-04-02 Thread John Beaver




Ah, you're right. Thanks Hannu, that's a good solution.

Hannu Krosing wrote:

  On Fri, 2010-04-02 at 16:28 -0400, Beaver, John E wrote:
...

I know that the query used here could have been a COPY statement, which I assume would
 be better-behaved, but I'm more concerned about the case in which the query is more complex.

COPY can copy out results of a SELECT query as well.
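
For instance, a minimal sketch of the COPY-based approach (the table and column names here are only placeholders):

COPY (SELECT id, payload FROM some_large_table) TO STDOUT;

This streams rows to the client as they are produced rather than building the whole result in psql's memory first, which is the problem being discussed here.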

  


-- 
John E. Beaver
Bioinformatics Developer
Harvard Medical School




Re: *** PROBABLY SPAM *** [PERFORM] Does the psql executable support a fetch many approach when dumping large queries to stdout?

2010-04-02 Thread John Beaver




That makes sense. I'll just use a COPY statement instead, as Hannu suggests.

Pierre C wrote:

  
Does the psql executable have any ability to do a "fetch many", using a
server-side named cursor, when returning results? It seems like it tries
to retrieve the query result entirely into local memory before printing to
standard out.

I think it grabs the whole result set to calculate the display column
widths. I think there is an option to tweak this but don't remember which;
have a look at the psql commands (\?), formatting section.
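
For the record, the psql setting Pierre is probably thinking of is FETCH_COUNT, which makes psql retrieve and print the result in batches through a cursor instead of buffering it all; a rough sketch (batch size and table name are only illustrative):

\set FETCH_COUNT 10000
SELECT * FROM some_large_table;   -- now fetched and printed 10000 rows at a time

I believe the trade-off is that column widths are then computed per batch, so alignment may vary across the output.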
  


-- 
John E. Beaver
Bioinformatics Developer
Harvard Medical School




Re: [PERFORM] sequence scan problem

2008-06-30 Thread John Beaver
Ok, here's the explain analyze result. Again, this is Postgres 8.3.3 and 
I vacuum-analyzed both tables directly after they were created.



# explain analyze select fls.function_verified, fls.score, 
fls.go_category_group_ref, fs1.gene_ref, fs1.function_verified_exactly, 
fs2.gene_ref, fs2.function_verified_exactly from 
functional_linkage_scores fls, gene_prediction_view fs1, 
gene_prediction_view fs2 where fls.gene_ref1 = fs1.gene_ref and 
fls.gene_ref2 = fs2.gene_ref and fs1.go_term_ref = 2 and fs2.go_term_ref 
= 2;
 
                                                                   QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=1399203593.41..6702491234.74 rows=352770803726 width=22) (actual time=6370194.467..22991303.434 rows=15610535128 loops=1)
   Merge Cond: (fs2.gene_ref = fls.gene_ref2)
   ->  Index Scan using gene_prediction_view_gene_ref on gene_prediction_view fs2  (cost=0.00..12111899.77 rows=197899 width=5) (actual time=29.592..469838.583 rows=180629 loops=1)
         Index Cond: (go_term_ref = 2)
   ->  Materialize  (cost=1399069432.20..1483728633.52 rows=6772736105 width=21) (actual time=6370164.864..16623552.417 rows=15610535121 loops=1)
         ->  Sort  (cost=1399069432.20..1416001272.47 rows=6772736105 width=21) (actual time=6370164.860..13081970.248 rows=1897946790 loops=1)
               Sort Key: fls.gene_ref2
               Sort Method:  external merge  Disk: 61192240kB
               ->  Merge Join  (cost=40681244.97..154286110.62 rows=6772736105 width=21) (actual time=592112.778..2043161.851 rows=1897946790 loops=1)
                     Merge Cond: (fs1.gene_ref = fls.gene_ref1)
                     ->  Index Scan using gene_prediction_view_gene_ref on gene_prediction_view fs1  (cost=0.00..12111899.77 rows=197899 width=5) (actual time=0.015..246613.129 rows=180644 loops=1)
                           Index Cond: (go_term_ref = 2)
                     ->  Materialize  (cost=40586010.10..43490582.70 rows=232365808 width=20) (actual time=592112.755..1121366.375 rows=1897946783 loops=1)
                           ->  Sort  (cost=40586010.10..41166924.62 rows=232365808 width=20) (actual time=592112.721..870349.308 rows=232241678 loops=1)
                                 Sort Key: fls.gene_ref1
                                 Sort Method:  external merge  Disk: 7260856kB
                                 ->  Seq Scan on functional_linkage_scores fls  (cost=0.00..3928457.08 rows=232365808 width=20) (actual time=14.221..86455.902 rows=232241678 loops=1)
 Total runtime: 24183346.271 ms
(18 rows)





Jeremy Harris wrote:

John Beaver wrote:
I'm having a strange problem with a query. The query is fairly 
simple, with a few constants and two joins. All relevant columns 
should be indexed, and I'm pretty sure there aren't any type 
conversion issues. But the query plan includes a fairly heavy seq 
scan. The only possible complication is that the tables involved are 
fairly large - hundreds of millions of rows each.


Can anyone explain this? There should only ever be a maximum of about 
50 rows returned when the query is executed.


You didn't say when you last vacuumed?
If there should only be 50 rows returned then the estimates from the
planner are way out.

If that doesn't help, we'll need version info, and (if you can afford
the time) an explain analyze

Cheers,
 Jeremy



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] sequence scan problem

2008-06-30 Thread John Beaver




<chuckle> You're right - for some reason I was looking at the (18
rows) at the bottom. Pilot error indeed - I'll have to figure out
what's going on with my data.

Thanks!

Tom Lane wrote:

  John Beaver <[EMAIL PROTECTED]> writes:
Ok, here's the explain analyze result. Again, this is Postgres 8.3.3 and 
I vacuum-analyzed both tables directly after they were created.

 Merge Join  (cost=1399203593.41..6702491234.74 rows=352770803726
width=22) (actual time=6370194.467..22991303.434 rows=15610535128 loops=1)
                                                 ^^^^^^^^^^^^^^^^

Weren't you saying that only 50 rows should be returned?  I'm thinking
the real problem here is pilot error: you missed out a needed join
condition or something.  SQL will happily execute underconstrained
queries ...

			regards, tom lane

  





Re: [PERFORM] sequence scan problem

2008-06-29 Thread John Beaver

Oh, and the version is 8.3.3.

Jeremy Harris wrote:

John Beaver wrote:
I'm having a strange problem with a query. The query is fairly 
simple, with a few constants and two joins. All relevant columns 
should be indexed, and I'm pretty sure there aren't any type 
conversion issues. But the query plan includes a fairly heavy seq 
scan. The only possible complication is that the tables involved are 
fairly large - hundreds of millions of rows each.


Can anyone explain this? There should only ever be a maximum of about 
50 rows returned when the query is executed.


You didn't say when you last vacuumed?
If there should only be 50 rows returned then the estimates from the
planner are way out.

If that doesn't help, we'll need version info, and (if you can afford
the time) an explain analyze

Cheers,
 Jeremy



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] PgPool parallel query performance rules of thumb

2008-06-04 Thread John Beaver

Hi,
   I'm trying to make use of a cluster of 40 nodes that my group has, 
and I'm curious if anyone has experience with PgPool's parallel query 
mode. Under what circumstances could I expect the most benefit from 
query parallelization as implemented by PgPool?


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] large tables and simple = constant queries using indexes

2008-04-10 Thread John Beaver
Thanks a lot, all of you - this is excellent advice. With the data 
clustered and statistics at a more reasonable value of 100, it now 
reproducibly takes even less time - 20-57 ms per query.


After reading the section on Statistics Used by the Planner in the 
manual, I was a little concerned that, while the statistics sped up the 
queries I tried enormously, the most_common_vals array was where the 
speedup was happening, and that values which wouldn't fit in this array 
wouldn't be sped up. Though I couldn't offhand find an example where this 
occurred, the clustering approach seems intuitively like a much more 
complete and scalable solution, at least for a read-only table like this.


As to whether the entire index/table was getting into ram between my 
statistics calls, I don't think this was the case. Here's the behavior 
that I found:
- With statistics at 10, the query took 25 (or so) seconds no matter how 
many times I tried different values. The query plan was the same as for 
the 200 and 800 statistics below.
- Trying the same constant a second time gave an instantaneous result, 
I'm guessing because of query/result caching.
- Immediately on increasing the statistics to 200, the query reproducibly 
took less time. I tried about 10 different values.
- Immediately on increasing the statistics to 800, the query 
reproducibly took less than a second every time. I tried about 30 
different values.
- Decreasing the statistics to 100 and running the cluster command 
brought it to 57 ms per query.
- The Activity Monitor (OSX) lists the relevant postgres process as 
taking a little less than 500 megs.
- I didn't try decreasing the statistics back to 10 before I ran the 
cluster command, so I can't show the search times going up because of 
that. But I tried killing the 500 meg process. The new process uses less 
than 5 megs of ram, and still reproducibly returns a result in less than 
60 ms. Again, this is with a statistics value of 100 and the data 
clustered by gene_prediction_view_gene_ref_key.


And I'll consider the idea of using triggers with an ancillary table for 
other purposes; seems like it could be a useful solution for something.
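
For reference, if the statistics change was made per column rather than through default_statistics_target, it would look roughly like this - a sketch only, reusing the column and target value mentioned above:

ALTER TABLE gene_prediction_view ALTER COLUMN gene_ref SET STATISTICS 100;
ANALYZE gene_prediction_view;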


Matthew wrote:

On Thu, 10 Apr 2008, PFC wrote:

... Lots of useful advice ...

- If you often query rows with the same gene_ref, consider using 
CLUSTER to physically group those rows on disk. This way you can get 
all rows with the same gene_ref in 1 seek instead of 2000. Clustered 
tables also make Bitmap scan happy.


In my opinion this is the one that will make the most difference. You 
will need to run:


CLUSTER gene_prediction_view USING gene_prediction_view_gene_ref_key;

after you insert significant amounts of data into the table. This 
re-orders the table according to the index, but new data is always 
written out of order, so after adding lots more data the table will 
need to be re-clustered again.
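
Once a table has been clustered this way, the index choice should be remembered, so (if memory serves) the periodic re-cluster can be as simple as:

CLUSTER gene_prediction_view;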


- Switch to a RAID10 (4 times the IOs per second, however zero gain 
if you're single-threaded, but massive gain when concurrent)


Greg Stark has a patch in the pipeline that will change this, for 
bitmap index scans, by using fadvise(), so a single thread can utilise 
multiple discs in a RAID array.


Matthew



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] large tables and simple = constant queries using indexes

2008-04-10 Thread John Beaver




Thanks Eric and Gaetano - interesting, and both examples of my
naiveté. :)

I tried running large select(*) queries on other tables followed by
another try at the offending query, and it was still fast. Just to be
absolutely sure this is a scalable solution, I'll try restarting my
computer in a few hours to see if it affects anything cache-wise.


Gaetano Mendola wrote:

  John Beaver wrote:

- Trying the same constant a second time gave an instantaneous result,
I'm guessing because of query/result caching.

AFAIK no query/result caching is in place in Postgres; what you are experiencing
is OS disk/memory caching.


Regards
Gaetano Mendola


  





[PERFORM] large tables and simple = constant queries using indexes

2008-04-09 Thread John Beaver
Hi, I've started my first project with Postgres (after several years of 
using MySQL), and I'm having an odd performance problem that I was 
hoping someone might be able to explain the cause of.


---My query---
   - select count(*) from gene_prediction_view where gene_ref = 523
   - takes 26 seconds to execute, and returns 2400 (out of a total of 
15 million records in the table)


---My problem---
   Using a single-column index to count 2400 records that all share one 
constant value doesn't sound like something that should take 26 
seconds. What's the slowdown? Any silver bullets that might fix this?


---Steps I've taken---
   - I ran vacuum and analyze
   - I upped the shared_buffers to 58384, and I upped some of the other 
postgresql.conf values as well. Nothing seemed to help significantly, 
but maybe I missed something that would help specifically for this query 
type?
   - I tried to create a hash index, but gave up after more than 4 
hours of waiting for it to finish indexing


---Table stats---
   - 15 million rows; I'm expecting to have four or five times this 
number eventually.

   - 1.5 gigs of hard drive usage

---My development environment---
   - 2.6ghz dual-core MacBook Pro with 4 gigs of ram and a 7200 rpm 
hard drive

   - OS X 10.5.2
   - Postgres 8.3 (installed via MacPorts)

---My table---

CREATE TABLE gene_prediction_view
(
 id serial NOT NULL,
 gene_ref integer NOT NULL,
 go_id integer NOT NULL,
 go_description character varying(200) NOT NULL,
 go_category character varying(50) NOT NULL,
 function_verified_exactly boolean NOT NULL,
 function_verified_with_parent_go boolean NOT NULL,
 function_verified_with_child_go boolean NOT NULL,
 score numeric(10,2) NOT NULL,
 precision_score numeric(10,2) NOT NULL,
 CONSTRAINT gene_prediction_view_pkey PRIMARY KEY (id),
 CONSTRAINT gene_prediction_view_gene_ref_fkey FOREIGN KEY (gene_ref)
 REFERENCES sgd_annotations (id) MATCH SIMPLE
 ON UPDATE NO ACTION ON DELETE NO ACTION,
 CONSTRAINT gene_prediction_view_go_id_fkey FOREIGN KEY (go_id)
 REFERENCES go_terms (term) MATCH SIMPLE
 ON UPDATE NO ACTION ON DELETE NO ACTION,
 CONSTRAINT gene_prediction_view_gene_ref_key UNIQUE (gene_ref, go_id)
)
WITH (OIDS=FALSE);
ALTER TABLE gene_prediction_view OWNER TO postgres;

CREATE INDEX ix_gene_prediction_view_gene_ref
 ON gene_prediction_view
 USING btree
 (gene_ref);
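
For completeness, the quickest way to see where the time is going is to run the query from above under EXPLAIN ANALYZE:

EXPLAIN ANALYZE SELECT count(*) FROM gene_prediction_view WHERE gene_ref = 523;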




--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance