Re: [PERFORM] Does the psql executable support a fetch many approach when dumping large queries to stdout?
Ah, you're right. Thanks Hannu, that's a good solution.

Hannu Krosing wrote:
> On Fri, 2010-04-02 at 16:28 -0400, Beaver, John E wrote:
>> ... I know that the query used here could have been a COPY statement, which I assume would be better-behaved, but I'm more concerned about the case in which the query is more complex.
>
> COPY can copy out results of a SELECT query as well.

-- 
John E. Beaver
Bioinformatics Developer
Harvard Medical School
Re: [PERFORM] Does the psql executable support a fetch many approach when dumping large queries to stdout?
That makes sense. I'll just use a COPY statement instead, as Hannu suggests.

Pierre C wrote:
>> Does the psql executable have any ability to do a "fetch many", using a server-side named cursor, when returning results? It seems like it tries to retrieve the query entirely to local memory before printing to standard out.
>
> I think it grabs the whole result set to calculate the display column widths. I think there is an option to tweak this but don't remember which; have a look at the psql commands (\?), formatting section.

-- 
John E. Beaver
Bioinformatics Developer
Harvard Medical School
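For the archives: the psql setting Pierre is thinking of is most likely FETCH_COUNT, which makes psql retrieve rows through a cursor in batches instead of buffering the whole result set client-side. A minimal sketch of both approaches (the column and table names are taken from this thread and are illustrative):

```sql
-- In a psql session: fetch in batches of 10000 via an implicit cursor,
-- at the cost of column widths being computed per batch.
\set FETCH_COUNT 10000
SELECT * FROM functional_linkage_scores;

-- Or, as Hannu notes, COPY can stream the result of an arbitrary SELECT:
COPY (SELECT gene_ref1, gene_ref2, score
      FROM functional_linkage_scores) TO STDOUT;
```

With FETCH_COUNT set, memory use stays bounded regardless of result-set size; COPY additionally avoids the column-alignment formatting entirely.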
Re: [PERFORM] sequence scan problem
Ok, here's the explain analyze result. Again, this is Postgres 8.3.3 and I vacuumed-analyzed both tables directly after they were created.

# explain analyze select fls.function_verified, fls.score, fls.go_category_group_ref, fs1.gene_ref, fs1.function_verified_exactly, fs2.gene_ref, fs2.function_verified_exactly from functional_linkage_scores fls, gene_prediction_view fs1, gene_prediction_view fs2 where fls.gene_ref1 = fs1.gene_ref and fls.gene_ref2 = fs2.gene_ref and fs1.go_term_ref = 2 and fs2.go_term_ref = 2;

                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=1399203593.41..6702491234.74 rows=352770803726 width=22) (actual time=6370194.467..22991303.434 rows=15610535128 loops=1)
   Merge Cond: (fs2.gene_ref = fls.gene_ref2)
   ->  Index Scan using gene_prediction_view_gene_ref on gene_prediction_view fs2  (cost=0.00..12111899.77 rows=197899 width=5) (actual time=29.592..469838.583 rows=180629 loops=1)
         Index Cond: (go_term_ref = 2)
   ->  Materialize  (cost=1399069432.20..1483728633.52 rows=6772736105 width=21) (actual time=6370164.864..16623552.417 rows=15610535121 loops=1)
         ->  Sort  (cost=1399069432.20..1416001272.47 rows=6772736105 width=21) (actual time=6370164.860..13081970.248 rows=1897946790 loops=1)
               Sort Key: fls.gene_ref2
               Sort Method:  external merge  Disk: 61192240kB
               ->  Merge Join  (cost=40681244.97..154286110.62 rows=6772736105 width=21) (actual time=592112.778..2043161.851 rows=1897946790 loops=1)
                     Merge Cond: (fs1.gene_ref = fls.gene_ref1)
                     ->  Index Scan using gene_prediction_view_gene_ref on gene_prediction_view fs1  (cost=0.00..12111899.77 rows=197899 width=5) (actual time=0.015..246613.129 rows=180644 loops=1)
                           Index Cond: (go_term_ref = 2)
                     ->  Materialize  (cost=40586010.10..43490582.70 rows=232365808 width=20) (actual time=592112.755..1121366.375 rows=1897946783 loops=1)
                           ->  Sort  (cost=40586010.10..41166924.62 rows=232365808 width=20) (actual time=592112.721..870349.308 rows=232241678 loops=1)
                                 Sort Key: fls.gene_ref1
                                 Sort Method:  external merge  Disk: 7260856kB
                                 ->  Seq Scan on functional_linkage_scores fls  (cost=0.00..3928457.08 rows=232365808 width=20) (actual time=14.221..86455.902 rows=232241678 loops=1)
 Total runtime: 24183346.271 ms
(18 rows)

Jeremy Harris wrote:
> John Beaver wrote:
>> I'm having a strange problem with a query. The query is fairly simple, with a few constants and two joins. All relevant columns should be indexed, and I'm pretty sure there aren't any type conversion issues. But the query plan includes a fairly heavy seq scan. The only possible complication is that the tables involved are fairly large - hundreds of millions of rows each. Can anyone explain this? There should only ever be a maximum of about 50 rows returned when the query is executed.
>
> You didn't say when you last vacuumed? If there should only be 50 rows returned then the estimates from the planner are way out. If that doesn't help, we'll need version info, and (if you can afford the time) an explain analyze.
>
> Cheers,
>    Jeremy

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] sequence scan problem
*chuckle* You're right - for some reason I was looking at the (18 rows) at the bottom. Pilot error indeed - I'll have to figure out what's going on with my data. Thanks!

Tom Lane wrote:
> John Beaver [EMAIL PROTECTED] writes:
>> Ok, here's the explain analyze result. Again, this is Postgres 8.3.3 and I vacuumed-analyzed both tables directly after they were created.
>
>> Merge Join (cost=1399203593.41..6702491234.74 rows=352770803726 width=22) (actual time=6370194.467..22991303.434 rows=15610535128 loops=1)
>                                                                                                     ^^^^^^^^^^^^^^^^
> Weren't you saying that only 50 rows should be returned? I'm thinking the real problem here is pilot error: you missed out a needed join condition or something. SQL will happily execute underconstrained queries ...
>
> regards, tom lane
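Tom's point about underconstrained queries is easy to demonstrate: dropping one join condition turns a join into a partial cross product, and the row count explodes silently. A toy illustration (hypothetical temp tables, not from the thread):

```sql
-- Two small tables joined on a shared key.
CREATE TEMP TABLE a (k int);
CREATE TEMP TABLE b (k int);
INSERT INTO a SELECT generate_series(1, 1000);
INSERT INTO b SELECT generate_series(1, 1000);

-- Missing join condition: full cross product, 1,000,000 rows.
SELECT count(*) FROM a, b;

-- With the intended condition: 1,000 rows.
SELECT count(*) FROM a, b WHERE a.k = b.k;
```

The 15.6 billion actual rows in the plan above, against an expected ~50, is exactly this failure mode at scale.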
Re: [PERFORM] sequence scan problem
Oh, and the version is 8.3.3.

Jeremy Harris wrote:
> John Beaver wrote:
>> I'm having a strange problem with a query. The query is fairly simple, with a few constants and two joins. All relevant columns should be indexed, and I'm pretty sure there aren't any type conversion issues. But the query plan includes a fairly heavy seq scan. The only possible complication is that the tables involved are fairly large - hundreds of millions of rows each. Can anyone explain this? There should only ever be a maximum of about 50 rows returned when the query is executed.
>
> You didn't say when you last vacuumed? If there should only be 50 rows returned then the estimates from the planner are way out. If that doesn't help, we'll need version info, and (if you can afford the time) an explain analyze.
>
> Cheers,
>    Jeremy
[PERFORM] PgPool parallel query performance rules of thumb
Hi, I'm trying to make use of a cluster of 40 nodes that my group has, and I'm curious if anyone has experience with PgPool's parallel query mode. Under what circumstances could I expect the most benefit from query parallelization as implemented by PgPool?
Re: [PERFORM] large tables and simple = constant queries using indexes
Thanks a lot, all of you - this is excellent advice. With the data clustered and statistics at a more reasonable value of 100, it now reproducibly takes even less time - 20-57 ms per query.

After reading the section on Statistics Used by the Planner in the manual, I was a little concerned that, while the higher statistics target sped up every query I tried, the speedup was coming from the most_common_vals array, and that values which wouldn't fit in that array wouldn't be sped up. Though I couldn't offhand find an example where this occurred, the clustering approach seems intuitively like a much more complete and scalable solution, at least for a read-only table like this.

As to whether the entire index/table was getting into RAM between my statistics calls, I don't think this was the case. Here's the behavior that I found:
- With statistics at 10, the query took 25 (or so) seconds no matter how many times I tried different values. The query plan was the same as for the 200 and 800 statistics below.
- Trying the same constant a second time gave an instantaneous result, I'm guessing because of query/result caching.
- Immediately on increasing the statistics to 200, the query reproducibly took less time. I tried about 10 different values.
- Immediately on increasing the statistics to 800, the query reproducibly took less than a second every time. I tried about 30 different values.
- Decreasing the statistics to 100 and running the cluster command brought it to 57 ms per query.
- The Activity Monitor (OS X) lists the relevant postgres process as taking a little less than 500 megs.
- I didn't try decreasing the statistics back to 10 before I ran the cluster command, so I can't show the search times going up because of that. But I tried killing the 500-meg process. The new process uses less than 5 megs of RAM, and still reproducibly returns a result in less than 60 ms.
Again, this is with a statistics value of 100 and the data clustered by gene_prediction_view_gene_ref_key. And I'll consider the idea of using triggers with an ancillary table for other purposes; seems like it could be a useful solution for something.

Matthew wrote:
> On Thu, 10 Apr 2008, PFC wrote:
> ... Lots of useful advice ...
>> - If you often query rows with the same gene_ref, consider using CLUSTER to physically group those rows on disk. This way you can get all rows with the same gene_ref in 1 seek instead of 2000. Clustered tables also make Bitmap scan happy.
>
> In my opinion this is the one that will make the most difference. You will need to run:
>
> CLUSTER gene_prediction_view USING gene_prediction_view_gene_ref_key;
>
> after you insert significant amounts of data into the table. This re-orders the table according to the index, but new data is always written out of order, so after adding lots more data the table will need to be re-clustered again.
>
>> - Switch to a RAID10 (4 times the IOs per second, however zero gain if you're single-threaded, but massive gain when concurrent)
>
> Greg Stark has a patch in the pipeline that will change this, for bitmap index scans, by using fadvise(), so a single thread can utilise multiple discs in a RAID array.
>
> Matthew
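Putting the thread's advice together, a post-bulk-load maintenance sequence might look like the following sketch (table, column, and index names are the ones from this thread; the statistics target of 100 is the value John settled on):

```sql
-- Raise the per-column statistics target so the planner sees the
-- skewed gene_ref distribution (takes effect at the next ANALYZE).
ALTER TABLE gene_prediction_view
  ALTER COLUMN gene_ref SET STATISTICS 100;

-- Physically re-order the table by gene_ref so all rows for one gene
-- are contiguous on disk. Note: CLUSTER rewrites the table and takes
-- an exclusive lock, and new rows are written out of order, so this
-- must be repeated after large loads.
CLUSTER gene_prediction_view USING gene_prediction_view_gene_ref_key;

-- Refresh planner statistics for the re-ordered table.
ANALYZE gene_prediction_view;
```

The `CLUSTER table USING index` syntax shown here is the 8.3 form used earlier in the thread.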
Re: [PERFORM] large tables and simple = constant queries using indexes
Thanks Eric and Gaetano - interesting, and both examples of my naiveté. :) I tried running large select count(*) queries on other tables followed by another try at the offending query, and it was still fast. Just to be absolutely sure this is a scalable solution, I'll try restarting my computer in a few hours to see if it affects anything cache-wise.

Gaetano Mendola wrote:
> John Beaver wrote:
>> - Trying the same constant a second time gave an instantaneous result, I'm guessing because of query/result caching.
>
> AFAIK no query/result caching is in place in postgres; what you are experiencing is OS disk/memory caching.
>
> Regards
> Gaetano Mendola
[PERFORM] large tables and simple = constant queries using indexes
Hi, I've started my first project with Postgres (after several years of using Mysql), and I'm having an odd performance problem that I was hoping someone might be able to explain the cause of.

---My query---
select count(*) from gene_prediction_view where gene_ref = 523
- takes 26 seconds to execute, and returns 2400 (out of a total of 15 million records in the table)

---My problem---
Using a single-column index to count 2400 records which are exactly one constant value doesn't sound like something that would take 26 seconds. What's the slowdown? Any silver bullets that might fix this?

---Steps I've taken---
- I ran vacuum and analyze
- I upped the shared_buffers to 58384, and I upped some of the other postgresql.conf values as well. Nothing seemed to help significantly, but maybe I missed something that would help specifically for this query type?
- I tried to create a hash index, but gave up after more than 4 hours of waiting for it to finish indexing

---Table stats---
- 15 million rows; I'm expecting to have four or five times this number eventually.
- 1.5 gigs of hard drive usage

---My development environment---
- 2.6ghz dual-core MacBook Pro with 4 gigs of ram and a 7200 rpm hard drive
- OS X 10.5.2
- Postgres 8.3 (installed via MacPorts)

---My table---
CREATE TABLE gene_prediction_view
(
  id serial NOT NULL,
  gene_ref integer NOT NULL,
  go_id integer NOT NULL,
  go_description character varying(200) NOT NULL,
  go_category character varying(50) NOT NULL,
  function_verified_exactly boolean NOT NULL,
  function_verified_with_parent_go boolean NOT NULL,
  function_verified_with_child_go boolean NOT NULL,
  score numeric(10,2) NOT NULL,
  precision_score numeric(10,2) NOT NULL,
  CONSTRAINT gene_prediction_view_pkey PRIMARY KEY (id),
  CONSTRAINT gene_prediction_view_gene_ref_fkey FOREIGN KEY (gene_ref)
      REFERENCES sgd_annotations (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT gene_prediction_view_go_id_fkey FOREIGN KEY (go_id)
      REFERENCES go_terms (term) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT gene_prediction_view_gene_ref_key UNIQUE (gene_ref, go_id)
)
WITH (OIDS=FALSE);
ALTER TABLE gene_prediction_view OWNER TO postgres;

CREATE INDEX ix_gene_prediction_view_gene_ref
  ON gene_prediction_view
  USING btree (gene_ref);
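A useful first diagnostic step for a slow constant-equality count like this is to look at the actual plan, since the question is whether the planner is using ix_gene_prediction_view_gene_ref at all (names from the message above):

```sql
-- Run before and after any tuning change to see what the planner chose
-- and where the time actually went:
EXPLAIN ANALYZE
SELECT count(*) FROM gene_prediction_view WHERE gene_ref = 523;
```

If the output shows a Seq Scan over the 15-million-row table rather than an index or bitmap scan on gene_ref, the problem is plan choice (often stale or too-coarse statistics); if it shows an index scan that is still slow, the bottleneck is more likely random I/O from rows scattered across the heap, which is what CLUSTER addresses.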