Re: [PERFORM] queries with lots of UNIONed relations

2011-01-14 Thread Vitalii Tymchyshyn

14.01.11 00:26, Tom Lane wrote:

Robert Haas  writes:

On Thu, Jan 13, 2011 at 3:12 PM, Jon Nelson  wrote:

I still think that having UNION do de-duplication of each contributory
relation is a beneficial thing to consider -- especially if postgresql
thinks the uniqueness is not very high.

This might be worth a TODO.

I don't believe there is any case where hashing each individual relation
is a win compared to hashing them all together.  If the optimizer were
smart enough to be considering the situation as a whole, it would always
do the latter.


How about cases where the individual relations are already sorted? Then 
each can be deduplicated quickly and in a streaming manner. Even a 
partial sort order may help, because you would only need to deduplicate 
within groups that share the sorted fields, which takes much less memory 
and is far more stream-friendly. And if all the individual deduplications 
stream out sorted the same way, you can simply do a merge on top.
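
(As a concrete illustration - not from the thread, and with hypothetical
table and column names t1, t2, val - per-relation deduplication can
already be forced by hand by nesting DISTINCT inside the UNION branches:)

-- Each branch is deduplicated first (cheaply, if a sort order already
-- exists), and the outer DISTINCT merges the much smaller survivors.
SELECT DISTINCT val
FROM (
    (SELECT DISTINCT val FROM t1)
    UNION ALL
    (SELECT DISTINCT val FROM t2)
) AS pre_deduped;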


Best regards, Vitalii Tymchyshyn.




Re: [PERFORM] plan question - query with order by and limit not choosing index depends on size of limit, table

2011-01-14 Thread Mike Broers
Thanks Robert, this is what I was looking for.  I will try these suggestions
and follow up if any of them are the silver bullet.

On Fri, Jan 14, 2011 at 7:11 AM, Robert Haas wrote:

> On Thu, Jan 6, 2011 at 4:36 PM, Mike Broers  wrote:
> > Thanks for the assistance.
> > Here is an explain analyze of the query with the problem limit:
> > production=# explain analyze select * from landing_page.messages where
> > ((messages.topic = E'x') AND (messages.processed = 'f'))  ORDER BY
> > messages.created_at ASC limit 10;
> >
> >                                  QUERY PLAN
> > ---------------------------------------------------------------------------
> >  Limit  (cost=0.00..2891.06 rows=10 width=1340) (actual
> > time=207922.586..207922.586 rows=0 loops=1)
> >   ->  Index Scan using idx_landing_page_messages_created_at on messages
> >         (cost=0.00..449560.48 rows=1555 width=1340) (actual time=207922.581..207922.581 rows=0 loops=1)
> >  Filter: ((NOT processed) AND ((topic)::text = 'x'::text))
> >  Total runtime: 207949.413 ms
> > (4 rows)
>
> You're not the first person to have been bitten by this.  The
> optimizer thinks that rows WHERE NOT processed and topic = 'x' are
> reasonably common, so it figures that it can just index scan until it
> finds 10 of them.  But when it turns out that there are none at all,
> it ends up having to scan the entire index, which stinks big-time.
>
> The alternative plan is to use a different index to find ALL the
> relevant rows, sort them, and then take the top 10.   That would suck
> if there actually were tons of rows like this, but there aren't.
>
> So the root of the problem, in some sense, is that the planner's
> estimate of the selectivity of "NOT processed and topic = 'x'" is not
> very good.  Some things to try:
>
> - increase the statistics target for the "processed" and "topic"
> columns even higher
> - put the processed rows in one table and the not processed rows in
> another table
> - do something like SELECT * FROM (SELECT .. LIMIT 200 OFFSET 0) LIMIT
> 10 to try to fool the planner into planning based on the higher, inner
> limit
> - create a partial index on messages (topic) WHERE NOT processed and
> see if the planner will use it
>
> ...Robert
>
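
For concreteness, three of those suggestions might look like the following
against the table from the quoted query (a sketch only; the statistics
target value and the index name are arbitrary choices, not from the thread):

ALTER TABLE landing_page.messages ALTER COLUMN topic SET STATISTICS 1000;
ALTER TABLE landing_page.messages ALTER COLUMN processed SET STATISTICS 1000;
ANALYZE landing_page.messages;

-- The OFFSET 0 fence keeps the planner from flattening the subquery,
-- so it plans around the inner LIMIT of 200 rather than the outer 10.
SELECT * FROM (SELECT * FROM landing_page.messages
                WHERE topic = E'x' AND NOT processed
                ORDER BY created_at ASC LIMIT 200 OFFSET 0) sub
ORDER BY created_at ASC LIMIT 10;

-- Partial index covering exactly the rarely-true predicate:
CREATE INDEX messages_topic_unprocessed_idx
    ON landing_page.messages (topic) WHERE NOT processed;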


Re: [PERFORM] The good, old times

2011-01-14 Thread Mladen Gogala

Craig Ringer wrote:

On 01/12/2011 10:16 PM, Guillaume Cottenceau wrote:

What's your point, and how is it related to this mailing list?



Given the package names, I suspect this is a poorly-expressed complaint 
about the performance of downloads from the pgdg/psqlrpms site. If that 
was the original poster's intent, they would've been better served with 
a post that included some minimal details like:

Yes, it was a complaint about the download speed.



- Information about their local connectivity
- mtr --report / traceroute output
- tests from other available hosts


As for the traceroute information, here it is:
traceroute yum.pgrpms.org
traceroute to yum.pgrpms.org (77.79.103.58), 30 hops max, 40 byte packets
1  216.169.135.254 (216.169.135.254)  0.389 ms  0.404 ms  0.451 ms
2  host189.131.26.216.vmsinfo.com (216.26.131.189)  9.355 ms  9.357 ms  
9.368 ms

3  v11.lc2.lou.peak10.net (216.26.190.10)  9.645 ms  9.645 ms  9.637 ms
4  ge-7-41.car1.Cincinnati1.Level3.net (4.53.64.41)  13.002 ms  13.002 
ms  13.018 ms
5  ae-2-5.bar1.Cincinnati1.Level3.net (4.69.132.206)  13.101 ms  13.098 
ms  13.087 ms
6  ae-10-10.ebr2.Chicago1.Level3.net (4.69.136.214)  22.096 ms  21.358 
ms  21.329 ms
7  ae-1-100.ebr1.Chicago1.Level3.net (4.69.132.41)  27.729 ms  10.812 
ms  24.132 ms
8  ae-2-2.ebr2.NewYork2.Level3.net (4.69.132.66)  34.008 ms  33.960 ms  
34.088 ms
9  ae-1-100.ebr1.NewYork2.Level3.net (4.69.135.253)  34.152 ms  35.353 
ms  37.068 ms
10  ae-4-4.ebr1.NewYork1.Level3.net (4.69.141.17)  36.998 ms  37.248 ms  
36.986 ms
11  ae-43-43.ebr2.London1.Level3.net (4.69.137.73)  107.031 ms 
ae-42-42.ebr2.London1.Level3.net (4.69.137.69)  104.624 ms  107.000 ms
12  ae-2-52.edge4.London1.Level3.net (4.69.139.106)  107.506 ms  106.993 
ms  180.229 ms

13   (195.50.122.174)  168.849 ms  160.917 ms  161.713 ms
14  static.turktelekom.com.tr (212.156.103.42)  176.503 ms  179.012 ms  
179.394 ms
15  gayrettepe-t3-1-gayrettepe-t2-1.turktelekom.com.tr (212.156.118.29)  
167.867 ms  167.870 ms  167.862 ms

16  88.255.240.110 (88.255.240.110)  167.515 ms  168.172 ms  165.829 ms
17  ns1.gunduz.org (77.79.103.58)  171.574 ms !X * *
[mgogala@lpo-postgres-d01 ~]$

Are there any good mirrors?  Apparently, there is something slow in the 
force.





If that wasn't the original poster's intent, perhaps it'd be worth a 
second try to explain what they were *trying* to say? Was it just a joke 
- 'cos if so, it was kinda flat.


--
Craig Ringer




--

Mladen Gogala 
Sr. Oracle DBA

1500 Broadway
New York, NY 10036
(212) 329-5251
http://www.vmsinfo.com 
The Leader in Integrated Media Intelligence Solutions







Re: [PERFORM] plan question - query with order by and limit not choosing index depends on size of limit, table

2011-01-14 Thread Robert Haas
On Fri, Jan 14, 2011 at 11:36 AM, Mike Broers  wrote:
> Thanks Robert, this is what I was looking for.  I will try these suggestions
> and follow up if any of them are the silver bullet.

No problem - and sorry for the off-list reply.  I was a little sleepy
when I wrote that; thanks for getting it back on-list.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PERFORM] "SELECT .. WHERE NOT IN" query running for hours

2011-01-14 Thread Robert Haas
2011/1/10 Mladen Gogala :
> Well, I really hoped that Bruce, Robert or Greg would take on this one, but
> since there are no more qualified takers, I'll take a shot at this one. For
> the "NOT IN (result of a correlated sub-query)", the sub-query needs to be
> executed for every row matching the conditions on the driving table, while
> the !EXISTS is just the complement of a join. It's all in the basic set theory
> which serves as a model for the relational databases.

As Scott says, the real problem is the NULL handling.  The semantics
are otherwise similar.
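
(A constructed example of the NULL difference, for reference - not from
the thread:)

-- One NULL in the subquery's output makes the NOT IN test "unknown"
-- for every outer row, so no row ever qualifies (and the planner
-- cannot use an anti-join plan, hence the hours-long runtimes):
SELECT 1 AS found WHERE 10 NOT IN (SELECT * FROM unnest(ARRAY[1, 2, NULL]));
-- returns zero rows

-- NOT EXISTS treats the NULL row as an ordinary non-match:
SELECT 1 AS found
 WHERE NOT EXISTS (SELECT 1 FROM unnest(ARRAY[1, 2, NULL]) v WHERE v = 10);
-- returns one row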

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PERFORM] Problems with FTS

2011-01-14 Thread Robert Haas
On Tue, Jan 11, 2011 at 3:16 AM, Rauan Maemirov  wrote:
> Hi, Kevin.
> Sorry for long delay.
> EXPLAIN ANALYZE SELECT "v"."id", "v"."title" FROM "video" AS "v"
> WHERE (v.active) AND (v.fts @@
> 'dexter:A|season:A|seri:A|декстер:A|качество:A|сезон:A|серия:A'::tsquery and
> v.id <> 500563 )
> ORDER BY COALESCE(ts_rank_cd( '{0.1, 0.2, 0.7, 1.0}', v.fts,
> 'dexter:A|season:A|seri:A|декстер:A|качество:A|сезон:A|серия:A'::tsquery),
> 1) DESC, v.views DESC
> LIMIT 6
> "Limit  (cost=103975.50..103975.52 rows=6 width=280) (actual
> time=2893.193..2893.199 rows=6 loops=1)"
> "  ->  Sort  (cost=103975.50..104206.07 rows=92228 width=280) (actual
> time=2893.189..2893.193 rows=6 loops=1)"
> "        Sort Key: (COALESCE(ts_rank_cd('{0.1,0.2,0.7,1}'::real[], fts, '( (
> ( ( ( ''dexter'':A | ''season'':A ) | ''seri'':A ) | ''декстер'':A ) |
> ''качество'':A ) | ''сезон'':A ) | ''серия'':A'::tsquery), 1::real)), views"
> "        Sort Method:  top-N heapsort  Memory: 25kB"
> "        ->  Seq Scan on video v  (cost=0.00..102322.34 rows=92228
> width=280) (actual time=0.100..2846.639 rows=54509 loops=1)"
> "              Filter: (active AND (fts @@ '( ( ( ( ( ''dexter'':A |
> ''season'':A ) | ''seri'':A ) | ''декстер'':A ) | ''качество'':A ) |
> ''сезон'':A ) | ''серия'':A'::tsquery) AND (id <> 500563))"
> "Total runtime: 2893.264 ms"
> Table scheme:
> CREATE TABLE video
> (
>   id bigserial NOT NULL,
>   hash character varying(12),
>   account_id bigint NOT NULL,
>   category_id smallint NOT NULL,
>   converted boolean NOT NULL DEFAULT false,
>   active boolean NOT NULL DEFAULT true,
>   title character varying(255),
>   description text,
>   tags character varying(1000),
>   authorized boolean NOT NULL DEFAULT false,
>   adult boolean NOT NULL DEFAULT false,
>   views bigint DEFAULT 0,
>   rating real NOT NULL DEFAULT 0,
>   screen smallint DEFAULT 2,
>   duration smallint,
>   "type" smallint DEFAULT 0,
>   mp4 smallint NOT NULL DEFAULT 0,
>   size bigint,
>   size_high bigint DEFAULT 0,
>   source character varying(255),
>   storage_id smallint NOT NULL DEFAULT 1,
>   rule_watching smallint,
>   rule_commenting smallint,
>   count_comments integer NOT NULL DEFAULT 0,
>   count_likes integer NOT NULL DEFAULT 0,
>   count_faves integer NOT NULL DEFAULT 0,
>   fts tsvector,
>   modified timestamp without time zone NOT NULL DEFAULT now(),
>   created timestamp without time zone DEFAULT now(),
>   CONSTRAINT video_pkey PRIMARY KEY (id),
>   CONSTRAINT video_hash_key UNIQUE (hash)
> )
> WITH (
>   OIDS=FALSE
> );
> Indexes:
> CREATE INDEX idx_video_account_id  ON video  USING btree  (account_id);
> CREATE INDEX idx_video_created  ON video  USING btree  (created);
> CREATE INDEX idx_video_fts  ON video  USING gin  (fts);
> CREATE INDEX idx_video_hash  ON video  USING hash  (hash);
> (here I tried both gist and gin indexes)
> I have 32Gb ram and 2 core quad E5520, 2.27GHz (8Mb cache).
> Pgsql conf:
> max_connections = 200
> shared_buffers = 7680MB
> work_mem = 128MB
> maintenance_work_mem = 1GB
> effective_cache_size = 22GB
> default_statistics_target = 100
> Anything else?

For returning that many rows, an index scan might actually be slower.
Maybe it's worth testing.  Try:

SET enable_seqscan=off;
EXPLAIN ANALYZE ...

and see what you get.  If it's slower, well, then be happy it didn't
use the index (maybe the question is... what index should you have
instead?).  If it's faster, post the results...
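
(Spelled out against the query from this thread - the same statement as
upthread, with the tsquery abbreviated here for brevity:)

SET enable_seqscan = off;
EXPLAIN ANALYZE
SELECT v.id, v.title
  FROM video AS v
 WHERE v.active
   AND v.fts @@ 'dexter:A|season:A|seri:A'::tsquery
   AND v.id <> 500563
 ORDER BY COALESCE(ts_rank_cd('{0.1, 0.2, 0.7, 1.0}', v.fts,
                              'dexter:A|season:A|seri:A'::tsquery), 1) DESC,
          v.views DESC
 LIMIT 6;
RESET enable_seqscan;  -- restore the default for the session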

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[PERFORM] "COPY TO stdout" statements occurrence in log files

2011-01-14 Thread Fernando Mertins
Hi there,
I have one log file per week, logging all statements with execution time
>= 500 ms. But alongside the "normal" statements, entries like these are
showing up:

2011-01-13 00:11:38 BRT LOG:  duration: 2469.000 ms  statement: FETCH 1000
IN bloboid
2011-01-13 00:12:01 BRT LOG:  duration: 797.000 ms  statement: SELECT
tableoid, oid, nspname, (SELECT rolname FROM pg_catalog.pg_roles WHERE oid =
nspowner) as rolname, nspacl FROM pg_namespace
2011-01-13 00:12:06 BRT LOG:  duration: 766.000 ms  statement: COPY
public.log (codlog, matricula, data, descricao, codcurso, ip) WITH OIDS
TO stdout;
2011-01-13 00:12:10 BRT LOG:  duration: 2328.000 ms  statement: FETCH 1000
IN bloboid
2011-01-13 00:12:34 BRT LOG:  duration: 594.000 ms  statement: SELECT
tableoid, oid, nspname, (SELECT rolname FROM pg_catalog.pg_roles WHERE oid =
nspowner) as rolname, nspacl FROM pg_namespace
2011-01-13 00:12:38 BRT LOG:  duration: 672.000 ms  statement: COPY
public.avaliacao_topico_opcao (codavaliacao_topico_opcao,
codavaliacao_topico, descricao, selecao) WITH OIDS TO stdout;
2011-01-13 00:12:39 BRT LOG:  duration: 891.000 ms  statement: COPY
public.log (codlog, matricula, data, descricao, codcurso, ip) WITH OIDS TO
stdout;

Is this normal? I'm concerned because my application doesn't run this kind of
statement, so how can I find out what is issuing these commands? Maybe pg_dump?

Thank you!
Fernando


Re: [PERFORM] queries with lots of UNIONed relations

2011-01-14 Thread Jon Nelson
On Thu, Jan 13, 2011 at 6:10 PM, Tom Lane  wrote:
> Jon Nelson  writes:
>> On Thu, Jan 13, 2011 at 5:05 PM, Tom Lane  wrote:
>>> If you have enough memory to de-dup them individually, you surely have
>>> enough to de-dup all at once.
>
>> If everything were available up-front, sure.
>> However, and please correct me if I'm wrong, but doesn't postgresql
>> work in a fairly linear fashion, moving from table to table performing
>> a series of operations on each?
>
> Doing a single sort+uniq works like that.  But the alternate plan you
> are proposing we should consider involves building all the lower
> hashtables, and then reading from them to fill the upper hashtable.
> Max memory consumption *is* worst case here.  Remember HashAggregate
> is incapable of swapping to disk (and if it did, you wouldn't be nearly
> as pleased with its performance).

That's not exactly what I'm proposing - but that is probably due to my
lack of understanding of some of the underlying details of how postgresql
works. I guess I had assumed that the result of a HashAggregate or any
other de-duplication process was a table-like structure.

Regarding being pleased with hash aggregate - I am! - except when it
goes crazy and eats all of the memory in the machine. I'd trade a bit
of performance loss for not using up all of the memory and crashing.

However, maybe I'm misunderstanding how SELECT DISTINCT works internally.
In the case where a hashtable is used, does postgresql utilize a
table-like structure or does it remain a hashtable in memory?

If it's a hashtable, couldn't the hashtable be built on-the-go rather
than only after all of the underlying tuples are available?

I'd love a bit more explanation as to how this works.

Another example of where this might be useful: I'm currently running
a SELECT DISTINCT query over some 500 million rows (120 contributory
tables). I expect a de-duplicated row count of well under 10% of that
500 million, probably below 1%. The plan as it stands is to execute a
series of sequential scans, appending each of the rows from each
contributory table and then aggregating them. If the expected
distinctness of each contributory subquery is, say, 5% then instead of
aggregating over 500 million tuples the aggregation would take place
over 25 million. In this case, that is a savings of 10 gigabytes,
approximately.
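
(Working backward from those figures, as a sanity check: avoiding roughly
475 million of the 500 million tuples while saving about 10 GB implies a
bit over 20 bytes per tuple, so the estimate is at least self-consistent.)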

Yes, it's true, the same amount of data has to be scanned. However,
the amount of data that needs to be stored (in memory or on disk) in
order to provide a final de-duplication is much smaller.

-- 
Jon



Re: [PERFORM] "COPY TO stdout" statements occurrence in log files

2011-01-14 Thread Maciek Sakrejda
> Is this normal? I'm concerned because my application doesn't run this kind of
> statement, so how can I find out what is issuing these commands? Maybe pg_dump?

I think pg_dump is likely, yes, if you have that scheduled. I don't
think anything in the log file will identify it as pg_dump explicitly
(I believe as far as the server is concerned, pg_dump is just another
client), but if you're concerned about this, you can add the client
pid (%p) to log_line_prefix in postgresql.conf, log the pg_dump pid
through whatever mechanism manages that, and compare.
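
For example, in postgresql.conf (the exact prefix format is a matter of
taste):

log_line_prefix = '%t [%p] '    # timestamp plus backend PID on every line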

---
Maciek Sakrejda | System Architect | Truviso

1065 E. Hillsdale Blvd., Suite 215
Foster City, CA 94404
(650) 242-3500 Main
msakre...@truviso.com
www.truviso.com



[PERFORM] Best way to get the latest revision from a table

2011-01-14 Thread Nikolas Everett
I'm using 8.3 and I have a table that contains many revisions of the same
entity and I have a query that is super slow.  Please help!  I'm going to
paste in some SQL to set up test cases and some plans below.  If that isn't
the right way to post to this list please let me know and I'll revise.

My table looks kind of like this but wider:
CREATE TEMPORARY TABLE test (revision SERIAL NOT NULL PRIMARY KEY, a INTEGER
NOT NULL, b INTEGER NOT NULL, c INTEGER NOT NULL);
INSERT INTO test (a, b, c) SELECT a, 1, 25 FROM generate_series(1, 10)
AS t1(a), generate_series(1, 10) as t2(b);
CREATE INDEX test_a ON test (a);
ANALYZE test;

I need to SELECT all the columns with the latest revision for a subset of
As.  What is the right way to do this quickly?

When I do it like this:
CREATE TEMPORARY TABLE request (a INTEGER NOT NULL);
INSERT INTO request SELECT a FROM generate_series(2, 200) AS t(a);
ANALYZE request;
SELECT *
  FROM request
  JOIN (SELECT a, MAX(b) as b FROM test GROUP BY a) max USING (a)
  JOIN test USING (a, b);
DROP TABLE request;

The plan for the SELECT is pretty bad:
"Hash Join  (cost=32792.50..77907.29 rows=62288 width=20) (actual
time=769.570...050 rows=199 loops=1)"
"  Hash Cond: ((max(pg_temp_7.test.revision)) = pg_temp_7.test.revision)"
"  ->  Hash Join  (cost=5.48..38659.23 rows=62288 width=8) (actual
time=20.621..830.235 rows=199 loops=1)"
"Hash Cond: (pg_temp_7.test.a = request.a)"
"->  GroupAggregate  (cost=0.00..37170.11 rows=62601 width=8)
(actual time=16.847..808.475 rows=10 loops=1)"
"  ->  Index Scan using test_a on test  (cost=0.00..31388.04
rows=12 width=8) (actual time=16.826..569.035 rows=100 loops=1)"
"->  Hash  (cost=2.99..2.99 rows=199 width=4) (actual
time=3.736..3.736 rows=199 loops=1)"
"  ->  Seq Scan on request  (cost=0.00..2.99 rows=199 width=4)
(actual time=3.658..3.689 rows=199 loops=1)"
"  ->  Hash  (cost=15405.12..15405.12 rows=12 width=16) (actual
time=723.673..723.673 rows=100 loops=1)"
"->  Seq Scan on test  (cost=0.00..15405.12 rows=12 width=16)
(actual time=0.006..290.313 rows=100 loops=1)"
"Total runtime: .267 ms"

If I instead issue the query as:
CREATE TEMPORARY TABLE request (a INTEGER NOT NULL, revision INTEGER);
INSERT INTO request SELECT a FROM generate_series(2, 200) AS t(a);
UPDATE request SET revision = (SELECT MAX(revision) FROM test WHERE
request.a = test.a);
ANALYZE request;
SELECT *
  FROM request
  JOIN test USING (revision);
DROP TABLE request;

The whole thing runs tons faster.  The UPDATE uses the right index and is
way sub second and the SELECT's plan is fine:
"Merge Join  (cost=11.66..76.09 rows=199 width=20) (actual time=0.131..0.953
rows=199 loops=1)"
"  Merge Cond: (test.revision = request.revision)"
"  ->  Index Scan using test_pkey on test  (cost=0.00..31388.04 rows=12
width=16) (actual time=0.017..0.407 rows=2001 loops=1)"
"  ->  Sort  (cost=11.59..12.09 rows=199 width=8) (actual time=0.102..0.133
rows=199 loops=1)"
"Sort Key: request.revision"
"Sort Method:  quicksort  Memory: 34kB"
"->  Seq Scan on request  (cost=0.00..3.99 rows=199 width=8) (actual
time=0.020..0.050 rows=199 loops=1)"
"Total runtime: 1.005 ms"

Am I missing something or is this really the best way to do this in 8.3?

Thanks for slogging through all this,

Nik Everett


Re: [PERFORM] "COPY TO stdout" statements occurrence in log files

2011-01-14 Thread Chris Browne
msakre...@truviso.com (Maciek Sakrejda) writes:
>> Is this normal? I'm concerned because my application doesn't run this kind of
>> statement, so how can I find out what is issuing these commands? Maybe pg_dump?
>
> I think pg_dump is likely, yes, if you have that scheduled. I don't
> think anything in the log file will identify it as pg_dump explicitly
> (I believe as far as the server is concerned, pg_dump is just another
> client), but if you're concerned about this, you can add the client
> pid (%p) to log_line_prefix in postgresql.conf, log the pg_dump pid
> through whatever mechanism manages that, and compare.

That's an option...  More are possible...

1.  Our DBAs have been known to create users specifically for doing
backups ("dumpy").  It doesn't seem like a *huge* proliferation of users
to have some 'utility' user names for common processes.

2.  In 9.1, there will be a new answer, as there's a GUC to indicate the
"application_name".
-- 
"Programming today  is a race  between software engineers  striving to
build bigger and better  idiot-proof programs, and the Universe trying
to  produce  bigger  and  better  idiots.  So  far,  the  Universe  is
winning."  -- Rich Cook



Re: [PERFORM] Best way to get the latest revision from a table

2011-01-14 Thread Kevin Grittner
Nikolas Everett  wrote:
 
> Am I missing something or is this really the best way to do this in
> 8.3?
 
How about this?:
 
SELECT y.*
  from (select a, max(revision) as revision
  from test where a between 2 and 200
  group by a) x
  join test y using (a, revision);
 
-Kevin



Re: [PERFORM] Best way to get the latest revision from a table

2011-01-14 Thread Nikolas Everett
On Fri, Jan 14, 2011 at 5:30 PM, Kevin Grittner  wrote:

> SELECT y.*
>  from (select a, max(revision) as revision
>  from test where a between 2 and 200
>  group by a) x
>  join test y using (a, revision);


While certainly simpler than my temp table, this really just exposes a flaw
in my example - I'm really going to be doing this with an arbitrary list of
As.


Re: [PERFORM] "COPY TO stdout" statements occurrence in log files

2011-01-14 Thread Marti Raudsepp
On Fri, Jan 14, 2011 at 23:19, Chris Browne  wrote:
> 2.  In 9.1, there will be a new answer, as there's a GUC to indicate the
> "application_name".

Actually this was already introduced in PostgreSQL 9.0 :)

You can add application_name to your log_line_prefix with %a. For
pg_dump it will display "pg_dump"
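
(So on 9.0 or later, something like this in postgresql.conf tags every
log line with the connecting client's application_name:)

log_line_prefix = '%a [%p] %t '   # application_name, PID, timestamp
                                  # pg_dump's lines then start with "pg_dump"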

Regards,
Marti



Re: [PERFORM] Best way to get the latest revision from a table

2011-01-14 Thread Kevin Grittner
Nikolas Everett  wrote:
 
> I'm really going to be doing this with an arbitrary list of As.
 
OK, how about this?:
 
CREATE TEMPORARY TABLE request (a INTEGER NOT NULL);
INSERT INTO request SELECT a FROM generate_series(2, 200) AS t(a);
ANALYZE request;
SELECT y.*
  from (select a, max(revision) as revision
  from test join request using (a)
  group by a) x
  join test y using (a, revision);
DROP TABLE request;
 
-Kevin



Re: [PERFORM] Best way to get the latest revision from a table

2011-01-14 Thread Shaun Thomas

On 01/14/2011 03:17 PM, Nikolas Everett wrote:


SELECT *
   FROM request
   JOIN (SELECT a, MAX(b) as b FROM test GROUP BY a) max USING (a)
   JOIN test USING (a, b);


This actually looks like a perfect candidate for DISTINCT ON.

SELECT DISTINCT ON (a, b) a, b, revision
  FROM test
 ORDER BY a, b DESC;

Maybe I'm just misunderstanding your situation, though.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
stho...@peak6.com




Re: [PERFORM] Best way to get the latest revision from a table

2011-01-14 Thread Kevin Grittner
Shaun Thomas  wrote:
 
> This actually looks like a perfect candidate for DISTINCT ON.
> 
> SELECT DISTINCT ON (a, b) a, b, revision
>FROM test
>   ORDER BY a, b DESC;
 
I wouldn't say perfect.  It runs about eight times slower than what
I suggested and returns a fairly random value for revision instead
of the max(revision).
 
-Kevin



Re: [PERFORM] Best way to get the latest revision from a table

2011-01-14 Thread Tom Lane
"Kevin Grittner"  writes:
> Shaun Thomas  wrote:
>> This actually looks like a perfect candidate for DISTINCT ON.
>> 
>> SELECT DISTINCT ON (a, b) a, b, revision
>> FROM test
>> ORDER BY a, b DESC;
 
> I wouldn't say perfect.  It runs about eight times slower than what
> I suggested and returns a fairly random value for revision instead
> of the max(revision).

Shaun's example is a bit off: normally, when using DISTINCT ON, you want
an ORDER BY key that uses all the given DISTINCT keys and then some
more.  To get the max revision for each a/b combination it ought to be

SELECT DISTINCT ON (a, b) a, b, revision
   FROM test
   ORDER BY a, b, revision DESC;

As for speed, either one might be faster in a particular situation.

regards, tom lane



Re: [PERFORM] Best way to get the latest revision from a table

2011-01-14 Thread Kevin Grittner
Tom Lane  wrote:
 
> Shaun's example is a bit off
 
> As for speed, either one might be faster in a particular
> situation.
 
After fixing a mistake in my testing and learning from Tom's example
I generated queries against the OP's test data which produce
identical results, and I'm finding no significant difference between
run times for the two versions.  The OP should definitely try both
against the real tables.
 
Here are the queries which run against the test set:
 
DROP TABLE IF EXISTS request;
CREATE TEMPORARY TABLE request (a INTEGER NOT NULL);
INSERT INTO request SELECT a FROM generate_series(2, 200) AS t(a);
ANALYZE request;
SELECT y.*
  from (select a, max(revision) as revision
  from test join request using (a)
  group by a) x
  join test y using (a, revision)
  order by a, revision DESC;
 
DROP TABLE IF EXISTS request;
CREATE TEMPORARY TABLE request (a INTEGER NOT NULL);
INSERT INTO request SELECT a FROM generate_series(2, 200) AS t(a);
ANALYZE request;
SELECT DISTINCT ON (a, b, c) revision, a, b, c
   FROM test join request using (a)
   ORDER BY a, b, c, revision DESC;
 
Sorry for not sorting it out better initially.
 
-Kevin



Re: [PERFORM] Best way to get the latest revision from a table

2011-01-14 Thread Nikolas Everett
On Fri, Jan 14, 2011 at 7:59 PM, Kevin Grittner  wrote:

> Tom Lane  wrote:
>
> > Shaun's example is a bit off
>
> > As for speed, either one might be faster in a particular
> > situation.
>
> After fixing a mistake in my testing and learning from Tom's example
> I generated queries against the OP's test data which produce
> identical results, and I'm finding no significant difference between
> run times for the two versions.  The OP should definitely try both
> against the real tables.
>
> 

> -Kevin
>

After trying both against the real tables DISTINCT ON seems to be about two
orders of magnitude faster than the other options.

Thanks so much!

Nik Everett