Re: [PERFORM] optimizing query with multiple aggregates

2009-10-29 Thread Doug Cole
On Thu, Oct 22, 2009 at 6:22 AM, Kenneth Marshall k...@rice.edu wrote:
 On Wed, Oct 21, 2009 at 03:51:25PM -0700, Doug Cole wrote:
 I have a reporting query that is taking nearly all of it's time in aggregate
 functions and I'm trying to figure out how to optimize it.  The query takes
 approximately 170ms when run with select *, but when run with all the
 aggregate functions the query takes 18 seconds.  The slowness comes from our
 attempt to find distribution data using selects of the form:

 SUM(CASE WHEN field = x AND field  y THEN 1 ELSE 0 END)

 repeated across many different x,y values and fields to build out several
 histograms of the data.  The main culprit appears to be the CASE statement,
 but I'm not sure what to use instead.  I'm sure other people have had
 similar queries and I was wondering what methods they used to build out data
 like this?
 Thanks for your help,
 Doug

 Hi Doug,

 Have you tried using the width_bucket() function? Here is a nice
 article describing its use for making histograms:

 http://quantmeditate.blogspot.com/2005/03/creating-histograms-using-sql-function.html

 Regards,
 Ken


Thanks Ken,
  I ended up going with this approach - it meant I had to break it
into a lot more queries, one for each histogram, but even with that
added overhead I cut the time down from 18 seconds to right around 1
second.
Doug

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] optimizing query with multiple aggregates

2009-10-26 Thread Marc Mamin
Hello,
 
I didn't try it, but following should be slightly faster:
 
COUNT( CASE WHEN field = x AND field  y THEN true END)
intead of 
SUM( CASE WHEN field = x AND field  y THEN 1 ELSE 0 END)
 
HTH,
 
Marc Mamin




From: pgsql-performance-ow...@postgresql.org
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Nikolas
Everett
Sent: Thursday, October 22, 2009 4:48 AM
To: Doug Cole
Cc: pgsql-performance
Subject: Re: [PERFORM] optimizing query with multiple aggregates


So you've got a query like: 

SELECT SUM(CASE WHEN field = 0 AND field  10 THEN 1 ELSE 0 END) as
zeroToTen,
  SUM(CASE WHEN field = 10 AND field  20 THEN 1 ELSE 0
END) as tenToTwenty,
  SUM(CASE WHEN field = 20 AND field  30 THEN 1 ELSE 0
END) as tenToTwenty,
...
FROM  bigtable




My guess is this forcing a whole bunch of if checks and your getting cpu
bound.  Could you try something like:


SELECT SUM(CASE WHEN field = 0 AND field  10 THEN count ELSE 0 END) as
zeroToTen,
  SUM(CASE WHEN field = 10 AND field  20 THEN count ELSE 0
END) as tenToTwenty,
  SUM(CASE WHEN field = 20 AND field  30 THEN count ELSE 0
END) as tenToTwenty,
...
FROM  (SELECT field, count(*) FROM bigtable GROUP BY field)


which will allow a hash aggregate?  You'd do a hash aggregate on the
whole table which should be quick and then you'd summarize your bins.


This all supposes that you don't want to just query postgres's column
statistics.


On Wed, Oct 21, 2009 at 10:21 PM, Doug Cole dougc...@gmail.com wrote:


On Wed, Oct 21, 2009 at 5:39 PM, Merlin Moncure
mmonc...@gmail.com wrote:

 On Wed, Oct 21, 2009 at 6:51 PM, Doug Cole
dougc...@gmail.com wrote:
  I have a reporting query that is taking nearly all of it's
time in aggregate
  functions and I'm trying to figure out how to optimize it.
The query takes
  approximately 170ms when run with select *, but when run
with all the
  aggregate functions the query takes 18 seconds.  The
slowness comes from our
  attempt to find distribution data using selects of the form:
 
  SUM(CASE WHEN field = x AND field  y THEN 1 ELSE 0 END)
 
  repeated across many different x,y values and fields to
build out several
  histograms of the data.  The main culprit appears to be the
CASE statement,
  but I'm not sure what to use instead.  I'm sure other people
have had
  similar queries and I was wondering what methods they used
to build out data
  like this?

 have you tried:

 count(*) where field = x AND field  y;

 ??

 merlin


Unless I'm misunderstanding you, that would require breaking
each bin
into a separate sql statement and since I'm trying to calculate
more
than 100 bins between the different fields any improvement in
the
aggregate functions would be overwhelmed by the cost of the
actual
query, which is about 170ms.
Thanks,
Doug


--
Sent via pgsql-performance mailing list
(pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance





Re: [PERFORM] optimizing query with multiple aggregates

2009-10-22 Thread Robert Haas
On Wed, Oct 21, 2009 at 10:21 PM, Doug Cole dougc...@gmail.com wrote:
 On Wed, Oct 21, 2009 at 5:39 PM, Merlin Moncure mmonc...@gmail.com wrote:

 On Wed, Oct 21, 2009 at 6:51 PM, Doug Cole dougc...@gmail.com wrote:
  I have a reporting query that is taking nearly all of it's time in 
  aggregate
  functions and I'm trying to figure out how to optimize it.  The query takes
  approximately 170ms when run with select *, but when run with all the
  aggregate functions the query takes 18 seconds.  The slowness comes from 
  our
  attempt to find distribution data using selects of the form:
 
  SUM(CASE WHEN field = x AND field  y THEN 1 ELSE 0 END)
 
  repeated across many different x,y values and fields to build out several
  histograms of the data.  The main culprit appears to be the CASE statement,
  but I'm not sure what to use instead.  I'm sure other people have had
  similar queries and I was wondering what methods they used to build out 
  data
  like this?

 have you tried:

 count(*) where field = x AND field  y;

 ??

 merlin

 Unless I'm misunderstanding you, that would require breaking each bin
 into a separate sql statement and since I'm trying to calculate more
 than 100 bins between the different fields any improvement in the
 aggregate functions would be overwhelmed by the cost of the actual
 query, which is about 170ms.

Well, you might be able to use subselects to fetch all the results in
a single query, but it might still be slow.

...Robert

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] optimizing query with multiple aggregates

2009-10-22 Thread Scott Carey



On 10/21/09 3:51 PM, Doug Cole dougc...@gmail.com wrote:

 I have a reporting query that is taking nearly all of it's time in aggregate
 functions and I'm trying to figure out how to optimize it.  The query takes
 approximately 170ms when run with select *, but when run with all the
 aggregate functions the query takes 18 seconds.  The slowness comes from our
 attempt to find distribution data using selects of the form:
 
 SUM(CASE WHEN field = x AND field  y THEN 1 ELSE 0 END)
 
 repeated across many different x,y values and fields to build out several
 histograms of the data.  The main culprit appears to be the CASE statement,
 but I'm not sure what to use instead.  I'm sure other people have had similar
 queries and I was wondering what methods they used to build out data like
 this?

You might be able to do this with plain aggregates.  Define a function that
generates your partitions that you can group by, then aggregate functions
for the outputs

In either case, rather than each result being a column in one result row,
each result will be its own row.

Each row would have a column that defines the type of the result (that you
grouped on), and one with the result value.  If each is just a sum, its
easy.  If there are lots of different calculation types, it would be harder.
Potentially, you could wrap that in a subselect to pull out each into its
own column but that is a bit messy.

Also, in 8.4 window functions could be helpful.  PARTITION BY something that
represents your buckets perhaps?
http://developer.postgresql.org/pgdocs/postgres/tutorial-window.html

This will generally force a sort, but shouldn't be that bad.

The function used for the group by or partition by could just be a big case
statement to generate a unique int per bucket, or a truncate/rounding
function.  It just needs to spit out a unique result for each bucket for the
group or partition.


 Thanks for your help,
 Doug
 


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] optimizing query with multiple aggregates

2009-10-21 Thread Doug Cole
I have a reporting query that is taking nearly all of it's time in aggregate
functions and I'm trying to figure out how to optimize it.  The query takes
approximately 170ms when run with select *, but when run with all the
aggregate functions the query takes 18 seconds.  The slowness comes from our
attempt to find distribution data using selects of the form:

SUM(CASE WHEN field = x AND field  y THEN 1 ELSE 0 END)

repeated across many different x,y values and fields to build out several
histograms of the data.  The main culprit appears to be the CASE statement,
but I'm not sure what to use instead.  I'm sure other people have had
similar queries and I was wondering what methods they used to build out data
like this?
Thanks for your help,
Doug


Re: [PERFORM] optimizing query with multiple aggregates

2009-10-21 Thread Merlin Moncure
On Wed, Oct 21, 2009 at 6:51 PM, Doug Cole dougc...@gmail.com wrote:
 I have a reporting query that is taking nearly all of it's time in aggregate
 functions and I'm trying to figure out how to optimize it.  The query takes
 approximately 170ms when run with select *, but when run with all the
 aggregate functions the query takes 18 seconds.  The slowness comes from our
 attempt to find distribution data using selects of the form:

 SUM(CASE WHEN field = x AND field  y THEN 1 ELSE 0 END)

 repeated across many different x,y values and fields to build out several
 histograms of the data.  The main culprit appears to be the CASE statement,
 but I'm not sure what to use instead.  I'm sure other people have had
 similar queries and I was wondering what methods they used to build out data
 like this?

have you tried:

count(*) where field = x AND field  y;

??

merlin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] optimizing query with multiple aggregates

2009-10-21 Thread Doug Cole
On Wed, Oct 21, 2009 at 5:39 PM, Merlin Moncure mmonc...@gmail.com wrote:

 On Wed, Oct 21, 2009 at 6:51 PM, Doug Cole dougc...@gmail.com wrote:
  I have a reporting query that is taking nearly all of it's time in aggregate
  functions and I'm trying to figure out how to optimize it.  The query takes
  approximately 170ms when run with select *, but when run with all the
  aggregate functions the query takes 18 seconds.  The slowness comes from our
  attempt to find distribution data using selects of the form:
 
  SUM(CASE WHEN field = x AND field  y THEN 1 ELSE 0 END)
 
  repeated across many different x,y values and fields to build out several
  histograms of the data.  The main culprit appears to be the CASE statement,
  but I'm not sure what to use instead.  I'm sure other people have had
  similar queries and I was wondering what methods they used to build out data
  like this?

 have you tried:

 count(*) where field = x AND field  y;

 ??

 merlin

Unless I'm misunderstanding you, that would require breaking each bin
into a separate sql statement and since I'm trying to calculate more
than 100 bins between the different fields any improvement in the
aggregate functions would be overwhelmed by the cost of the actual
query, which is about 170ms.
Thanks,
Doug

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] optimizing query with multiple aggregates

2009-10-21 Thread David Wilson
On Wed, Oct 21, 2009 at 6:51 PM, Doug Cole dougc...@gmail.com wrote:


 repeated across many different x,y values and fields to build out several
 histograms of the data.  The main culprit appears to be the CASE statement,
 but I'm not sure what to use instead.  I'm sure other people have had
 similar queries and I was wondering what methods they used to build out data
 like this?


Use group by with an appropriate division/rounding to create the appropriate
buckets, if they're all the same size.

select round(field/100) as bucket, count(*) as cnt from foo group by
round(field/100);

-- 
- David T. Wilson
david.t.wil...@gmail.com