Re: [GENERAL] Q:Aggregating Weekly Production Data. How do you do it?

2007-09-19 Thread Scott Marlowe
On 9/18/07, Ow Mun Heng [EMAIL PROTECTED] wrote:
 On Tue, 2007-09-18 at 09:55 -0500, Scott Marlowe wrote:
  On 9/17/07, Ow Mun Heng [EMAIL PROTECTED] wrote:
 
   Just wondering how everyone is doing aggregation of production data.
 
  Where I work, we aggregate by the minute in the application, then dump
  those stats entries into the database en masse from each machine.  So,
 [snip]
  The fail1/2/3 columns are types of failures; totalresptime is the
  response time of all the successful requests added together, so that
  totalresptime/success = the average for that minute, and maxresptime
  is the longest-running request ending in that minute.

 So, it's basically a summation rather than using avg() all the way.
 That would mean using something like bigint to deal with the large
 numbers.

Yeah.  We use numeric(18) (bonus points if anyone knows what other
database I'm pumping data out of to get a numeric(18) as a default.)

The advantage is that I don't have to re-weight my measurements.  When
the system was first designed, it stored avg response and number of
responses, which I had to then multiply to get the original total time
and process to get a weighted average.  This method removes the
averaging step in the client software AND the multiplying step to get
the raw numbers back on my end, win-win.
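
For instance, with a per-minute stats table along the lines of the one
shown later in the thread (call it minute_stats, with ts, success and
totalresptime columns -- the names here are illustrative only), the
average over any window is just a sum divided by a sum:

SELECT sum(totalresptime)::numeric / nullif(sum(success), 0) AS avg_resp
FROM minute_stats
WHERE ts >= now() - interval '1 hour';   -- no re-weighting step needed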

   (getting min/max/count isn't much of an issue. Stdev is the main issue I
   believe)
  
   One such instance I've read about is..
 
  Isn't stddev() working for you?

 Stdev is based on a population of data. If I aggregate in slices and
 then want the stdev across 2 months (8 weeks, e.g.), it would be wrong.

But if you ask for the data set grouped by
date_trunc('minute', timestamp) in an inner select, then in the outer
select you can do a stddev() on that set and get it.  Or is that not
accurate?
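
Something like this, for instance (table and column names are
illustrative; minute_stats stands for the per-minute stats table):

SELECT stddev(avg_resp) AS stddev_of_minute_averages
FROM (
    SELECT date_trunc('minute', ts) AS minute,
           sum(totalresptime)::numeric / nullif(sum(success), 0) AS avg_resp
    FROM minute_stats
    WHERE ts >= now() - interval '2 months'
    GROUP BY 1
) AS per_minute;

The outer stddev() is then taken over the per-minute averages rather
than over the raw rows.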

  What I do is aggregate the minute time slices by grouping by
  date_trunc('xxx',timestamp) and then use that as a subselect to a
  query that does the stddev() outside of that.  Works pretty well for
  us, and makes it easy to identify trends.

 Trending analysis is very critical and the only reference I've found on
 how to get stdev is based on what I posted.

Well, what we're doing seems to be getting us what look like proper numbers.

  One of the things we started doing is to aggregate the last six weeks
  data by the day / hour and then comparing the last 24 hours worth of
  data to those six weeks worth to see trends for each hour of each day.
   The queries are huge and ugly, but they work, and run in about 4
  minutes on a fairly good sized chunk of data.  We have about 150k to
  300k entries a day put into this db.

 I'm not sure how many K entries in a day (yet to count it) but I'm
 getting into trouble w/ one-to-many relationships and PG choosing to
 do nested loops etc. (lots of left joins on the same tables)

Yeah, for this kind of thing, you want to pre-process your tables into
one big flat table if you can.  We do have one to many relationships
within the same table (parent / child system requests are one to many)
and when we join the table to itself several times it can get hairy.
We had some nested loop problems until I upped the stats target on a
few of the columns and scheduled regular analyzes for the main table.
Now the system picks the right plan 99.9% or more of the time.  And
when it doesn't, it's usually something fresh and new that's the
issue.
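
For reference, a sketch of that tuning (the table, column and target
value are placeholders, not the actual schema):

ALTER TABLE minute_stats ALTER COLUMN host SET STATISTICS 500;  -- raise from the default
ANALYZE minute_stats;                                           -- refresh planner statistics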

 So, I've had to resort to using SRFs and function scans. A 2-hour
 complex query (on my laptop in PG; it runs in 20 min on high-end MSSQL)
 is reduced to ~60 secs.

That's an excellent improvement!

 Thanks for the response.

 BTW, are user queries an issue?  Do they interfere with the loading?

Not usually. I've got them trained pretty well, and the system has
lots more I/O bandwidth than CPU horsepower, so generally a long
running query just waits on I/O.  RAID-10 has made this system much
faster than the old single drive workstation (mine) it was running on.

The data loading on this system runs once a minute and it pumps the
data out of the production db, and puts it into the reporting db.
Average run time to insert those rows is under 1 second.

 (That's one of my concerns over here). I noticed that you do the calc
 at the app before sending it to the DB, which is a good thing, and every
 minute too. (Is it advisable to do it in chunks of 1 min? Less data per
 minute vs. per 30 min/1 hour.)

Exactly.  Each minute is easy to keep track of, and if the app goes
crazy for a bit and loses the data / gets restarted, you only lose a
minute or so's worth of data.

Also, we use this system for real time alerting and performance
monitoring.  If some request or other starts taking too long or the
failure rate shoots up, it generates alerts.  If the data were collected
only every 30 minutes or every hour, that would be far too late.


Re: [GENERAL] Q:Aggregating Weekly Production Data. How do you do it?

2007-09-18 Thread John D. Burger

Ow Mun Heng wrote:

The results are valid (verified with actual data) but I don't understand
the logic. All the statistics books I've read give stdev as
sqrt(sum(x - ave(x))^2 / (n - 1)). The formula is very different, hence
the confusion.


A formula is not an algorithm.  In particular, the naive way of
calculating variance or standard deviation has massive numerical
instability problems - anything involving sums of squares does.
There are a variety of alternate algorithms for stddev/variance; I
presume your other algorithm is similarly trying to avoid these same
issues (but I have not looked closely at it).  You can also see
Wikipedia for one of the best known, due to Knuth/Welford:


  http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
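
For the curious, a minimal PL/pgSQL sketch of the Welford update rule,
looping over a hypothetical sample_data table (purely illustrative, not
how the built-in aggregates are implemented):

CREATE TABLE sample_data (x double precision);  -- hypothetical raw data

CREATE OR REPLACE FUNCTION welford_stddev_samp() RETURNS double precision AS $$
DECLARE
    n    bigint := 0;
    mean double precision := 0;
    m2   double precision := 0;   -- running sum of squared deviations
    d    double precision;
    r    record;
BEGIN
    FOR r IN SELECT x FROM sample_data WHERE x IS NOT NULL LOOP
        n    := n + 1;
        d    := r.x - mean;
        mean := mean + d / n;             -- update the running mean
        m2   := m2 + d * (r.x - mean);    -- uses deviations from old and new mean
    END LOOP;
    IF n < 2 THEN
        RETURN NULL;
    END IF;
    RETURN sqrt(m2 / (n - 1));            -- sample standard deviation
END;
$$ LANGUAGE plpgsql;

The point is that it never forms the large sum-of-squares terms that
cause the instability.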

- John D. Burger
  MITRE



Re: [GENERAL] Q:Aggregating Weekly Production Data. How do you do it?

2007-09-18 Thread Ron Johnson

On 09/17/07 23:34, Ow Mun Heng wrote:
 Just wondering how everyone is doing aggregation of production data.
 
 Data which runs in the vicinity of a few million a week.
 
 What are the methods which will effectively provide the
 min/max/average/count/stdev of the weekly sample size based on different
 products/software mix etc.
 
 and still be able to answer correctly, what's the average of data_1 over
 the past 2 months?

That's the purpose of data warehouses and ETL, isn't it?

--
Ron Johnson, Jr.
Jefferson LA  USA

Give a man a fish, and he eats for a day.
Hit him with a fish, and he goes away for good!



Re: [GENERAL] Q:Aggregating Weekly Production Data. How do you do it?

2007-09-18 Thread Scott Marlowe
On 9/17/07, Ow Mun Heng [EMAIL PROTECTED] wrote:

 Just wondering how everyone is doing aggregation of production data.

Where I work, we aggregate by the minute in the application, then dump
those stats entries into the database en masse from each machine.  So,
we wind up with rows something like this: (simplified)

 id | parentid | host    | timestamp           | request   | total | success | fail1 | fail2 | fail3 | totalresptime | maxresptime
----+----------+---------+---------------------+-----------+-------+---------+-------+-------+-------+---------------+------------
  1 | NULL     | jboss01 | 2007-09-17 12:02:03 | authcc    |   120 |     112 |     4 |     2 |     2 |            48 |       12039
  2 | 1        | jboss01 | 2007-09-17 12:02:03 | ccconnect |   120 |     118 |     0 |     0 |     2 |        423900 |       10394

where the id comes from a sequence, and parentid ties our child stats
to their parents, i.e. in this example authcc called ccconnect, and so
on.  The fail1/2/3 columns are types of failures; totalresptime is the
response time of all the successful requests added together, so that
totalresptime/success = the average for that minute, and maxresptime is
the longest-running request ending in that minute.

Then we can aggregate those minutes together, monitor individual
machine performance, etc.  Ours is actually more complex than this,
but you get the idea.
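
A rough DDL sketch of a table holding rows like the above (the types
are guesses, the real schema is more complex, and the timestamp column
is called ts here to keep it easy to query):

CREATE TABLE minute_stats (
    id            bigserial PRIMARY KEY,
    parentid      bigint REFERENCES minute_stats (id),
    host          text,
    ts            timestamp,          -- the "timestamp" column above
    request       text,
    total         integer,
    success       integer,
    fail1         integer,
    fail2         integer,
    fail3         integer,
    totalresptime numeric(18),        -- summed response time of successful requests
    maxresptime   numeric(18)
);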

We have a cron job that checks the statistics every x minutes for high
failure rates and generates an alert email if any of our requests go
over a preset threshold.  This usually catches problems long before
anything interesting shows up in the logs.
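
The check itself can be a simple aggregate over the last few minutes; a
sketch, with a made-up threshold and the illustrative minute_stats table
from above:

SELECT request,
       sum(fail1 + fail2 + fail3)::numeric / nullif(sum(total), 0) AS failure_rate
FROM minute_stats
WHERE ts >= now() - interval '15 minutes'
GROUP BY request
HAVING sum(fail1 + fail2 + fail3)::numeric / nullif(sum(total), 0) > 0.05;

Any rows returned would trigger the alert email.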

 (getting min/max/count isn't much of an issue. Stdev is the main issue I
 believe)

 One such instance I've read about is..

Isn't stddev() working for you?

What I do is aggregate the minute time slices by grouping by
date_trunc('xxx',timestamp) and then use that as a subselect to a
query that does the stddev() outside of that.  Works pretty well for
us, and makes it easy to identify trends.

One of the things we started doing is to aggregate the last six weeks
data by the day / hour and then comparing the last 24 hours worth of
data to those six weeks worth to see trends for each hour of each day.
 The queries are huge and ugly, but they work, and run in about 4
minutes on a fairly good sized chunk of data.  We have about 150k to
300k entries a day put into this db.
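
A much-simplified sketch of the shape of such a comparison, again using
the illustrative minute_stats table:

SELECT b.dow, b.hr,
       b.avg_fails AS six_week_baseline,
       r.avg_fails AS last_24_hours
FROM (SELECT extract(dow FROM ts)  AS dow,
             extract(hour FROM ts) AS hr,
             avg(fail1 + fail2 + fail3) AS avg_fails
      FROM minute_stats
      WHERE ts >= now() - interval '6 weeks'
      GROUP BY 1, 2) b
JOIN (SELECT extract(dow FROM ts)  AS dow,
             extract(hour FROM ts) AS hr,
             avg(fail1 + fail2 + fail3) AS avg_fails
      FROM minute_stats
      WHERE ts >= now() - interval '24 hours'
      GROUP BY 1, 2) r ON (b.dow = r.dow AND b.hr = r.hr)
ORDER BY b.dow, b.hr;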


Re: [GENERAL] Q:Aggregating Weekly Production Data. How do you do it?

2007-09-18 Thread btober

Ow Mun Heng wrote:
 Data which runs in the vicinity of a few million a week.

 What are the methods which will effectively provide the
 min/max/average/count/stdev of the weekly sample size based on different
 products/software mix etc.

 and still be able to answer correctly, what's the average of data_1 over
 the past 2 months?


So, are you viewing the problem as one of figuring out how to avoid
having to store all this raw data permanently, but still have the
statistical summary values available without having to recalculate
them each time?




 (I can't just take the average of the 8 weekly averages.)

 eg:
 wk   avg data_1
 w1 - 2
 ...average of past 2 months = ave(w1-w8) which is statistically wrong.


One approach would be to save the count of values as well as the 
average. Then your eight-week average is calculated as a weighted 
average, i.e., each week's average is weighted by the corresponding 
count of data values:


CREATE TABLE summary_data
(
summary_date timestamp,
num_of_values integer,
avg_of_values numeric
);

Then, for instance, the eight-week average is computed as

SELECT
   SUM(num_of_values * avg_of_values) / SUM(num_of_values) AS eight_week_avg
FROM summary_data
WHERE summary_date BETWEEN now() - interval '8 weeks' AND now();
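
Each weekly summary row could be produced with something along these
lines (raw_data(ts, data_1) is a hypothetical detail table):

INSERT INTO summary_data (summary_date, num_of_values, avg_of_values)
SELECT date_trunc('week', ts) AS summary_date,
       count(data_1)          AS num_of_values,
       avg(data_1)            AS avg_of_values
FROM raw_data
WHERE ts >= date_trunc('week', now()) - interval '1 week'
  AND ts <  date_trunc('week', now())      -- the week that just ended
GROUP BY 1;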




 (getting min/max/count isn't much of an issue. Stdev is the main issue I
 believe)

You probably need to store each of those weekly summary statistic values 
too, at least for min and max, i.e.,


CREATE TABLE summary_data
(
summary_date timestamp,
num_of_values integer,
avg_of_values numeric,
min_of_values numeric,
max_of_values numeric
);

Then, for instance, overall basic statistics are computed as

SELECT
   SUM(num_of_values) AS total_count,
   SUM(num_of_values * avg_of_values) / SUM(num_of_values) AS overall_avg,
   MIN(min_of_values) AS overall_min,
   MAX(max_of_values) AS overall_max
FROM summary_data
WHERE summary_date BETWEEN now() - interval '8 weeks' AND now();


Extending this design to include the variance is more complicated. 
While you can compute the average for the past eight weeks using a 
weighted average of each of the separate eight weeks, I think you 
actually need the raw data values for the whole eight weeks to figure 
the standard deviation -- I don't readily see how you could without them. 
A different possibility would be to maintain a running average and 
variance (rather than windowing those statistics for the sliding 
eight-week period), and then take a weekly snapshot of the running 
values to use for trending.



 One such instance I've read about is..

 1. Calculate the sum of the squares of each sale
 ...
 8. Stdev will be the square root of step 7

 The results are valid (verified with actual data) but I don't understand
 the logic. All the statistics books I've read give stdev as
 sqrt(sum(x - ave(x))^2 / (n - 1)). The formula is very different, hence
 the confusion.

I think you copied the std dev formula incorrectly, but the eight-step 
algorithm can be shown to get you to the same point, after a little 
analytical manipulation. The advantage of one over the other is that the 
eight-step procedure can be used to implement an on-the-fly calculation 
of mean and variance, i.e., you can maintain a running average and 
update the statistics as you collect more data, without having to keep 
the entire detailed data history, as would be required by an attempt 
to implement the definition directly.
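
A sketch of that running-sums idea (the names are illustrative): keep
the count, the sum and the sum of squares per week, and the mean and
sample standard deviation for any span of weeks can be recombined from
those three numbers, using the same sum-of-squares identity as the
eight-step procedure:

CREATE TABLE weekly_summary (
    week_start date PRIMARY KEY,
    n          bigint,    -- number of raw values in the week
    sum_x      numeric,   -- sum of the raw values
    sum_x2     numeric    -- sum of the squared raw values
);

SELECT sum(sum_x) / sum(n) AS overall_avg,
       sqrt( (sum(n) * sum(sum_x2) - sum(sum_x) ^ 2)
             / nullif(sum(n) * (sum(n) - 1), 0) ) AS overall_stddev
FROM weekly_summary
WHERE week_start >= current_date - interval '8 weeks';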


Re: [GENERAL] Q:Aggregating Weekly Production Data. How do you do it?

2007-09-18 Thread Ow Mun Heng
On Tue, 2007-09-18 at 09:55 -0500, Scott Marlowe wrote:
 On 9/17/07, Ow Mun Heng [EMAIL PROTECTED] wrote:
 
  Just wondering how everyone is doing aggregation of production data.
 
 Where I work, we aggregate by the minute in the application, then dump
 those stats entries into the database en masse from each machine.  So,
[snip]
 The fail1/2/3 columns are types of failures; totalresptime is the
 response time of all the successful requests added together, so that
 totalresptime/success = the average for that minute, and maxresptime
 is the longest-running request ending in that minute.

So, it's basically a summation rather than using avg() all the way.
That would mean using something like bigint to deal with the large
numbers.
 
  (getting min/max/count isn't much of an issue. Stdev is the main issue I
  believe)
 
  One such instance I've read about is..
 
 Isn't stddev() working for you?

Stdev is based on a population of data. If I aggregate in slices and
then want the stdev across 2 months (8 weeks, e.g.), it would be wrong.

 
 What I do is aggregate the minute time slices by grouping by
 date_trunc('xxx',timestamp) and then use that as a subselect to a
 query that does the stddev() outside of that.  Works pretty well for
 us, and makes it easy to identify trends.

Trending analysis is very critical and the only reference I've found on
how to get stdev is based on what I posted.


 One of the things we started doing is to aggregate the last six weeks
 data by the day / hour and then comparing the last 24 hours worth of
 data to those six weeks worth to see trends for each hour of each day.
  The queries are huge and ugly, but they work, and run in about 4
 minutes on a fairly good sized chunk of data.  We have about 150k to
 300k entries a day put into this db.

I'm not sure how many K entries in a day (yet to count it) but I'm
getting into trouble w/ one-to-many relationships and PG choosing to
do nested loops etc. (lots of left joins on the same tables)

So, I've had to resort to using SRFs and function scans. A 2-hour
complex query (on my laptop in PG; it runs in 20 min on high-end MSSQL)
is reduced to ~60 secs.


Thanks for the response.

BTW, are user queries an issue? Do they interfere with the loading?
(That's one of my concerns over here). I noticed that you do the calc
at the app before sending it to the DB, which is a good thing, and every
minute too. (Is it advisable to do it in chunks of 1 min? Less data per
minute vs. per 30 min/1 hour.)


[GENERAL] Q:Aggregating Weekly Production Data. How do you do it?

2007-09-17 Thread Ow Mun Heng
Just wondering how everyone is doing aggregation of production data.

Data which runs in the vicinity of a few million a week.

What are the methods which will effectively provide the
min/max/average/count/stdev of the weekly sample size based on different
products/software mix etc.

and still be able to answer correctly, what's the average of data_1 over
the past 2 months?

(I can't just take the average of the 8 weekly averages.)

eg:
wk   avg data_1
w1 - 2
w2 - 2
w3 - 2
w4 - 3
w5 - 1
w6 - 2
w7 - 2
w8 - 2
average of past 2 months = ave(w1-w8) which is statistically wrong.

Using the sum of data_1 per week would work, though. Please share your
expertise / experience.

(getting min/max/count isn't much of an issue. Stdev is the main issue I
believe)

One such instance I've read about is..

From this website (it references SQL Server Analysis Services, but I
think the concept is the same):

http://www.phptr.com/articles/printerfriendly.asp?p=337135&rl=1

1. Calculate the sum of the squares of each sale
2. Multiply the result of step 1 by the sales count
3. Sum all sales
4. Square the result of step 3
5. Subtract the result of step 4 from the result of step 2
6. Multiply the sales count by one less than the sales count (sales_count * (sales_count - 1))
7. Divide the result of step 5 by the result of step 6
8. Stdev will be the square root of step 7

The results are valid (verified with actual data) but I don't understand
the logic. All the statistics books I've read give stdev as
sqrt(sum(x - ave(x))^2 / (n - 1)). The formula is very different, hence
the confusion.
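
For what it's worth, the eight steps translate directly into SQL and
should agree with the built-in aggregate; a quick check against the
example weekly numbers above:

SELECT stddev_samp(x) AS builtin,
       sqrt(                                             -- step 8
            (count(*) * sum(x ^ 2) - sum(x) ^ 2)         -- steps 1-5
            / (count(*) * (count(*) - 1))                -- steps 6-7
       ) AS eight_step_form
FROM (VALUES (2.0), (2.0), (2.0), (3.0), (1.0), (2.0), (2.0), (2.0)) AS d(x);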

