[PERFORM] inconsistent/weird index usage

2004-10-01 Thread Dustin Sallings
To save some time, let me start by saying
PostgreSQL 7.4.3 on powerpc-apple-darwin7.4.0, compiled by GCC gcc 
(GCC) 3.3 20030304 (Apple Computer, Inc. build 1640)

OK, now on to details...
	I'm trying to implement oracle style ``partitions'' in postgres.  I've 
run into my first snag on what should be a fairly quick query.  
Basically, I started with the following schema and split the 
``samples'' table into one table for each year (1999-2004).

-- BEGIN SCHEMA
create table sensor_types (
sensor_type_id serial,
sensor_type text not null,
units varchar(10) not null,
primary key(sensor_type_id)
);
create table sensors (
sensor_id serial,
sensor_type_id integer not null,
serial char(16) not null,
name text not null,
low smallint not null,
high smallint not null,
active boolean default true,
primary key(sensor_id),
foreign key(sensor_type_id) references sensor_types(sensor_type_id)
);
create unique index sensors_byserial on sensors(serial);
create table samples (
ts datetime not null,
sensor_id integer not null,
sample float not null,
foreign key(sensor_id) references sensors(sensor_id)
);
create index samples_bytime on samples(ts);
create unique index samples_bytimeid on samples(ts, sensor_id);
-- END SCHEMA
	Each samples_[year] table looks, and is indexed, exactly as the above 
samples table was, using the following commands:

create index samples_1999_bytime on samples_1999(ts);
create index samples_2000_bytime on samples_2000(ts);
create index samples_2001_bytime on samples_2001(ts);
create index samples_2002_bytime on samples_2002(ts);
create index samples_2003_bytime on samples_2003(ts);
create index samples_2004_bytime on samples_2004(ts);
create unique index samples_1999_bytimeid on samples_1999(ts, sensor_id);
create unique index samples_2000_bytimeid on samples_2000(ts, sensor_id);
create unique index samples_2001_bytimeid on samples_2001(ts, sensor_id);
create unique index samples_2002_bytimeid on samples_2002(ts, sensor_id);
create unique index samples_2003_bytimeid on samples_2003(ts, sensor_id);
create unique index samples_2004_bytimeid on samples_2004(ts, sensor_id);

The tables contain the following number of rows:

samples_1999    311030
samples_2000   2142245
samples_2001   2706571
samples_2002   3111602
samples_2003   3149316
samples_2004   2375972
	The following view creates the illusion of the old ``single-table'' 
model:

create view samples as
select * from samples_1999
 union  select * from samples_2000
 union  select * from samples_2001
 union  select * from samples_2002
 union  select * from samples_2003
 union  select * from samples_2004
	...along with the following rule on the view for the applications 
performing inserts:

create rule sample_rule as on insert to samples
  do instead
  insert into samples_2004 (ts, sensor_id, sample)
values(new.ts, new.sensor_id, new.sample)
	OK, now that that's over with, I have this one particular query that I 
attempt to run for a report from my phone that no longer works because 
it tries to do a table scan on *some* of the tables.  Why it chooses 
this table scan, I can't imagine.  The query is as follows:

select
s.serial as serial_num,
s.name as name,
date(ts) as day,
min(sample) as min_temp,
avg(sample) as avg_temp,
stddev(sample) as stddev_temp,
max(sample) as max_temp
from
samples inner join sensors s using (sensor_id)
where
ts > current_date - 7
group by
serial_num, name, day
order by
serial_num, day desc
	explain analyze reports the following (sorry for the horrible 
wrapping):

 Sort  (cost=1185281.45..1185285.95 rows=1800 width=50) (actual time=82832.106..82832.147 rows=56 loops=1)
   Sort Key: s.serial, date(samples.ts)
   ->  HashAggregate  (cost=1185161.62..1185184.12 rows=1800 width=50) (actual time=82830.624..82831.601 rows=56 loops=1)
         ->  Hash Join  (cost=1063980.21..1181539.96 rows=206952 width=50) (actual time=80408.123..81688.590 rows=66389 loops=1)
               Hash Cond: ("outer".sensor_id = "inner".sensor_id)
               ->  Subquery Scan samples  (cost=1063979.10..1155957.38 rows=4598914 width=20) (actual time=80392.477..80922.764 rows=66389 loops=1)
                     ->  Unique  (cost=1063979.10..1109968.24 rows=4598914 width=20) (actual time=80392.451..80646.761 rows=66389 loops=1)
                           ->  Sort  (cost=1063979.10..1075476.39 rows=4598914 width=20) (actual time=80392.437..80442.787 rows=66389 loops=1)
                                 Sort Key: ts, sensor_id, sample
                                 ->  Append  (cost=0.00..312023.46 rows=4598914 width=20) (actual time=79014.428..80148.396 rows=66389 loops=1)
                                       ->  Subquery Scan "*SELECT* 1"  (cost=0.00..9239.37 rows=103677 width=20) (actual time=4010.181..4010.181 rows=0 loops=1)
 

Re: [PERFORM] inconsistent/weird index usage

2004-10-01 Thread John Meinel
Dustin Sallings wrote:

[...]
OK, now that that's over with, I have this one particular query that 
I attempt to run for a report from my phone that no longer works because 
it tries to do a table scan on *some* of the tables.  Why it chooses 
this table scan, I can't imagine.  The query is as follows:

select
s.serial as serial_num,
s.name as name,
date(ts) as day,
min(sample) as min_temp,
avg(sample) as avg_temp,
stddev(sample) as stddev_temp,
max(sample) as max_temp
from
samples inner join sensors s using (sensor_id)
where
ts > current_date - 7
group by
serial_num, name, day
order by
serial_num, day desc

[ next section heavily clipped for clarity ]
->  Seq Scan on samples_1999  (cost rows=103677) (actual rows=0 loops=1)
->  Index Scan using samples_2000_bytime on samples_2000  (cost rows=714082) (actual rows=0 loops=1)
->  Seq Scan on samples_2001  (cost rows=902191) (actual rows=0 loops=1)
->  Seq Scan on samples_2002  (cost rows=1037201) (actual rows=0 loops=1)
->  Index Scan using samples_2003_bytime on samples_2003  (cost rows=1049772) (actual rows=0 loops=1)
->  Index Scan using samples_2004_bytime on samples_2004  (cost rows=791991) (actual rows=66389 loops=1)

[...]

Essentially, what you can see here is that it's doing an index scan 
on samples_2000, samples_2003, and samples_2004, but a sequential scan 
on samples_1999, samples_2001, and samples_2002.  It's very strange to 
me that it would make these choices.  If I disable sequential scans 
altogether for this session, the query runs in under 4 seconds.
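
(For reference, "disable sequential scans altogether for this session" is just 
the planner toggle below -- handy as a diagnostic, but not something I'd want 
to leave on permanently:

set enable_seqscan = off;
-- run the report query / explain analyze here, then put it back
set enable_seqscan = on;
)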

This is a very cool solution for long-term storage, and isn't 
terribly hard to manage.  I actually have other report queries that seem 
to be making pretty good index selection currently...but I always want 
more!  :)  Does anyone have any suggestions as to how to get this to do 
what I want?

Of course, ideally, it would ignore five of the tables altogether.  :)
--
SPY  My girlfriend asked me which one I like better.
pub  1024/3CAE01D5 1994/11/03 Dustin Sallings [EMAIL PROTECTED]
|Key fingerprint =  87 02 57 08 02 D0 DA D6  C8 0F 3E 65 51 98 D8 BE
L___ I hope the answer won't upset her. 

Just as a heads up. You have run vacuum analyze before running this 
query, correct?

Because you'll notice that the query planner is thinking that it will 
have 103677 rows from 1999, 700,000 rows from 2000, 900,000 rows from 
2001, etc., etc. Obviously the query planner is not planning well, 
considering there are only 60,000 rows from 2004 and no rows from 
anything else.

It just seems like it hasn't updated its statistics, and so isn't aware 
of what time range each of the tables actually covers.
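
A minimal example of what I mean, using the table names from your post -- 
analyze the partitions the planner is most wrong about and see whether the 
estimates come down:

VACUUM ANALYZE samples_1999;
VACUUM ANALYZE samples_2001;
VACUUM ANALYZE samples_2002;
-- then re-run EXPLAIN ANALYZE on the report query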

(By the way, an indexed scan returning 0 entries is *really* fast, so I 
wouldn't worry about ignoring the extra tables. :)

I suppose the other question is whether this is a prepared or stored 
query. Because sometimes the query planner cannot do enough optimization 
in a stored query. (I ran into this problem where I had 1 column with 
500,000+ entries referencing 1 number. If I ran manually, the time was 
much better because I wasn't using *that* number. With a stored query, 
it had to take into account that I *might* use that number, and didn't 
want to do 500,000+ indexed lookups)
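
To make that concrete (just a sketch -- the literal date is only 
current_date - 7 written out by hand, assuming today is 2004-10-01):

PREPARE recent_samples(timestamp) AS
  SELECT count(*) FROM samples WHERE ts > $1;
EXECUTE recent_samples(current_date - 7);              -- planned once, parameter value unknown at plan time
SELECT count(*) FROM samples WHERE ts > '2004-09-24';  -- planned with the literal in hand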

The only other thing I can think of is that there might be some 
collision between datetime and date. Like it is thinking it is looking 
at the time of day when it plans the queries (hence why so many rows), 
but really it is looking at the date. Perhaps a cast is in order to make 
it work right. I don't really know.
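
(Something along these lines is what I mean, though it's only a guess:

where ts > cast(current_date - 7 as timestamp)
)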

Interesting problem, though.
John
=:-




Re: [PERFORM] inconsistent/weird index usage

2004-10-01 Thread Tom Lane
Dustin Sallings [EMAIL PROTECTED] writes:
   The following view creates the illusion of the old ``single-table'' 
 model:

 create view samples as
  select * from samples_1999
   union  select * from samples_2000
   union  select * from samples_2001
   union  select * from samples_2002
   union  select * from samples_2003
   union  select * from samples_2004

You really, really, really want to use UNION ALL not UNION here.

   OK, now that that's over with, I have this one particular query that I 
 attempt to run for a report from my phone that no longer works because 
 it tries to do a table scan on *some* of the tables.  Why it chooses 
 this table scan, I can't imagine.

Most of the problem here comes from the fact that current_date - 7
isn't reducible to a constant and so the planner is making bad guesses
about how much of each table will be scanned.  If possible, do the date
arithmetic on the client side and send over a simple literal constant.
If that's not practical you can fake it with a mislabeled IMMUTABLE
function --- see the list archives for previous discussions of the
same issue.
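
For the archives, the usual workaround looks roughly like this (the function 
name is invented here; note that because it lies about being IMMUTABLE, a plan 
that is kept around across days will keep the stale date):

create function seven_days_ago() returns date as
  'select current_date - 7'
  language sql immutable;

-- and then write the condition as:  where ts > seven_days_ago()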

regards, tom lane



Re: [PERFORM] inconsistent/weird index usage

2004-10-01 Thread Richard Huxton
Dustin Sallings wrote:
The following view creates the illusion of the old ``single-table'' 
model:

create view samples as
select * from samples_1999
 union  select * from samples_2000
 union  select * from samples_2001
 union  select * from samples_2002
 union  select * from samples_2003
 union  select * from samples_2004
Try this with UNION ALL (you know there won't be any duplicates) and 
possibly with some limits too:

SELECT * FROM samples_1999 WHERE ts BETWEEN '1999-01-01 00:00:00+00' AND 
'1999-12-31 11:59:59+00'
UNION ALL ...

select
s.serial as serial_num,
s.name as name,
date(ts) as day,
min(sample) as min_temp,
avg(sample) as avg_temp,
stddev(sample) as stddev_temp,
max(sample) as max_temp
from
samples inner join sensors s using (sensor_id)
where
ts > current_date - 7
group by
serial_num, name, day
order by
serial_num, day desc
Try restricting the timestamp too:
WHERE
  ts BETWEEN (current_date - 7) AND current_timestamp
Hopefully that will give the planner enough smarts to know it can skip 
most of the sample_200x tables.
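
Putting the two suggestions together, the view could be rewritten roughly as 
below (after dropping the old one). The per-table ranges are only illustrative, 
and whether the 7.4 planner is actually smart enough to skip the out-of-range 
tables is exactly what you'd want to test:

create view samples as
          select * from samples_1999 where ts < '2000-01-01'
union all select * from samples_2000 where ts >= '2000-01-01' and ts < '2001-01-01'
union all select * from samples_2001 where ts >= '2001-01-01' and ts < '2002-01-01'
union all select * from samples_2002 where ts >= '2002-01-01' and ts < '2003-01-01'
union all select * from samples_2003 where ts >= '2003-01-01' and ts < '2004-01-01'
union all select * from samples_2004 where ts >= '2004-01-01';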

--
  Richard Huxton
  Archonet Ltd


Re: [PERFORM] Caching of Queries

2004-10-01 Thread Jim C. Nasby
On Fri, Oct 01, 2004 at 06:43:42AM +0100, Matt Clark wrote:
 
 If you're not using a connection pool of some kind then you might as
 well forget query plan caching, because your connect overhead will swamp
 the planning cost. This does not mean you have to use something like
 pgpool (which makes some rather questionable claims IMO); any decent web
 application language/environment will support connection pooling.
  
 
 Hmm, a question of definition -  there's a difference between a pool and 
 a persistent connection.  Pretty much all web apps have one connection 
 per process, which is persistent (i.e. not dropped and remade for each 
 request), but not shared between processes, therefore not pooled.

OK, that'd work too... the point is if you're re-connecting all the time
it doesn't really matter what else you do for performance.
-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED] 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?



Re: [PERFORM] Caching of Queries

2004-10-01 Thread Matt Clark
 OK, that'd work too... the point is if you're re-connecting 
 all the time it doesn't really matter what else you do for 
 performance.

Yeah, although there is the chap who was asking questions on the list
recently who had some very long-running code on his app servers, so was best
off closing the connection because he had far too many postmaster processes
just sitting there idle all the time!

But you're right, it's a killer usually.

M 




Re: [PERFORM] inconsistent/weird index usage

2004-10-01 Thread Josh Berkus
Tom,

 Most of the problem here comes from the fact that current_date - 7
 isn't reducible to a constant and so the planner is making bad guesses
 about how much of each table will be scanned.  

I thought this was fixed in 7.4.  No?

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



Re: [PERFORM] inconsistent/weird index usage

2004-10-01 Thread Tom Lane
Josh Berkus [EMAIL PROTECTED] writes:
 Most of the problem here comes from the fact that current_date - 7
 isn't reducible to a constant and so the planner is making bad guesses
 about how much of each table will be scanned.  

 I thought this was fixed in 7.4.  No?

No.  It's not fixed as of CVS tip either, although there was some talk
of doing something in time for 8.0.

regards, tom lane



Re: [PERFORM] Caching of Queries

2004-10-01 Thread Aaron Werman
The context of the discussion was a hack to speed queries against static
tables, so MVCC is not relevant.  As soon as any work unit against a
referenced table commits, the cache is invalid, and in fact the table
shouldn't be a candidate for this caching for a while. In fact, this cache
would reduce some of the MVCC 'select count(*) from us_states' type of horrors.

(The attraction of a server side cache is obviously that it could *with no
server or app changes* dramatically improve performance. A materialized view
is a specialized denormalization-ish mechanism to optimize a category of
queries and requires the DBA to sweat the details. It is very hard to cache
things stochastically without writing a server. Trigger-managed extracts
won't help you execute 1,000 programs issuing the query "select sec_level
from sec where division=23" each second, or a big table loaded monthly.)



- Original Message - 
From: Jeff [EMAIL PROTECTED]
To: Mitch Pirtle [EMAIL PROTECTED]
Cc: Aaron Werman [EMAIL PROTECTED]; Scott Kirkwood
[EMAIL PROTECTED]; Neil Conway [EMAIL PROTECTED];
[EMAIL PROTECTED]; Tom Lane [EMAIL PROTECTED]
Sent: Monday, September 27, 2004 2:25 PM
Subject: Re: [PERFORM] Caching of Queries


 [ discussion of server side result caching ]

 and let's not forget the major fork PG will throw into things: MVCC.
 The results of query A may hold true for txn 1, but not txn 2, and so on.
 That would have to be taken into account as well and would greatly
 complicate things.

 It is always possible to do a poor man's query cache with triggers..
 which would just leave you with basically a materialized view.
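
 A bare-bones sketch of that trigger-kept poor man's materialized view (all
 names below are invented, reusing the count(*)-style example for brevity):

 create table us_states_count (n bigint);
 insert into us_states_count select count(*) from us_states;

 create function refresh_us_states_count() returns trigger as '
 begin
   update us_states_count set n = (select count(*) from us_states);
   return null;
 end;
 ' language plpgsql;

 create trigger us_states_count_trg
   after insert or update or delete on us_states
   for each statement execute procedure refresh_us_states_count();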

 --
 Jeff Trout [EMAIL PROTECTED]
 http://www.jefftrout.com/
 http://www.stuarthamm.net/






Re: [PERFORM] Caching of Queries

2004-10-01 Thread Josh Berkus
People:

Transparent query caching is the industry standard for how these things 
are handled.   However, Postgres' lack of this feature has made me consider 
other approaches, and I'm starting to wonder if the standard query caching 
-- where a materialized query result, or some reduction thereof, is cached in 
database memory -- isn't the best way to cache things.  I'm going to 
abbreviate it SQC for the rest of this e-mail.

Obviously, the draw of SQC is its transparency to developers.   With it, the 
Java/Perl/PHP programmers and the DBA don't have to communicate at all -- you 
set it up, give it some RAM, and it just works.   As someone who frequently 
has to consult based on limited knowledge, I can understand the appeal.

However, one of the problems with SQC, aside from the ones already mentioned 
of stale data and/or cache-clearing, is that (at least in applications like 
MySQL's) it is indiscriminate and caches, at least briefly, unique queries as 
readily as common ones.   Possibly Oracle's implementation is more 
sophisticated; I've not had an opportunity.

The other half of that problem is that an entire query is cached, rather than 
just the relevant data to uniquely identify the request to the application.
This is bad in two respects: one, that the entire query needs to be parsed to 
see if a new query is materially equivalent; and two, that two materially 
different queries which could utilize overlapping ranges of the same 
underlying result set must instead cache their results separately, eating up 
yet more memory.

To explain what I'm talking about, let me give you a counter-example of 
another approach.

I have a data-warehousing application with a web front-end. The data in the 
application is quite extensive and complex, and only a summary is presented 
to the public users -- but that summary is a query involving about 30 lines 
and 16 joins.  This summary information is available in 3 slightly different 
forms.  Further, the client has indicated that an up to 1/2 hour delay in 
data freshness is acceptable.

The first step is forcing that materialized view of the data into memory.  
Right now I'm working on a reliable way to do that without using Memcached, 
which won't install on our Solaris servers.  Temporary tables have the 
annoying property of being per-connection, which doesn't work in a pool of 60 
connections.

The second step, which I completed first due to the lack of technical 
obstacles, is to replace all queries against this data with calls to a 
Set-Returning Function (SRF).   This allowed me to re-direct where the data 
was coming from -- presumably the same thing could be done through RULES, but 
it would have been considerably harder to implement.

The first thing the SRF does is check the criteria passed to it against a set 
of cached (in a table) criteria with that user's permission level which is 
< 1/2 hour old.   If the same criteria are found, then the SRF gets back a set 
of row identifiers for the materialized view (MV), looks up those rows 
in the MV, and returns them to the web client.

If no identical set of criteria are found, then the query is run to get a set 
of identifiers which are then cached, and the SRF returns the queried rows.
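
In very rough outline, the lookup logic is something like the sketch below. 
This is a toy, not the production code: every name is invented, 
run_real_query() stands in for the 30-line/16-join query, and the real version 
keys on the user's permission level as well:

create table summary_mv (          -- stand-in for the materialized summary
    id      int primary key,
    payload text
);

create table cached_criteria (     -- one row per criteria set seen recently
    criteria  text,
    cached_at timestamp,
    ids       int[]
);

create function get_summary(p_criteria text) returns setof summary_mv as '
declare
    v_ids  int[];
    r      summary_mv%rowtype;
begin
    -- is there a fresh (< 1/2 hour) cached identifier set for these criteria?
    select into v_ids ids from cached_criteria
     where criteria = p_criteria
       and cached_at > now() - interval ''30 minutes'';

    if v_ids is null then
        -- cache miss: run the expensive query and remember the row ids
        v_ids := run_real_query(p_criteria);   -- hypothetical helper
        insert into cached_criteria values (p_criteria, now(), v_ids);
    end if;

    -- cheap part: fetch just those rows from the materialized view
    for r in select * from summary_mv where id = any (v_ids) loop
        return next r;
    end loop;
    return;
end;
' language plpgsql;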

Once I surmount the problem of storing all the caching information in 
protected memory, the advantages of this approach over SQC are several:

1) The materialized data is available in 3 different forms: a list, a detail 
view, and a spreadsheet.  Each form has somewhat different columns and 
different rules about ordering, which would likely confuse an SQC planner.   
In this implementation, all 3 forms are able to share the same cache.

2) The application is comparing only sets of unambiguous criteria rather than 
long queries which would need to be compared in planner form in order to 
determine query equivalence. 

3) With the identifier sets, we are able to cache other information as well, 
such as a count of rows, further limiting the number of queries we must run.

4) This approach is ideally suited to the pagination and re-sorting common to 
a web result set.  As only the identifiers are cached, the results can be 
re-sorted and broken into pages after the cache read, a fast, all-in-memory 
operation.

In conclusion, what I'm saying is that while forms of transparent query 
caching (plan, materialized or whatever) may be desirable for other reasons, 
it's quite possible to achieve a superior level of query caching through 
tight integration with the front-end application.   

If people are interested in this, I'd love to see some suggestions on ways to 
force the materialized view into dedicated memory.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco



[PERFORM] Slow update/insert process

2004-10-01 Thread Patrick Hatcher

Pg: 7.4.5
RH 7.3
8g Ram
200 g drive space
RAID0+1
Tables vacuumed on a nightly basis

The process below takes 8 hours to run on 90k records and I'm not sure
where to begin to look for the bottleneck. This isn't the only update on
this database that seems to take a long time to complete. Is there
something I should be looking for in my conf settings?

TIA
Patrick


SQL:
---Bring back only selected records to run through the update process.
--Without the function the SQL takes < 10 secs to return 90,000 records
SELECT count(pm.pm_delta_function_amazon(upc.keyp_upc,'amazon'))
FROM mdc_upc upc
JOIN public.mdc_products prod ON upc.keyf_products
= prod.keyp_products
JOIN public.mdc_price_post_inc price
ON prod.keyp_products = price.keyf_product
JOIN public.mdc_attribute_product ap
on ap.keyf_products = prod.keyp_products and keyf_attribute=22
WHERE 
upper(trim(ap.attributevalue)) NOT IN
('ESTEE LAUDER', 'CLINIQUE','ORGINS','PRESCRIPTIVES','LANCÔME','CHANEL','ARAMIS','M.A.C','TAG HEUER')
AND keyf_producttype <> 222
AND prod.action_publish = 1;


Function:

CREATE OR REPLACE FUNCTION pm.pm_delta_function_amazon(int4, varchar)
 RETURNS bool AS
'DECLARE
  varkeyf_upc       ALIAS FOR $1;
  varPassword       ALIAS FOR $2;
  varRealMD5        varchar;
  varDeltaMD5       varchar;
  varLastTouchDate  date;
  varQuery          text;
  varQuery1         text;
  varQueryMD5       text;
  varQueryRecord    record;
  varFuncStatus     boolean := false;

BEGIN

-- Check the password
 IF varPassword <> \'amazon\' THEN
  Return false;
 END IF;


-- Get the md5 hash for this product
 SELECT into varQueryRecord md5(upc.keyp_upc || prod.description
|| pm.pm_price_post_inc(prod.keyp_products)) AS md5
  FROM public.mdc_upc upc
  JOIN public.mdc_products prod ON upc.keyf_products = prod.keyp_products
  JOIN public.mdc_price_post_inc price ON price.keyf_product
= prod.keyp_products
  WHERE upc.keyp_upc = varkeyf_upc LIMIT 1 ;
 
 
 IF NOT FOUND THEN
  RAISE EXCEPTION \'varRealMD5 is NULL. UPC ID is %\', varkeyf_upc;
 ELSE
  varRealMD5:=varQueryRecord.md5;
 END IF;

-- Check that the product is in the delta table and return its hash
for comparison 
 SELECT into varQueryRecord md5_hash,last_touch_date 
  FROM pm.pm_delta_master_amazon
  WHERE keyf_upc = varkeyf_upc LIMIT 1;

 IF NOT FOUND THEN
  -- ADD and exit
  INSERT INTO pm.pm_delta_master_amazon (keyf_upc,status,md5_hash,last_touch_date)
  values (varkeyf_upc,\'add\',varRealMD5,CURRENT_DATE);
  varFuncStatus:=true;
  RETURN varFuncStatus;
 ELSE
  --Update the record 
   --- If the hash matches then set the record to HOLD
  IF varRealMD5 = varQueryRecord.md5_hash THEN
UPDATE pm.pm_delta_master_amazon
 SET status= \'hold\',
 last_touch_date = CURRENT_DATE
WHERE keyf_upc = varkeyf_upc
AND last_touch_date <> CURRENT_DATE; 
varFuncStatus:=true;
  ELSE
-- ELSE mark the item as ADD
UPDATE pm.pm_delta_master_amazon
 SET status= \'add\',
 last_touch_date = CURRENT_DATE
WHERE keyf_upc = varkeyf_upc;
varFuncStatus:=true;
  END IF;
  END IF;

 RETURN varFuncStatus;
END;'
 LANGUAGE 'plpgsql' IMMUTABLE;



TableDef
CREATE TABLE pm.pm_delta_master_amazon
(
  keyf_upc         int4,
  status           varchar(6),
  md5_hash         varchar(40),
  last_touch_date  date
)
GO

CREATE INDEX status_idx
  ON pm.pm_delta_master_amazon(status)
GO





CONF

# WRITE AHEAD LOG
#---------------------------------------------------------------------------

# - Settings -

#fsync = true                   # turns forced synchronization on or off
#wal_sync_method = fsync        # the default varies across platforms:
                                # fsync, fdatasync, open_sync, or open_datasync
wal_buffers = 32                # min 4, 8KB each

# - Checkpoints -

checkpoint_segments = 50        # in logfile segments, min 1, 16MB each
checkpoint_timeout = 600        # range 30-3600, in seconds
#checkpoint_warning = 30        # 0 is off, in seconds
#commit_delay = 0               # range 0-100000, in microseconds
#commit_siblings = 5            # range 1-1000


Patrick Hatcher
Macys.Com


Re: [PERFORM] Slow update/insert process

2004-10-01 Thread Aaron Werman



Some quick notes:

- Using a side effect of a function to update the database feels bad to me
- how long does the SELECT into varQueryRecord md5(upc.keyp... function take / what does its explain look like?
- There are a lot of non-indexed columns on that delta master table, such as keyf_upc.
  I'm guessing you're doing 90,000 x {a lot of slow scans}
- My temptation would be to rewrite the processing to do a pass of updates, a pass of inserts, and then the SELECT (see the sketch below)
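
Roughly what I have in mind for that last point, reusing the table and column
names from the post. This is untested, drops the password check, and keeps the
same per-row behaviour as the function (the outer query's product-type/publish
filters would go into the temp-table SELECT):

-- compute the current hashes once, set-wise
create temp table cur_hash as
  select upc.keyp_upc as keyf_upc,
         md5(upc.keyp_upc || prod.description ||
             pm.pm_price_post_inc(prod.keyp_products)) as md5_hash
    from public.mdc_upc upc
    join public.mdc_products prod on upc.keyf_products = prod.keyp_products;

-- pass 1: unchanged rows go to hold
update pm.pm_delta_master_amazon
   set status = 'hold', last_touch_date = current_date
  from cur_hash c
 where pm_delta_master_amazon.keyf_upc = c.keyf_upc
   and pm_delta_master_amazon.md5_hash = c.md5_hash;

-- pass 2: changed rows get re-flagged as add
update pm.pm_delta_master_amazon
   set status = 'add', last_touch_date = current_date
  from cur_hash c
 where pm_delta_master_amazon.keyf_upc = c.keyf_upc
   and pm_delta_master_amazon.md5_hash <> c.md5_hash;

-- pass 3: brand-new rows get inserted
insert into pm.pm_delta_master_amazon (keyf_upc, status, md5_hash, last_touch_date)
select c.keyf_upc, 'add', c.md5_hash, current_date
  from cur_hash c
 where not exists (select 1 from pm.pm_delta_master_amazon m
                    where m.keyf_upc = c.keyf_upc);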



Re: [PERFORM] Caching of Queries

2004-10-01 Thread Aaron Werman
I'm not sure I understand your req fully. If the same request is repeatedly
done with same parameters, you could implement a proxy web server with a
croned script to purge stale pages. If there is substantially the same data
being summarized, doing your own summary tables works; if accessed enough,
they're in memory. I interleaved some notes into your posting.
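
A sketch of the do-it-yourself summary table, with invented names (orders /
order_summary), refreshed from cron at whatever staleness the application can
live with:

create table order_summary as
  select region, count(*) as n_orders, sum(amount) as total
    from orders
   group by region;

-- cron'ed psql job, e.g. every 30 minutes:
begin;
  delete from order_summary;
  insert into order_summary
    select region, count(*), sum(amount) from orders group by region;
commit;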

- Original Message - 
From: Josh Berkus [EMAIL PROTECTED]
To: Postgresql Performance [EMAIL PROTECTED]
Sent: Friday, October 01, 2004 1:10 PM
Subject: Re: [PERFORM] Caching of Queries

 People:

 Transparent query caching is the industry standard for how these things
 are handled.   However, Postgres' lack of this feature has made me consider
 other approaches, and I'm starting to wonder if the standard query caching
 -- where a materialized query result, or some reduction thereof, is cached in
 database memory -- isn't the best way to cache things.  I'm going to
 abbreviate it SQC for the rest of this e-mail.

 Obviously, the draw of SQC is its transparency to developers.   With it, the
 Java/Perl/PHP programmers and the DBA don't have to communicate at all -- you
 set it up, give it some RAM, and it just works.   As someone who frequently
 has to consult based on limited knowledge, I can understand the appeal.

My sense is that pg is currently unique among popular dbmses in having the
majority of applications being homegrown (a chicken / egg / advocacy issue -
if I install a CMS, I'm not the DBA or the PHP programmer - and I don't want
to change the code; we'll see more about this when native WinPg happens).





 However, one of the problems with SQC, aside from the ones already mentioned
 of stale data and/or cache-clearing, is that (at least in applications like
 MySQL's) it is indiscriminate and caches, at least briefly, unique queries as
 readily as common ones.   Possibly Oracle's implementation is more
 sophisticated; I've not had an opportunity.

I'm not sure I agree here. Stale data and caching choice are
optimizer/buffer manager choices and implementation can decide whether to
allow stale data. These are design choices involving development effort and
choices of where to spend server cycles and memory. All buffering choices
cache unique objects, I'm not sure why this is bad (but sensing you want
control of the choices). FWIW, this is my impression of other dbmses.

In MySQL, a global cache can be specified with size and globally, locally,
or through statement hints in queries to suggest caching results. I don't
believe that these could be used as common subexpressions (with an exception
of MERGE table component results). The optimizer knows nothing about the
cached results - SQL select statements are hashed, and can be replaced by
the cached statement/results on a match. The optimizer knows nothing about the

In DB2 and Oracle result sets are not cached. They have rich sets of
materialized view features (that match your requirements). They allow a
materialized view to be synchronous with table updates or asynchronous.
Synchronous is often an unrealistic option, and asynchronous materialized
views are refreshed at a specified schedule. The optimizers allow query
rewrite (in Oracle it is a session option) so one can connect to the
database and specify that the optimizer is allowed to replace subexpressions
with data from (possibly stale) materialized views. SQL Server 2K has more
restrictive synchronous MVs, but I've never used them.

So, in your example use in Oracle, you would need to define appropriate MVs
with a ½ hour refresh frequency, and hope that the planner would use them in
your queries. The only change in the app is on connection you would allow
use of asynchronous stale data.

You're suggesting an alternative involving identifying common, but
expensive, subexpressions and generating MVs for them. This is a pretty
sophisticated undertaking, and probably requires some theory research to
determine if it's viable.



 The other half of that problem is that an entire query is cached, rather than
 just the relevant data to uniquely identify the request to the application.
 This is bad in two respects: one, that the entire query needs to be parsed to
 see if a new query is materially equivalent; and two, that two materially
 different queries which could utilize overlapping ranges of the same
 underlying result set must instead cache their results separately, eating up
 yet more memory.

There are two separate issues. The cost of parse/optimization and the cost
of results retrieval. Other dbmses hash statement text. This is a good
thing, and probably 3 orders of magnitude faster than parse and
optimization. (Oracle also has options to replace literals with parameters
and match parse trees instead of text, expecting parse costs to be less than
planning costs.) MySQL on a match simply returns the result set. Oracle and
DB2 attempt to rewrite queries to use the DBA selected extracts. The MySQL
approach seems to be almost what you're describing: all it needs