[PERFORM] inconsistent/weird index usage
To save some time, let me start by saying:

PostgreSQL 7.4.3 on powerpc-apple-darwin7.4.0, compiled by GCC gcc (GCC) 3.3 20030304 (Apple Computer, Inc. build 1640)

OK, now on to details...

I'm trying to implement oracle style ``partitions'' in postgres. I've run into my first snag on what should be a fairly quick query. Basically, I started with the following schema and split the ``samples'' table into one table for each year (1999-2004).

-- BEGIN SCHEMA

create table sensor_types (
    sensor_type_id serial,
    sensor_type text not null,
    units varchar(10) not null,
    primary key(sensor_type_id)
);

create table sensors (
    sensor_id serial,
    sensor_type_id integer not null,
    serial char(16) not null,
    name text not null,
    low smallint not null,
    high smallint not null,
    active boolean default true,
    primary key(sensor_id),
    foreign key(sensor_type_id) references sensor_types(sensor_type_id)
);
create unique index sensors_byserial on sensors(serial);

create table samples (
    ts datetime not null,
    sensor_id integer not null,
    sample float not null,
    foreign key(sensor_id) references sensors(sensor_id)
);
create index samples_bytime on samples(ts);
create unique index samples_bytimeid on samples(ts, sensor_id);

-- END SCHEMA

Each samples_[year] table looks, and is indexed, exactly as the above samples table was, using the following commands:

create index samples_1999_bytime on samples_1999(ts);
create index samples_2000_bytime on samples_2000(ts);
create index samples_2001_bytime on samples_2001(ts);
create index samples_2002_bytime on samples_2002(ts);
create index samples_2003_bytime on samples_2003(ts);
create index samples_2004_bytime on samples_2004(ts);

create unique index samples_1999_bytimeid on samples_1999(ts, sensor_id);
create unique index samples_2000_bytimeid on samples_2000(ts, sensor_id);
create unique index samples_2001_bytimeid on samples_2001(ts, sensor_id);
create unique index samples_2002_bytimeid on samples_2002(ts, sensor_id);
create unique index samples_2003_bytimeid on samples_2003(ts, sensor_id);
create unique index samples_2004_bytimeid on samples_2004(ts, sensor_id);

The tables contain the following number of rows:

    samples_1999     311030
    samples_2000    2142245
    samples_2001    2706571
    samples_2002    3111602
    samples_2003    3149316
    samples_2004    2375972

The following view creates the illusion of the old ``single-table'' model:

create view samples as
    select * from samples_1999
    union select * from samples_2000
    union select * from samples_2001
    union select * from samples_2002
    union select * from samples_2003
    union select * from samples_2004;

...along with the following rule on the view for the applications performing inserts:

create rule sample_rule as on insert to samples
    do instead
    insert into samples_2004 (ts, sensor_id, sample)
        values(new.ts, new.sensor_id, new.sample);

OK, now that that's over with, I have this one particular query that I attempt to run for a report from my phone that no longer works because it tries to do a table scan on *some* of the tables. Why it chooses this table scan, I can't imagine.
The query is as follows:

select s.serial as serial_num, s.name as name, date(ts) as day,
       min(sample) as min_temp, avg(sample) as avg_temp,
       stddev(sample) as stddev_temp, max(sample) as max_temp
  from samples inner join sensors s using (sensor_id)
 where ts > current_date - 7
 group by serial_num, name, day
 order by serial_num, day desc;

explain analyze reports the following (sorry for the horrible wrapping):

 Sort  (cost=1185281.45..1185285.95 rows=1800 width=50) (actual time=82832.106..82832.147 rows=56 loops=1)
   Sort Key: s.serial, date(samples.ts)
   ->  HashAggregate  (cost=1185161.62..1185184.12 rows=1800 width=50) (actual time=82830.624..82831.601 rows=56 loops=1)
         ->  Hash Join  (cost=1063980.21..1181539.96 rows=206952 width=50) (actual time=80408.123..81688.590 rows=66389 loops=1)
               Hash Cond: ("outer".sensor_id = "inner".sensor_id)
               ->  Subquery Scan samples  (cost=1063979.10..1155957.38 rows=4598914 width=20) (actual time=80392.477..80922.764 rows=66389 loops=1)
                     ->  Unique  (cost=1063979.10..1109968.24 rows=4598914 width=20) (actual time=80392.451..80646.761 rows=66389 loops=1)
                           ->  Sort  (cost=1063979.10..1075476.39 rows=4598914 width=20) (actual time=80392.437..80442.787 rows=66389 loops=1)
                                 Sort Key: ts, sensor_id, sample
                                 ->  Append  (cost=0.00..312023.46 rows=4598914 width=20) (actual time=79014.428..80148.396 rows=66389 loops=1)
                                       ->  Subquery Scan "*SELECT* 1"  (cost=0.00..9239.37 rows=103677 width=20) (actual time=4010.181..4010.181 rows=0 loops=1)
Re: [PERFORM] inconsistent/weird index usage
Dustin Sallings wrote:

> [...] OK, now that that's over with, I have this one particular query
> that I attempt to run for a report from my phone that no longer works
> because it tries to do a table scan on *some* of the tables. Why it
> chooses this table scan, I can't imagine. The query is as follows:
>
> select s.serial as serial_num, s.name as name, date(ts) as day,
>        min(sample) as min_temp, avg(sample) as avg_temp,
>        stddev(sample) as stddev_temp, max(sample) as max_temp
>   from samples inner join sensors s using (sensor_id)
>  where ts > current_date - 7
>  group by serial_num, name, day
>  order by serial_num, day desc

[ next section heavily clipped for clarity ]

> ->  Seq Scan on samples_1999 (cost rows=103677) (actual rows=0 loops=1)
> ->  Index Scan using samples_2000_bytime on samples_2000 (cost rows=714082) (actual rows=0 loops=1)
> ->  Seq Scan on samples_2001 (cost rows=902191) (actual rows=0 loops=1)
> ->  Seq Scan on samples_2002 (cost rows=1037201) (actual rows=0 loops=1)
> ->  Index Scan using samples_2003_bytime on samples_2003 (cost rows=1049772) (actual rows=0 loops=1)
> ->  Index Scan using samples_2004_bytime on samples_2004 (cost rows=791991) (actual rows=66389 loops=1)

[...]

> Essentially, what you can see here is that it's doing an index scan on
> samples_2000, samples_2003, and samples_2004, but a sequential scan on
> samples_1999, samples_2001, and samples_2002. It's very strange to me
> that it would make these choices. If I disable sequential scans
> altogether for this session, the query runs in under 4 seconds. This
> is a very cool solution for long-term storage, and isn't terribly hard
> to manage. I actually have other report queries that seem to be making
> pretty good index selection currently...but I always want more! :)
> Does anyone have any suggestions as to how to get this to do what I
> want? Of course, ideally, it would ignore five of the tables
> altogether. :)

Just as a heads up: you have run VACUUM ANALYZE before running this query, correct? Because you'll notice that the query planner is thinking that it will get 103677 rows from 1999, 700,000 rows from 2000, 900,000 rows from 2001, etc, etc. Obviously the query planner is not planning well, considering there are only 60,000 rows from 2004 and no rows from anything else. It just seems like it hasn't updated its statistics to reflect what time ranges most of the tables actually cover. (By the way, an indexed scan returning 0 entries is *really* fast, so I wouldn't worry about ignoring the extra tables. :)

I suppose the other question is whether this is a prepared or stored query, because sometimes the query planner cannot do enough optimization in a stored query. (I ran into this problem where I had 1 column with 500,000+ entries referencing 1 number. If I ran the query manually, the time was much better because I wasn't using *that* number. With a stored query, it had to take into account that I *might* use that number, and didn't want to do 500,000+ indexed lookups.)

The only other thing I can think of is that there might be some collision between datetime and date. Like it is thinking it is looking at the time of day when it plans the queries (hence why so many rows), but really it is looking at the date. Perhaps a cast is in order to make it work right. I don't really know. Interesting problem, though.
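[If the statistics are indeed stale, a plain VACUUM ANALYZE over each partition table should bring the planner's row estimates back in line; a minimal sketch, using the table names from the schema upthread:]

VACUUM ANALYZE samples_1999;
VACUUM ANALYZE samples_2000;
VACUUM ANALYZE samples_2001;
VACUUM ANALYZE samples_2002;
VACUUM ANALYZE samples_2003;
VACUUM ANALYZE samples_2004;
VACUUM ANALYZE sensors;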
John =:->
Re: [PERFORM] inconsistent/weird index usage
Dustin Sallings [EMAIL PROTECTED] writes:

> The following view creates the illusion of the old ``single-table''
> model:
>
> create view samples as
>     select * from samples_1999
>     union select * from samples_2000
>     union select * from samples_2001
>     union select * from samples_2002
>     union select * from samples_2003
>     union select * from samples_2004

You really, really, really want to use UNION ALL not UNION here.

> OK, now that that's over with, I have this one particular query that I
> attempt to run for a report from my phone that no longer works because
> it tries to do a table scan on *some* of the tables. Why it chooses
> this table scan, I can't imagine.

Most of the problem here comes from the fact that current_date - 7 isn't reducible to a constant and so the planner is making bad guesses about how much of each table will be scanned. If possible, do the date arithmetic on the client side and send over a simple literal constant. If that's not practical you can fake it with a mislabeled IMMUTABLE function --- see the list archives for previous discussions of the same issue.

			regards, tom lane
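[A minimal sketch of the mislabeled-IMMUTABLE-function trick Tom mentions; the helper name ago() is invented. Because the function is declared IMMUTABLE, the planner folds ago(7) to a constant at plan time, restoring sane per-table estimates; the cheat is that the "constant" really changes from day to day, so it must not be baked into long-lived prepared plans:]

CREATE FUNCTION ago(integer) RETURNS timestamp AS '
    SELECT (current_date - $1)::timestamp;
' LANGUAGE sql IMMUTABLE;

-- the report query then becomes:
select s.serial as serial_num, s.name as name, date(ts) as day,
       min(sample) as min_temp, avg(sample) as avg_temp,
       stddev(sample) as stddev_temp, max(sample) as max_temp
  from samples inner join sensors s using (sensor_id)
 where ts > ago(7)
 group by serial_num, name, day
 order by serial_num, day desc;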
Re: [PERFORM] inconsistent/weird index usage
Dustin Sallings wrote:

> The following view creates the illusion of the old ``single-table''
> model:
>
> create view samples as
>     select * from samples_1999
>     union select * from samples_2000
>     [...]

Try this with UNION ALL (you know there won't be any duplicates) and possibly with some limits too:

SELECT * FROM samples_1999
 WHERE ts BETWEEN '1999-01-01 00:00:00+00' AND '1999-12-31 23:59:59+00'
UNION ALL
...

> select s.serial as serial_num, s.name as name, date(ts) as day,
>        min(sample) as min_temp, avg(sample) as avg_temp,
>        stddev(sample) as stddev_temp, max(sample) as max_temp
>   from samples inner join sensors s using (sensor_id)
>  where ts > current_date - 7
>  group by serial_num, name, day
>  order by serial_num, day desc

Try restricting the timestamp too:

WHERE ts BETWEEN (current_date - 7) AND current_timestamp

Hopefully that will give the planner enough smarts to know it can skip most of the sample_200x tables.

-- 
Richard Huxton
Archonet Ltd
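[Putting Richard's two suggestions together, the rewritten view might look like this -- a sketch; the BETWEEN bounds simply restate what each yearly table is known to contain:]

create view samples as
    select * from samples_1999
     where ts between '1999-01-01 00:00:00+00' and '1999-12-31 23:59:59+00'
    union all
    select * from samples_2000
     where ts between '2000-01-01 00:00:00+00' and '2000-12-31 23:59:59+00'
    union all
    select * from samples_2001
     where ts between '2001-01-01 00:00:00+00' and '2001-12-31 23:59:59+00'
    union all
    select * from samples_2002
     where ts between '2002-01-01 00:00:00+00' and '2002-12-31 23:59:59+00'
    union all
    select * from samples_2003
     where ts between '2003-01-01 00:00:00+00' and '2003-12-31 23:59:59+00'
    union all
    select * from samples_2004
     where ts between '2004-01-01 00:00:00+00' and '2004-12-31 23:59:59+00';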
Re: [PERFORM] Caching of Queries
On Fri, Oct 01, 2004 at 06:43:42AM +0100, Matt Clark wrote:

> > If you're not using a connection pool of some kind then you might as
> > well forget query plan caching, because your connect overhead will
> > swamp the planning cost. This does not mean you have to use something
> > like pgpool (which makes some rather questionable claims IMO); any
> > decent web application language/environment will support connection
> > pooling.
>
> Hmm, a question of definition - there's a difference between a pool and
> a persistent connection. Pretty much all web apps have one connection
> per process, which is persistent (i.e. not dropped and remade for each
> request), but not shared between processes, therefore not pooled.

OK, that'd work too... the point is if you're re-connecting all the time it doesn't really matter what else you do for performance.

-- 
Jim C. Nasby, Database Consultant   [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828

Windows: Where do you want to go today?
Linux: Where do you want to go tomorrow?
FreeBSD: Are you guys coming, or what?
Re: [PERFORM] Caching of Queries
> OK, that'd work too... the point is if you're re-connecting all the
> time it doesn't really matter what else you do for performance.

Yeah, although there is the chap who was asking questions on the list recently who had some very long-running code on his app servers, so he was best off closing the connection because he had far too many postmaster processes just sitting there idle all the time! But you're right, it's usually a killer.

M
Re: [PERFORM] inconsistent/weird index usage
Tom,

> Most of the problem here comes from the fact that current_date - 7
> isn't reducible to a constant and so the planner is making bad guesses
> about how much of each table will be scanned.

I thought this was fixed in 7.4. No?

-- 
Josh Berkus
Aglio Database Solutions
San Francisco
Re: [PERFORM] inconsistent/weird index usage
Josh Berkus [EMAIL PROTECTED] writes:

> > Most of the problem here comes from the fact that current_date - 7
> > isn't reducible to a constant and so the planner is making bad
> > guesses about how much of each table will be scanned.
>
> I thought this was fixed in 7.4. No?

No. It's not fixed as of CVS tip either, although there was some talk of doing something in time for 8.0.

			regards, tom lane
Re: [PERFORM] Caching of Queries
The context of the discussion was a hack to speed queries against static tables, so MVCC is not relevant. As soon as any work unit against a referenced table commits, the cache is invalid, and in fact the table shouldn't be a candidate for this caching for a while. In fact, this cache would reduce some of the MVCC 'select count(*) from us_states' type of horrors.

(The attraction of a server side cache is obviously that it could *with no server or app changes* dramatically improve performance. A materialized view is a specialized denormalization-ish mechanism to optimize a category of queries, and requires the DBA to sweat the details. It is very hard to cache things stochastically without writing a server. Trigger-managed extracts won't help you execute 1,000 programs issuing the query "select sec_level from sec where division=23" each second, or a big table loaded monthly.)

- Original Message -
From: Jeff [EMAIL PROTECTED]
To: Mitch Pirtle [EMAIL PROTECTED]
Cc: Aaron Werman [EMAIL PROTECTED]; Scott Kirkwood [EMAIL PROTECTED]; Neil Conway [EMAIL PROTECTED]; [EMAIL PROTECTED]; Tom Lane [EMAIL PROTECTED]
Sent: Monday, September 27, 2004 2:25 PM
Subject: Re: [PERFORM] Caching of Queries

[ discussion of server side result caching ]

> and lets not forget PG's major fork it will throw into things: MVCC.
> The results of query A may hold true for txn 1, but not txn 2 and so
> on. That would have to be taken into account as well and would greatly
> complicate things.
>
> It is always possible to do a poor man's query cache with triggers...
> which would just leave you with basically a materialized view.
>
> -- 
> Jeff Trout [EMAIL PROTECTED]
> http://www.jefftrout.com/
> http://www.stuarthamm.net/
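[For what it's worth, the trigger-managed extract discussed above can be tiny. A sketch with invented names, targeting the count(*) horror mentioned earlier; the us_states definition is a stand-in:]

create table us_states (abbrev char(2) primary key, name text);

-- the extract: one row holding the current count
create table us_state_count (n bigint);
insert into us_state_count select count(*) from us_states;

create function us_state_count_sync() returns trigger as '
begin
    -- keep the summary row in step with inserts and deletes
    if TG_OP = ''INSERT'' then
        update us_state_count set n = n + 1;
    elsif TG_OP = ''DELETE'' then
        update us_state_count set n = n - 1;
    end if;
    return null;  -- AFTER trigger; return value is ignored
end;
' language plpgsql;

create trigger us_states_count_sync
    after insert or delete on us_states
    for each row execute procedure us_state_count_sync();

-- readers now do a one-row fetch instead of a full scan:
select n from us_state_count;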
Re: [PERFORM] Caching of Queries
People:

Transparent query caching is the industry standard for how these things are handled. However, Postgres' lack of this feature has made me consider other approaches, and I'm starting to wonder if the standard query caching -- where a materialized query result, or some reduction thereof, is cached in database memory -- isn't the best way to cache things. I'm going to abbreviate it SQC for the rest of this e-mail.

Obviously, the draw of SQC is its transparency to developers. With it, the Java/Perl/PHP programmers and the DBA don't have to communicate at all -- you set it up, give it some RAM, and it just works. As someone who frequently has to consult based on limited knowledge, I can understand the appeal.

However, one of the problems with SQC, aside from the ones already mentioned of stale data and/or cache-clearing, is that (at least in applications like MySQL's) it is indiscriminate and caches, at least briefly, unique queries as readily as common ones. Possibly Oracle's implementation is more sophisticated; I've not had an opportunity to compare.

The other half of that problem is that an entire query is cached, rather than just the relevant data to uniquely identify the request to the application. This is bad in two respects: one, that the entire query needs to be parsed to see if a new query is materially equivalent; and two, that two materially different queries which could utilize overlapping ranges of the same underlying result set must instead cache their results separately, eating up yet more memory.

To explain what I'm talking about, let me give you a counter-example of another approach.

I have a data-warehousing application with a web front-end. The data in the application is quite extensive and complex, and only a summary is presented to the public users -- but that summary is a query involving about 30 lines and 16 joins. This summary information is available in 3 slightly different forms. Further, the client has indicated that an up to 1/2 hour delay in data freshness is acceptable.

The first step is forcing that materialized view of the data into memory. Right now I'm working on a reliable way to do that without using Memcached, which won't install on our Solaris servers. Temporary tables have the annoying property of being per-connection, which doesn't work in a pool of 60 connections.

The second step, which I completed first due to the lack of technical obstacles, is to replace all queries against this data with calls to a Set-Returning Function (SRF). This allowed me to redirect where the data was coming from -- presumably the same thing could be done through RULES, but it would have been considerably harder to implement.

The first thing the SRF does is check the criteria passed to it against a set of cached (in a table) criteria, with that user's permission level, which is up to 1/2 hour old. If the same criteria are found, then the SRF gets back a set of row identifiers for the materialized view (MV), looks up the rows in the MV, and returns those to the web client. If no identical set of criteria are found, then the query is run to get a set of identifiers, which are then cached, and the SRF returns the queried rows.

Once I surmount the problem of storing all the caching information in protected memory, the advantages of this approach over SQC are several:

1) The materialized data is available in 3 different forms: a list, a detail view, and a spreadsheet. Each form has somewhat different columns and different rules about ordering, which would likely confuse an SQC planner. In this implementation, all 3 forms are able to share the same cache.

2) The application is comparing only sets of unambiguous criteria, rather than long queries which would need to be compared in planner form in order to determine query equivalence.

3) With the identifier sets, we are able to cache other information as well, such as a count of rows, further limiting the number of queries we must run.

4) This approach is ideally suited to the pagination and re-sorting common to a web result set. As only the identifiers are cached, the results can be re-sorted and broken into pages after the cache read, a fast, all-in-memory operation.

In conclusion, what I'm saying is that while forms of transparent query caching (plan, materialized or whatever) may be desirable for other reasons, it's quite possible to achieve a superior level of query caching through tight integration with the front-end application.

If people are interested in this, I'd love to see some suggestions on ways to force the materialized view into dedicated memory.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco
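[To make the shape of the SRF described above concrete, here is a rough sketch. Every name in it -- summary_mv, summary_cache, the single payload column, the textual criteria -- is invented, since the real schema isn't shown, and the cache-miss branch stands in for the real 30-line, 16-join query:]

create table summary_mv (             -- stand-in for the materialized view
    id         serial primary key,
    perm_level integer,
    payload    text
);

create table summary_cache (          -- criteria seen in the last 1/2 hour
    criteria   text,                  -- canonical text form of the criteria
    perm_level integer,
    cached_at  timestamp,
    row_id     integer                -- one row per cached MV identifier
);

create function summary_search(text, integer) returns setof summary_mv as '
declare
    p_criteria alias for $1;
    p_perm     alias for $2;
    r          summary_mv%rowtype;
begin
    -- is an identical, fresh criteria set already cached?
    perform 1 from summary_cache
     where criteria = p_criteria and perm_level = p_perm
       and cached_at > now() - interval ''30 minutes'';
    if not found then
        -- cache miss: run the expensive query once, keep only the row ids
        delete from summary_cache
         where criteria = p_criteria and perm_level = p_perm;
        insert into summary_cache (criteria, perm_level, cached_at, row_id)
        select p_criteria, p_perm, now(), id
          from summary_mv
         where perm_level <= p_perm;  -- stand-in for the real criteria
    end if;
    -- serve the rows named by the cached identifiers
    for r in select mv.* from summary_mv mv, summary_cache c
              where c.criteria = p_criteria and c.perm_level = p_perm
                and mv.id = c.row_id loop
        return next r;
    end loop;
    return;
end;
' language plpgsql;

-- the web tier then calls, e.g.:
-- select * from summary_search('region=CA', 2);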
[PERFORM] Slow update/insert process
Pg: 7.4.5
RH 7.3
8g Ram
200 g drive space
RAID0+1
Tables vacuumed on a nightly basis

The following process below takes 8 hours to run on 90k records and I'm not sure where to begin to look for the bottleneck. This isn't the only updating on this database that seems to take a long time to complete. Is there something I should be looking for in my conf settings?

TIA
Patrick

SQL:

--Bring back only selected records to run through the update process.
--Without the function the SQL takes 10secs to return 90,000 records

SELECT count(pm.pm_delta_function_amazon(upc.keyp_upc,'amazon'))
  FROM mdc_upc upc
  JOIN public.mdc_products prod ON upc.keyf_products = prod.keyp_products
  JOIN public.mdc_price_post_inc price ON prod.keyp_products = price.keyf_product
  JOIN public.mdc_attribute_product ap ON ap.keyf_products = prod.keyp_products AND keyf_attribute = 22
 WHERE upper(trim(ap.attributevalue)) NOT IN ('ESTEE LAUDER', 'CLINIQUE', 'ORGINS', 'PRESCRIPTIVES', 'LANC?ME', 'CHANEL', 'ARAMIS', 'M.A.C', 'TAG HEUER')
   AND keyf_producttype <> 222
   AND prod.action_publish = 1;

Function:

CREATE OR REPLACE FUNCTION pm.pm_delta_function_amazon(int4, varchar) RETURNS bool AS '
DECLARE
    varkeyf_upc      ALIAS FOR $1;
    varPassword      ALIAS FOR $2;
    varRealMD5       varchar;
    varDeltaMD5      varchar;
    varLastTouchDate date;
    varQuery         text;
    varQuery1        text;
    varQueryMD5      text;
    varQueryRecord   record;
    varFuncStatus    boolean := false;
BEGIN
    -- Check the password
    IF varPassword <> \'amazon\' THEN
        RETURN false;
    END IF;

    -- Get the md5 hash for this product
    SELECT INTO varQueryRecord
           md5(upc.keyp_upc || prod.description || pm.pm_price_post_inc(prod.keyp_products)) AS md5
      FROM public.mdc_upc upc
      JOIN public.mdc_products prod ON upc.keyf_products = prod.keyp_products
      JOIN public.mdc_price_post_inc price ON price.keyf_product = prod.keyp_products
     WHERE upc.keyp_upc = varkeyf_upc
     LIMIT 1;

    IF NOT FOUND THEN
        RAISE EXCEPTION \'varRealMD5 is NULL. UPC ID is %\', varkeyf_upc;
    ELSE
        varRealMD5 := varQueryRecord.md5;
    END IF;

    -- Check that the product is in the delta table and return its hash for comparison
    SELECT INTO varQueryRecord md5_hash, last_touch_date
      FROM pm.pm_delta_master_amazon
     WHERE keyf_upc = varkeyf_upc
     LIMIT 1;

    IF NOT FOUND THEN
        -- ADD and exit
        INSERT INTO pm.pm_delta_master_amazon (keyf_upc, status, md5_hash, last_touch_date)
        VALUES (varkeyf_upc, \'add\', varRealMD5, CURRENT_DATE);
        varFuncStatus := true;
        RETURN varFuncStatus;
    ELSE
        -- Update the record: if the hash matches then set the record to HOLD
        IF varRealMD5 = varQueryRecord.md5_hash THEN
            UPDATE pm.pm_delta_master_amazon
               SET status = \'hold\', last_touch_date = CURRENT_DATE
             WHERE keyf_upc = varkeyf_upc AND last_touch_date <> CURRENT_DATE;
            varFuncStatus := true;
        ELSE
            -- ELSE mark the item as ADD
            UPDATE pm.pm_delta_master_amazon
               SET status = \'add\', last_touch_date = CURRENT_DATE
             WHERE keyf_upc = varkeyf_upc;
            varFuncStatus := true;
        END IF;
    END IF;

    RETURN varFuncStatus;
END;
' LANGUAGE 'plpgsql' IMMUTABLE;

TableDef:

CREATE TABLE pm.pm_delta_master_amazon (
    keyf_upc        int4,
    status          varchar(6),
    md5_hash        varchar(40),
    last_touch_date date
)
GO

CREATE INDEX status_idx ON pm.pm_delta_master_amazon(status)
GO

CONF:

# WRITE AHEAD LOG
#---------------------------------------------------------------------------
# - Settings -
#fsync = true             # turns forced synchronization on or off
#wal_sync_method = fsync  # the default varies across platforms:
                          # fsync, fdatasync, open_sync, or open_datasync
wal_buffers = 32          # min 4, 8KB each

# - Checkpoints -
checkpoint_segments = 50  # in logfile segments, min 1, 16MB each
checkpoint_timeout = 600  # range 30-3600, in seconds
#checkpoint_warning = 30  # 0 is off, in seconds
#commit_delay = 0         # range 0-100000, in microseconds
#commit_siblings = 5      # range 1-1000

Patrick Hatcher
Macys.Com
Re: [PERFORM] Slow update/insert process
Some quick notes:

- Using a side effect of a function to update the database feels bad to me.
- How long does the "SELECT INTO varQueryRecord md5(upc.keyp_upc || ...)" query take on its own / what does its EXPLAIN look like?
- There are a lot of non-indexed columns on that delta master table, such as keyf_upc. I'm guessing you're doing 90,000 x {a lot of slow scans}.
- My temptation would be to rewrite the processing to do a pass of updates, a pass of inserts, and then the SELECT -- a rough sketch follows below.

- Original Message -
From: Patrick Hatcher
To: [EMAIL PROTECTED]
Sent: Friday, October 01, 2004 2:14 PM
Subject: [PERFORM] Slow update/insert process

[ original message quoted in full; clipped -- see above ]
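[A rough, untested sketch of that update-pass/insert-pass rewrite, trimming the attribute filters for brevity; the temp table name "candidate" is invented, everything else comes from the original post:]

-- compute every candidate product's hash once, in one set-oriented query
create temp table candidate as
select upc.keyp_upc as keyf_upc,
       md5(upc.keyp_upc || prod.description ||
           pm.pm_price_post_inc(prod.keyp_products)) as md5_hash
  from public.mdc_upc upc
  join public.mdc_products prod on upc.keyf_products = prod.keyp_products
  join public.mdc_price_post_inc price on price.keyf_product = prod.keyp_products
 where prod.action_publish = 1;

-- pass 1: rows already in the delta table -- hold if unchanged, add if changed
update pm.pm_delta_master_amazon
   set status = case when pm_delta_master_amazon.md5_hash = c.md5_hash
                     then 'hold' else 'add' end,
       md5_hash = c.md5_hash,
       last_touch_date = CURRENT_DATE
  from candidate c
 where pm_delta_master_amazon.keyf_upc = c.keyf_upc;

-- pass 2: brand-new products
insert into pm.pm_delta_master_amazon (keyf_upc, status, md5_hash, last_touch_date)
select c.keyf_upc, 'add', c.md5_hash, CURRENT_DATE
  from candidate c
 where not exists (select 1 from pm.pm_delta_master_amazon d
                    where d.keyf_upc = c.keyf_upc);

-- and the index both passes (and the per-row function) want:
create index pm_delta_master_amazon_upc_idx
    on pm.pm_delta_master_amazon (keyf_upc);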
Re: [PERFORM] Caching of Queries
I'm not sure I understand your requirement fully. If the same request is repeatedly done with the same parameters, you could implement a proxy web server with a cron'd script to purge stale pages. If substantially the same data is being summarized, doing your own summary tables works; if accessed enough, they're in memory. I interleaved some notes into your posting.

- Original Message -
From: Josh Berkus [EMAIL PROTECTED]
To: Postgresql Performance [EMAIL PROTECTED]
Sent: Friday, October 01, 2004 1:10 PM
Subject: Re: [PERFORM] Caching of Queries

> Transparent query caching is the industry standard for how these
> things are handled. However, Postgres' lack of this feature has made
> me consider other approaches, and I'm starting to wonder if the
> standard query caching -- where a materialized query result, or some
> reduction thereof, is cached in database memory -- isn't the best way
> to cache things. I'm going to abbreviate it SQC for the rest of this
> e-mail.
>
> Obviously, the draw of SQC is its transparency to developers. With it,
> the Java/Perl/PHP programmers and the DBA don't have to communicate at
> all -- you set it up, give it some RAM, and it just works. As someone
> who frequently has to consult based on limited knowledge, I can
> understand the appeal.

My sense is that pg is currently unique among popular dbmses in having the majority of applications being homegrown (a chicken / egg / advocacy issue - if I install a CMS, I'm not the DBA or the PHP programmer, and I don't want to change the code; we'll see more about this when native WinPg happens).

> However, one of the problems with SQC, aside from the ones already
> mentioned of stale data and/or cache-clearing, is that (at least in
> applications like MySQL's) it is indiscriminate and caches, at least
> briefly, unique queries as readily as common ones. Possibly Oracle's
> implementation is more sophisticated; I've not had an opportunity.

I'm not sure I agree here. Stale data and caching choice are optimizer/buffer manager choices, and an implementation can decide whether to allow stale data. These are design choices involving development effort and choices of where to spend server cycles and memory. All buffering choices cache unique objects; I'm not sure why this is bad (but I sense you want control of the choices).

FWIW, this is my impression of other dbmses. In MySQL, a global cache can be specified with a size, and globally, locally, or through statement hints in queries you can suggest caching results. I don't believe that these can be used as common subexpressions (with an exception of MERGE table component results). The optimizer knows nothing about the cached results - SQL select statements are hashed, and can be replaced by the cached statement/results on a match.

In DB2 and Oracle, result sets are not cached. They have rich sets of materialized view features (that match your requirements). They allow a materialized view to be synchronous with table updates, or asynchronous. Synchronous is often an unrealistic option, and asynchronous materialized views are refreshed on a specified schedule. The optimizers allow query rewrite (in Oracle it is a session option), so one can connect to the database and specify that the optimizer is allowed to replace subexpressions with data from (possibly stale) materialized views. SQL Server 2K has more restrictive synchronous MVs, but I've never used them.

So, for your example use in Oracle, you would need to define appropriate MVs with a 1/2 hour refresh frequency, and hope that the planner would use them in your queries. The only change in the app is that on connection you would allow use of asynchronous stale data.

You're suggesting an alternative involving identifying common, but expensive, subexpressions and generating MVs for them. This is a pretty sophisticated undertaking, and probably requires some theory research to determine if it's viable.

> The other half of that problem is that an entire query is cached,
> rather than just the relevant data to uniquely identify the request to
> the application. This is bad in two respects; one that the entire
> query needs to be parsed to see if a new query is materially
> equivalent, and that two materially different queries which could
> utilize overlapping ranges of the same underlying result set must
> instead cache their results separately, eating up yet more memory.

There are two separate issues: the cost of parse/optimization, and the cost of results retrieval. Other dbmses hash statement text. This is a good thing, and probably 3 orders of magnitude faster than parse and optimization. (Oracle also has options to replace literals with parameters and match parse trees instead of text, expecting parse costs to be less than planning costs.) MySQL, on a match, simply returns the result set. Oracle and DB2 attempt to rewrite queries to use the DBA-selected extracts. The MySQL approach seems to be almost what you're describing: all it needs
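[To illustrate the Oracle arrangement Aaron describes above -- an asynchronous MV on a half-hour refresh schedule, with query rewrite allowed against stale data -- a hypothetical sketch in Oracle SQL, with invented names:]

CREATE MATERIALIZED VIEW division_sec_mv
    BUILD IMMEDIATE
    REFRESH FORCE
    START WITH SYSDATE NEXT SYSDATE + 1/48  -- refresh every half hour
    ENABLE QUERY REWRITE
AS
SELECT division, sec_level, COUNT(*) AS cnt
  FROM sec
 GROUP BY division, sec_level;

-- per session, permit the optimizer to rewrite queries against
-- possibly stale materialized views:
ALTER SESSION SET QUERY_REWRITE_INTEGRITY = STALE_TOLERATED;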