Re: [PERFORM] timestamp indexing
[Jim C. Nasby - Thu at 01:04:53PM -0500]
> What does
>
>   SET enable_seqscan = false;
>   EXPLAIN ANALYZE SELECT * FROM ...
>
> get you? Is it faster?

I was experimenting with this some weeks ago. By now our database server has quite low load numbers and I haven't gotten any complaints about anything being too slow, so I have temporarily stopped working on this issue - so I won't contribute any more gory details at the moment. :-)

I concluded that our "problem" is that we (for performance reasons) store aggregated statistics in the "wrong" tables, and since updating a row in pg effectively means creating a new physical row in the database, the rows in the table are not in chronological order. If "last month's activity" represents something like 7% of the rows in the table, the planner will usually think that a seq scan is better. As time passes and the table grows, it will switch to index scans.

The "old" stuff in the database eventually becomes historical, so the aggregated statistics will not be updated for most of those rows. Hence a forced index scan will often be a bit faster than the suggested table scan. I experimented, and doing an index scan for the 3rd time would usually be faster than doing a full table scan for the 3rd time, but with things not being in cache, the planner was right to suggest that a seq scan was faster due to fewer disk seeks.

The long-term solution for this problem is to build a separate data warehouse system. The short-term solution is to not care at all (or eventually, buy more memory).

As long as the queries are of the form "give me everything since last Monday", it is at least theoretically possible to serve this through partial indexes, and have a cronjob dropping the old indexes and creating new ones every week. Doing table clustering at night would probably also be a solution, but I haven't gotten around to testing it yet; I'm a bit concerned about performance/locking issues.
-- 
Tobias Brox, +47-91700050
Tallinn, Europe

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
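The partial-index idea described above can be sketched in SQL. This is a hypothetical sketch only: the "ticket" table and "created" column are borrowed from later in this thread, the index name is invented, and the cutoff date is arbitrary. Since the predicate of a partial index must be a constant, a weekly cron job would have to drop the stale index and build a fresh one with a new cutoff:

```sql
-- Hypothetical weekly-rebuilt partial index covering only recent rows.
CREATE INDEX ticket_created_recent ON ticket (created)
 WHERE created > '2005-05-23';

-- A query like "everything since last Monday" can then be served from
-- the small partial index instead of a scan over the whole table:
--   SELECT * FROM ticket WHERE created > '2005-05-23';

-- The cron job would later run something like:
--   DROP INDEX ticket_created_recent;
--   CREATE INDEX ticket_created_recent ON ticket (created)
--    WHERE created > '<new cutoff>';
```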
Re: [PERFORM] timestamp indexing
What does

    SET enable_seqscan = false;
    EXPLAIN ANALYZE SELECT * FROM ...

get you? Is it faster?

BTW, I suspect this behavior is because the estimates for the cost of an index scan don't give an appropriate weight to the correlation of the index. The 'sort and index' thread on this list from a few months ago has more info.

-- 
Jim C. Nasby, Database Consultant               [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
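The experiment suggested above can be run as a self-contained psql session. The "ticket" table and "created" column are borrowed from elsewhere in the thread; note that enable_seqscan only discourages sequential scans by inflating their cost estimate, it does not forbid them when no other plan exists:

```sql
-- Disable seq scans for this session only and compare the plans.
SET enable_seqscan = false;
EXPLAIN ANALYZE SELECT * FROM ticket WHERE created > '2005-05-20';

-- Restore the default so later queries in the session are unaffected.
SET enable_seqscan = true;
```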
Re: [PERFORM] timestamp indexing
[Tobias Brox - Tue at 10:06:25AM +0800]
> [Tom Lane - Mon at 01:57:54PM -0400]
> > Your followup hasn't shown up here yet,
>
> I'll check up on that and resend it.

Hrm ... messed-up mail configuration, I suppose. Here we go:

Paul McGarry unintentionally sent a request for more details off-list; since it was intended for the list, I'll send my reply here. While writing up the reply and doing research, I discovered that this is not a problem with indexing timestamps per se, but more with queries of the kind "give me 5% of the table"; it seems they will often prefer a full table scan instead of going via the index.

I think that when I had my university courses on databases, we also learned about flat indexes, where the whole index has to be rebuilt whenever a field is updated or inserted in the middle, and I also think we learned that the table would usually be sorted physically by the primary key on disk. As long as we have strictly incrementing primary keys and timestamps, such a setup would probably be more efficient for queries of the kind "give me all activity for the last two weeks"?

Here follows my reply to Paul, including some gory details:

[Paul McGarry - Mon at 07:59:35PM +1000]
> What version of postgresql are you using and what are the exact
> datatypes and queries?

We are still using 7.4.6, but I suppose that if our issues are completely or partially solved in pg 8, that would make a good case for upgrading :-)

The datatypes I'm indexing are timestamp without time zone.
Actually, I may be on the wrong hunting ground now - the production system froze completely some days ago, basically due to heavy iowait and load on the database server, rendering postgresql completely unresponsive - and back then we had too poor logging to find out which queries were causing it to grind to a halt. Since we had never had such a bad problem before, we didn't know how to handle the situation (we just restarted the entire postgresql; if we had instead killed only the processes running the rogue database queries, we would have had very good tracks of them in the logs). I digress.

The last few days I've looked through profiling logs, and I'm checking whether the accumulatively worst queries can be tuned somehow. Most of them are big joins, but I'm a bit concerned about the amount of "Seq Scan" returned by "explain" despite the fact that only a small fraction of the tables is queried. I reduced the problem to a simple "select * from table where created>xxx" and discovered that it still won't use the index, and still will be costly (though of course not much compared to the big joined query).
The "ticket" table has less than a million rows, around 50k of them created in the last ten days:

NBET=> explain analyze select * from ticket where created>'2005-05-20';
                              QUERY PLAN
--------------------------------------------------------------------------
 Seq Scan on ticket  (cost=0.00..19819.91 rows=89553 width=60)
                     (actual time=535.884..1018.268 rows=53060 loops=1)
   Filter: (created > '2005-05-20 00:00:00'::timestamp without time zone)
 Total runtime: 1069.514 ms
(3 rows)

Anyway, it seems to me that "indexing on timestamp" is not the real issue here, because when restricting by primary key (numeric, sequential ID) the execution time is the same or worse, still doing a sequential scan:

NBET=> explain analyze select * from ticket where id>711167;
                              QUERY PLAN
--------------------------------------------------------------------------
 Seq Scan on ticket  (cost=0.00..19819.91 rows=92273 width=60)
                     (actual time=550.855..1059.843 rows=53205 loops=1)
   Filter: (id > 711167)
 Total runtime: 1110.469 ms
(3 rows)

I've tried running equivalent queries on a table with twice as many rows and width=180; there it will pull from the index both when querying by ID and by timestamp, and it will usually spend less time. Running "select * from ticket" seems to execute ~2x slower than with the restriction.

> I have a 7.3 database with a "timestamp with time zone" field and we
> have to be very careful to explicitly cast values as that in queries
> if it is to use the index correctly. I believe it's an issue that is
> cleared up in newer versions though.

I suppose so - as said, restricting by primary key didn't improve the performance significantly, so I was clearly wrong in suggesting that this is a special issue with indexing a timestamp.

-- 
Tobias Brox, Beijing
Re: [PERFORM] timestamp indexing
[Tom Lane - Mon at 01:57:54PM -0400]
> Your followup hasn't shown up here yet,

I'll check up on that and resend it.

> but if the query is written like
> WHERE timestampcol >= now() - interval 'something'
> then the pre-8.0 planner is not capable of making a good estimate of the
> selectivity of the WHERE clause.
> One solution is to fold the timestamp
> computation to a constant on the client side.

I don't think there are any of those in production; we always compute the timestamps on the client side.

As to my original problem, I read up on table clustering on google. Right - for report performance, we store some aggregates in the table which are updated several times. If I've understood it correctly, the row is physically moved to the tail of the table every time the attribute is updated. I understand that it may make sense to do a full table scan if a random 10% of the rows are to be selected.

Forcing the usage of the index caused a tiny improvement in performance, but only after running the query a few times to be sure the index got buffered :-)

-- 
Tobias Brox, Beijing
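The table-clustering idea discussed here can be sketched as follows, assuming a hypothetical existing index `ticket_created_idx` on `ticket(created)`. CLUSTER rewrites the table in index order, restoring the physical correlation that the repeated aggregate updates destroyed; note that it takes an exclusive lock on the table for the duration, which is exactly the locking concern raised above:

```sql
-- Rewrite the table in the order of the named index (7.4/8.0 syntax).
CLUSTER ticket_created_idx ON ticket;

-- Refresh planner statistics afterwards, since row order has changed.
ANALYZE ticket;
```

The ordering is not maintained automatically - subsequent updates drift out of order again, so the command would need to be rerun periodically (e.g. nightly).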
Re: [PERFORM] timestamp indexing
Tobias Brox <[EMAIL PROTECTED]> writes:
>> What version of PostgreSQL are you using?
> Also answered in my follow-up - "not yet pg8" :)

Your followup hasn't shown up here yet, but if the query is written like

    WHERE timestampcol >= now() - interval 'something'

then the pre-8.0 planner is not capable of making a good estimate of the selectivity of the WHERE clause. One solution is to fold the timestamp computation to a constant on the client side.

			regards, tom lane
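The two forms Tom contrasts can be sketched side by side; the "ticket" table and "created" column are taken from elsewhere in this thread, and the constant cutoff is an arbitrary example of what a client would compute and interpolate:

```sql
-- The form the pre-8.0 planner cannot estimate well: the cutoff is an
-- expression evaluated at run time, opaque to the selectivity estimator.
EXPLAIN SELECT * FROM ticket
 WHERE created >= now() - interval '2 weeks';

-- Folding the computation to a constant on the client side gives the
-- planner a literal it can compare against the column statistics:
EXPLAIN SELECT * FROM ticket
 WHERE created >= '2005-05-16 00:00:00';
```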
Re: [PERFORM] timestamp indexing
[Michael Fuhr - Mon at 07:54:29AM -0600]
> The message subject is "timestamp indexing" but you don't mention
> whether you have an index on the timestamp column. Do you?

Yes. Sorry for not being explicit about that.

> Could you post an example query and its EXPLAIN ANALYZE output? If
> the query uses a sequential scan then it might also be useful to see
> the EXPLAIN ANALYZE output with enable_seqscan turned off. Since
> caching can cause a query to be significantly faster after being run
> several times, it might be a good idea to run EXPLAIN ANALYZE three
> times and post the output of the last run.

The actual statement had 6 or 7 joins and was very lengthy. I reduced it to a simple single-join query which still did a sequential scan rather than an index scan (as expected), and I believe I already sent a follow-up mail including "explain analyze". Every "explain analyze" in my previous mail was run until the resulting execution time had more or less stabilized. I will try "set enable_seqscan off" when I get back to the office.

> How many records are in the tables you're querying?

Also answered in my follow-up.

> Are you regularly vacuuming and analyzing the database or the
> individual tables?

Vacuum is run nightly, and I also did a manual "vacuum analyze table" on the table in question.

> Are any of the tables clustered? If so, on what indexes and how often
> are you re-clustering them?

Huh? :)

> What version of PostgreSQL are you using?

Also answered in my follow-up - "not yet pg8" :)
Re: [PERFORM] timestamp indexing
On Mon, May 30, 2005 at 05:19:51PM +0800, Tobias Brox wrote:
> We have a production database with transaction-style data, in most of the
> tables we have a timestamp attribute "created" telling the creation time of
> the table row. Naturally, this attribute is always increasing.

The message subject is "timestamp indexing" but you don't mention whether you have an index on the timestamp column. Do you?

> By now we are hitting the limit where the table data does not fit in caches
> anymore. We have a report section where there are constantly requests for
> things like "sum up all transactions for the last two weeks", and those
> requests seem to do a full table scan, even though only the last parts of
> the table is needed - so by now those reports have started to cause lots of
> iowait.

Could you post an example query and its EXPLAIN ANALYZE output? If the query uses a sequential scan then it might also be useful to see the EXPLAIN ANALYZE output with enable_seqscan turned off. Since caching can cause a query to be significantly faster after being run several times, it might be a good idea to run EXPLAIN ANALYZE three times and post the output of the last run - that should put the queries under comparison on a somewhat equal footing (i.e., we don't want to be misled about how much faster one query is than another simply because one query happened to use more cached data on a particular run).

How many records are in the tables you're querying? Are you regularly vacuuming and analyzing the database or the individual tables? Are any of the tables clustered? If so, on what indexes and how often are you re-clustering them? What version of PostgreSQL are you using?

-- 
Michael Fuhr
http://www.fuhr.org/~mfuhr/
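The maintenance and diagnostic steps asked about above can be sketched in SQL, using the "ticket" table named later in the thread (the column names in the second query are real columns of the pg_stats view):

```sql
-- Reclaim dead row versions and refresh the planner's statistics.
VACUUM ANALYZE ticket;

-- The planner's notion of physical ordering is visible in pg_stats:
-- a correlation near 1.0 means the column closely follows table order,
-- while a value near 0 means rows are scattered (as happens after many
-- updates, since updated rows are rewritten at the end of the table).
SELECT attname, correlation
  FROM pg_stats
 WHERE tablename = 'ticket' AND attname IN ('id', 'created');
```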