[HACKERS] Index scan ordering (performance)
Hi all,

I'm looking at ways to improve PG performance where index scans are concerned. The layout of our database and our access habits really hit PG's index code square in the chops: many of our queries spend far too much time waiting for disk seeks, and I believe this is caused entirely by the way PG does index scans.

The TODO list has an entry "Order duplicate index entries by tid for faster heap lookups". I think this change would prove *very* beneficial for us. Is anybody working on it? Is it considered important by many people?

Ordering duplicate index entries would obviously help for... uh... duplicate entries, but what about non-duplicates being accessed in a range scan? If the executor handled the ordering at read time, it would work for ranges as well. Would it be wildly difficult to have the executor scan more than one index entry before doing the related heap accesses? That way you could read X index entries (or all of them), order the resulting list by tid, and then go after the heap. The only side effect I can see is tid-ordered default output rather than index-ordered default output (which does not affect SQL compliance).

Any idea what the status of this is? Should I dive in and try to do it myself?

Glen Parker
[EMAIL PROTECTED]
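[Editor's sketch of the batching idea floated above, not PostgreSQL executor code: gather a batch of index entries, qsort() them by heap block and line offset, then visit the heap in that order. The `tid` struct and every identifier here are hypothetical stand-ins.]

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical stand-in for an index entry's heap pointer. */
    typedef struct
    {
        unsigned int   block;    /* heap block (page) number */
        unsigned short offset;   /* line pointer within the block */
    } tid;

    /* Order TIDs by physical heap location so fetches sweep forward. */
    static int
    tid_physical_cmp(const void *a, const void *b)
    {
        const tid *ta = a, *tb = b;

        if (ta->block != tb->block)
            return (ta->block < tb->block) ? -1 : 1;
        return (ta->offset > tb->offset) - (ta->offset < tb->offset);
    }

    int
    main(void)
    {
        /* Pretend these came back from an index scan in key order. */
        tid    batch[] = {{9127, 3}, {14, 8}, {770, 1}, {14, 2}, {9127, 1}};
        size_t n = sizeof(batch) / sizeof(batch[0]);

        qsort(batch, n, sizeof(tid), tid_physical_cmp);

        for (size_t i = 0; i < n; i++)
            printf("fetch heap block %u, offset %u\n",
                   batch[i].block, batch[i].offset);
        return 0;
    }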
FW: [HACKERS] Do we prefer software that works or software that looks good?
Tom Lane wrote:
> Personally I don't think that this is a transitional issue and we will
> someday all be happy in upper-case-only-land. Upper-case-only sucks, by
> every known measure of readability, and I don't want to have to put up
> with a database that forces that 1960s-vintage-hardware mindset on me.

I think the SQL standard is screwy here on at least two levels. Not only is upper case fuggly (we all seem to agree on that point), but I think case folding is a Bad Idea in general. The only time you should have to quote a DB identifier is when it conflicts with a reserved word. Either be case sensitive or don't. I'm all for the "ignore but preserve case" way of doing things.

But it IS the standard, and as such, as much as we all seem to dislike it, I believe it is better to follow it. You can't just go around picking and choosing the standards you'll adhere to. Like Microsoft. If it bothers you that much, put some effort into changing it: attain world domination and then force the world to bend to The Right Way, or get rich and pay off enough members of the standards body to get it changed. But until then, live with it.

Now, I am all for configurability, and lots of it. By all means, allow us to choose how case folding is carried out, or whether case folding (blech) is done at all. While you're at it, allow us to choose whether NULL is treated as zero/blank/empty or as SQL-standard NULL. Allow us to force the DB to do case-insensitive comparisons on all character data. Allow us, as DB admins, to f*** with the standard behavior until we have a working mimic of MySQL or MS-SQL :-) But I think the default behavior should adhere to the SQL standard as closely as possible, even when we all hate it with a passion.

Just my $.02

Glen Parker
Re: [HACKERS] PITR Phase 1 - Test results
> I want to come hug you --- where do you live? !!!

You're not the only one. But we don't want to smother the poor guy, at least not before he completes his work :-)
Re: [HACKERS] proposal: be smarter about i/o patterns in index scan
> The basic problem is that Pg seeks far too much when performing an index
> scan. I have saved an strace of a backend which is selecting 170,000 rows
> from a 240,000,000 row table via index scan. The strace shows 134,000
> seeks and reads, or almost one seek for every tuple in the result set.
> This would be fine normally except the seeks are in a *very* suboptimal
> pattern, swinging back and forth across the device. The query requires
> 16 minutes, 30 seconds on our test hardware.

Yes yes yes *please* fix this :-) This performance bottleneck in PG affects us all the time.

> The proposal is to sort the block requests before they are issued.
> Because Pg only issues single seek/read pairs synchronously, the kernel
> has no chance to apply its elevator, and hence every seek/read Pg issues
> results in actual movement of the disk head. Pg's random pattern of
> seeking also makes the kernel's readahead fairly useless.
>
> To prove the proposal I did a simulation, using the recorded seek
> positions in the strace. We noted that Pg issued 403 seek/read pairs for
> every 8192 bytes read from the index. So we performed four simulations:
> a straight playback of Pg's seek pattern, a playback where each list of
> 403 seeks is sorted by byte offset, a playback of all the seeks sorted
> by offset, and a sequential scan of the 13GB data file.
>
>   Straight playback (PostgreSQL 4.2.1):  16m30s
>   Sorted in chunks:                      10m6s
>   Sorted altogether:                     4m19s
>   Sequential scan:                       6m7s
>
> As you can see, there's a lot to be gained here. If Pg was able to issue
> its seeks in the optimal order, the query would return in 1/4th the
> time. Even the chunk-sorted scheme is a big win.

I think your worst case may not be as bad as it gets. Nothing scientific, but my experience tells me it's common to get even worse performance than that. I've had queries that would take seemingly forever, but then after a fresh cluster and cache flush, the same query would take almost no time at all. I also think that with a multi-mirrored RAID setup and your proposed IO sorting, the mirroring would be more likely to overcome seek overhead; with the current implementation, it seems almost useless to throw more hardware at the problem. So I think the improvement would be even 'huger' than your numbers show :-)

> So the proposal is this: the offsets stored in the index ought to be
> sorted. Either A) they can be stored sorted in the first place (sorted
> in VACUUM ANALYZE, or always sorted), or B) the backend can sort each
> list of tuples it reads from the index, or C) the backend can read the
> entire index result and sort that (this would be the best). From just
> paging through the code, it doesn't seem terribly hard.
>
> B seems the easiest to implement but is also the least improvement.
> Even seqscan is better than B. One improvement to B would be to read
> much more than 8K from the index; reading e.g. 256KB would improve
> things dramatically. A might be easy but would either degrade over time
> or cause slower inserts. C is the best and hardest to implement (I am
> guessing, because the size of the sorted index subset is unbounded).

IMO, if you're going to do it, do it right. That means A is pretty much out, again IMO. It would seem that B (improved) is the way to go because, as you say, C would produce unbounded subsets. I would propose reading very large sections of the index subset, though, more on the order of several MB (configurable, of course). With that much space to play with, most queries would require only one pass at the index and therefore show performance equal to C. Larger queries would suffer a bit, but then we don't usually have such speedy expectations of large result sets, and again, with enough sort space to play with, the performance could approach that of C.

I would think you could also physically sort the index itself into index order (during vacuum) to improve index scan performance; that could be a completely separate implementation task. With clean sequential index access *and* clean sequential heap access, this might just prove to be the single largest performance boost PG has seen, well, ever.

My .02...

Glen Parker
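[Editor's rough re-creation of the chunk-sorted simulation described above, offered as a sketch only: traced byte offsets are replayed against the data file, but each chunk is sorted into ascending order before its reads are issued, so the head sweeps one way per chunk. The chunk size of 403 mirrors the figure quoted above; the file name, stand-in offsets, and all identifiers are assumptions for illustration, not the original test harness.]

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    static int
    off_cmp(const void *a, const void *b)
    {
        off_t oa = *(const off_t *) a;
        off_t ob = *(const off_t *) b;

        return (oa > ob) - (oa < ob);
    }

    /* Replay 8K reads at the given offsets, sorting each chunk first. */
    static void
    replay_chunk_sorted(int fd, off_t *offsets, size_t n, size_t chunk)
    {
        char   buf[8192];
        size_t start;

        for (start = 0; start < n; start += chunk)
        {
            size_t len = (n - start < chunk) ? n - start : chunk;
            size_t i;

            qsort(offsets + start, len, sizeof(off_t), off_cmp);

            for (i = 0; i < len; i++)
                if (pread(fd, buf, sizeof(buf), offsets[start + i]) < 0)
                    perror("pread");
        }
    }

    int
    main(int argc, char **argv)
    {
        /* Stand-in offsets; a real run would use the traced seek positions. */
        off_t offsets[] = {40960, 8192, 1228800, 16384, 819200, 0};
        int   fd;

        if (argc < 2)
        {
            fprintf(stderr, "usage: %s datafile\n", argv[0]);
            return 1;
        }
        if ((fd = open(argv[1], O_RDONLY)) < 0)
        {
            perror("open");
            return 1;
        }

        replay_chunk_sorted(fd, offsets, sizeof(offsets) / sizeof(offsets[0]), 403);
        close(fd);
        return 0;
    }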
Re: [HACKERS] proposal: be smarter about i/o patterns in index scan
> -----Original Message-----
> From: [EMAIL PROTECTED] on behalf of Tom Lane
>
> For starters, read the previous discussions of this in the archives.

Search doesn't work. Even if it did, I'm not sure what I would search on. Do you remember the time frame of the discussion? If I could narrow it down to a few months, I could possibly find it the slow way. I'd be very interested to read it.

Anyway, this subject seems to get precious little attention, and I don't understand why. PG's index scan implementation is so bad at some types of queries, I find it surprising that there aren't weekly discussions about it, and developers lined up around the block to work on it. I would be one of those developers, but frankly the learning curve is pretty steep and I don't have any smaller complaints to use as learning tools. What am I missing? Why is a performance bottleneck of this magnitude not on the same list of priorities as PITR, replication, and Win32?

> Two questions you should have answers to before starting to implement,
> rather than after, are how to deal with locking considerations and what
> will be the implications of giving up the property that indexscans
> deliver sorted output.

Here's one answer: if you had to sort every result set, even when an index could have been used, overall performance would still improve by a very large margin. I'd bet money on it. Actually, the index would still have to be scanned in sort order; only the fetch order for heap tuples would be sorted differently.

Glen Parker
Re: [HACKERS] proposal: be smarter about i/o patterns in index scan
> -----Original Message-----
> From: [EMAIL PROTECTED] on behalf of Tom Lane
> Sent: Wednesday, May 19, 2004 11:56 AM
>
> Not unless you add yet another sort step after the fetch step, that is
> the idea becomes
>   1. read index to get candidate TIDs
>   2. sort TIDs into physical order
>   3. read heap in physical order, check row visibility
>   4. sort selected rows into index order
> That would start to look like an unpleasant amount of overhead, since
> the sort would have to be over the entire output; you couldn't divide
> the scan into reasonable-size batches, whereas steps 1-3 can easily be
> run in batched style.

Or:

  2. Sort AND COPY the TIDs into physical order.
  3. Read the heap in physical order, matching results back to the un-sorted TID list.

That sounds quite cheap. Then you get to drop step 4 (which would *usually* be quite a bit more expensive than a TID sort and copy?). The cost of the proposed implementation would be higher *when everything is in the cache*, granted. As a user, I will take that cost 10 times over in return for such a large improvement in the no-cache situation.

-Glen
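[Editor's sketch of the "sort and copy" variant suggested above, using hypothetical types and names rather than executor code: the index-ordered TID list is left alone, a copy carrying back-pointers is sorted into physical order for the heap pass, and each fetched row is written into the slot of its original index position, so no fourth sort is needed to restore index order.]

    #include <stdlib.h>

    typedef struct { unsigned int block; unsigned short offset; } tid;
    typedef struct { tid t; size_t index_pos; } tid_ref;   /* copy + back-pointer */
    typedef struct { int visible; /* ... row data ... */ } fetched_row;

    static int
    tid_ref_cmp(const void *a, const void *b)
    {
        const tid_ref *ra = a, *rb = b;

        if (ra->t.block != rb->t.block)
            return (ra->t.block < rb->t.block) ? -1 : 1;
        return (ra->t.offset > rb->t.offset) - (ra->t.offset < rb->t.offset);
    }

    /* Stand-in for the heap access: fill in the row for one TID. */
    static void
    fetch_heap_tuple(const tid *t, fetched_row *row)
    {
        (void) t;
        row->visible = 1;
    }

    void
    fetch_batch_physical_order(const tid *index_order, size_t n, fetched_row *out)
    {
        tid_ref *phys = malloc(n * sizeof(tid_ref));
        size_t   i;

        if (phys == NULL)
            return;                     /* error handling elided */

        for (i = 0; i < n; i++)
        {
            phys[i].t = index_order[i];
            phys[i].index_pos = i;      /* remember where it came from */
        }
        qsort(phys, n, sizeof(tid_ref), tid_ref_cmp);

        /* One forward sweep over the heap; results land back in index order. */
        for (i = 0; i < n; i++)
            fetch_heap_tuple(&phys[i].t, &out[phys[i].index_pos]);

        free(phys);
    }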
Re: [HACKERS] proposal: be smarter about i/o patterns in index scan
> -----Original Message-----
> From: [EMAIL PROTECTED] on behalf of Tom Lane
>
> > Or:
> >   2. Sort AND COPY TID's into physical order.
> >   3. Read heap in phy. order, match results to un-sorted TID list.
> > That sounds quite cheap.
>
> No, it sounds like handwaving. What's your "match results" step, and

You got me :-)

> does it take less than O(N^2) time?

Less than O(N^2)??? Geez, I think so!

> How do you get to *return* the tuples in index order, without having
> read them all before you return the first one? (Which is really what
> the problem is with the sort alternative, at least for fast-start cases
> such as the LIMIT 1 example.)

(You'll have to pardon me for thinking as I type... and for being too long winded :-)

The matching step takes O(1) time. If you have a hard maximum on the number of TIDs you'll grab from the index (which sounds right), you could store the TID list in a vector. The IO-order-sorted list could be a singly-linked list OR a vector, but either way each entry would point back to an offset in the index-ordered list. The index-ordered list would be (sizeof(TID) * max_TID_count) bytes, and the IO-ordered list would be ((sizeof(TID) + sizeof(int)) * max_TID_count) bytes. If a TID is 8 bytes and we're going to fetch 2048 index entries per iteration, that gets us (*counting on fingers*) 16K for the index-ordered list + 24K for the IO-ordered list.

The index-ordered list could also be a singly-linked list, in which case the mapping would be by pointer. If both lists are singly-linked lists, then the size overhead (for a fully realized 2048-entry scan) would be ((sizeof(TID) + sizeof(void*)) * max_TID_count) + ((sizeof(TID) + sizeof(void*) + sizeof(void*)) * max_TID_count) = (*more counting on fingers*) ... 24K + 32K = 56K. Hmm, the vector style would be crazy expensive for small queries; the linked-list approach sounds pretty good though. I assume no one thinks doing this would be completely free :-) And if you did the IO-ordered list as a vector, you'd save the linked-list overhead and you would know exactly how big to make it, so it wouldn't have to hurt you on small queries. 2048 entries seems pretty darn generous, actually.

> How do you get to *return* the tuples in index order, without having
> read them all before you return

Hmm, the worst case scenario would be to scan the heap in exactly the reverse of index order. With a 2048-entry batch size, you'd have a fair amount of overhead on a LIMIT 1 :-) Unless the index scanner could be given a maximum tuple count, rather than just 'knowing' it from a GUC value. Would this dirty up the plan tree beyond all recognition?

> What happens when you run out of memory for the list of TIDs ... er,
> make that two lists of TIDs?

Now you really got me. Hand waving: the same thing you do when you run out of memory anywhere else :-) By configuring the server to fetch index entries in 1-entry batches, you'd be right back to the overhead (or very nearly so) of the current implementation, so it would only hurt people who chose to be hurt by it :-)

-Glen
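[A back-of-the-envelope check of the vector sizes counted on fingers above, assuming an 8-byte TID, a 4-byte back-offset, and a 2048-entry batch; both structs are assumed illustrations, not real PostgreSQL data structures.]

    #include <stdio.h>

    #define BATCH_SIZE 2048

    typedef struct { unsigned int block; unsigned short offset, unused; } tid;  /* 8 bytes */
    typedef struct { tid t; int index_pos; } tid_ref;   /* TID plus back-offset: 12 bytes */

    int
    main(void)
    {
        printf("index-ordered vector: %zu bytes\n", sizeof(tid) * BATCH_SIZE);      /* ~16 KB */
        printf("IO-ordered vector:    %zu bytes\n", sizeof(tid_ref) * BATCH_SIZE);  /* ~24 KB */
        return 0;
    }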
Re: [HACKERS] Why frequently updated tables are an issue
> It has been suggested in past to add such a visibility to index tuple
> header so that index and heaps can be cleaned out of order. In such a
> case other background

It seems to me that the benefit of this wouldn't be all that impressive *when accessing the cache*, which is the problem this discussion is about. (Disk access would occur more commonly with large tables, which I'll ignore.) Let's say the total scan time for a query on a very dirty table is 100ms. It seems safe to assume that the scan time for the index would be *roughly* half that of the heap. If visibility could be determined by looking at just the index tuple, you'd cut your query scan time down to 50ms. When the clean-table case is 7ms total scan time, the difference between 50 and 100ms is not much of an issue; either way, it's still way too high!

> However increasing index footprint seems to be a tough sell.

And rightly so, IMO.

Glen Parker
Re: [HACKERS] Point in Time Recovery
Simon Riggs wrote:
> On Thu, 2004-07-15 at 23:18, Devrim GUNDUZ wrote:
> > Thanks for the vote of confidence, on or off list. Too many people
> > spend a lot of money on proprietary databases, just for some missing
> > features in PostgreSQL.
>
> Agreed - PITR isn't aimed at existing users of PostgreSQL. If you use
> PostgreSQL already, even though it doesn't have PITR, then you are
> quite likely to be able to keep going without it. Most commercial users
> won't touch anything that doesn't have PITR.

> Agreed. I am surprised at how few requests we have gotten for PITR. I
> assume people are either using replication or not considering us.

Don't forget that there are (must be) lots of us who know it's coming and are just waiting until it's available. I haven't requested it per se but, believe me, I'm waiting for it :-)
Re: [HACKERS] Proposal for Implementing read-only queries during wal replay (SoC 2007)
I'll throw in my vote; I would find this quite useful.

-Glen

Florian G. Pflug wrote:
> I plan to submit a proposal for implementing support for read-only
> queries during wal replay as a Google Summer of Code 2007 project.
> I've been browsing the postgres source code for the last few days, and
> came up with the following plan for an implementation. I'd be very
> interested in any feedback on the proposal - especially of the "you
> overlooked this and that, it can never work that way" kind ;-)
Re: [HACKERS] more than one index in a single heap pass?
Andrew Dunstan wrote:
> Well, yes, it's some of it, and in theory Tom's late addition of a
> queue that gets all the dependencies of a table as soon as the table
> data is restored should make that work better. But of course, that's
> not the only time indexes are created, and each index creation command
> will be doing its own heap processing, albeit that synchronised
> scanning will make that lots cheaper. As I said originally, it was just
> an idle thought that came to me today.

Sounds to me like another reason to separate index definition from creation. If an index can be defined but not yet created or valid, then you could imagine syntax like:

  DEFINE INDEX blahblah1 ON mytable (some fields);
  DEFINE INDEX blahblah2 ON mytable (some other fields);
  [RE]INDEX TABLE mytable;

...provided that REINDEX TABLE could recreate all indexes simultaneously, as you suggest.

-Glen