[HACKERS] Preliminary notes about hash index concurrency (long)
I've been looking at fixing the problem reported a few days ago whereby
a bucket split in a hash index messes up the state of concurrent scans
of the index, possibly causing some tuples to be missed by the scans.
AFAICS the only way to fix this is to prevent such a concurrent split.
Accordingly, I've been trying to redesign the hash index locking
mechanisms to make that possible, and while I'm at it eliminate the
various internal deadlock risks that presently exist in hash indexes.
Attached are some design notes --- any comments?

			regards, tom lane


Lock definitions
----------------

We use both lmgr locks (heavyweight locks) and buffer context locks
(LWLocks) to control access to a hash index.  lmgr locks are needed for
long-term locking, since there is a (small) risk of deadlock, which we
must be able to detect.  Buffer context locks are used for short-term
access control to individual pages of the index.

We define the following lmgr locks for a hash index:

LockPage(rel, 0) represents the right to modify the hash-code-to-bucket
mapping.  A process attempting to enlarge the hash table by splitting a
bucket must exclusive-lock this lock before modifying the metapage data
representing the mapping.  Processes intending to access a particular
bucket must share-lock this lock until they have acquired lock on the
correct target bucket.

LockPage(rel, page), where page is the page number of a hash bucket page,
represents the right to split or compact an individual bucket.  A process
splitting a bucket must exclusive-lock both old and new halves of the
bucket until it is done.  A process doing VACUUM must exclusive-lock the
bucket it is currently purging tuples from.  Processes doing scans or
insertions must share-lock the bucket they are scanning or inserting
into.  (It is okay to allow concurrent scans and insertions.)

The lmgr lock IDs corresponding to overflow pages are currently unused.
These are available for possible future refinements.

Note that these lock definitions are conceptually distinct from any sort
of lock on the pages whose numbers they share.  A process must also
obtain read or write buffer lock on the metapage or bucket page before
accessing said page.

Processes performing hash index scans must hold share lock on the bucket
they are scanning throughout the scan.  This seems to be essential, since
there is no reasonable way for a scan to cope with its bucket being split
underneath it.  This creates a possibility of deadlock external to the
hash index code, since a process holding one of these locks could block
waiting for an unrelated lock held by another process.  If that process
then does something that requires exclusive lock on the bucket, we have
deadlock.  Therefore the bucket locks must be lmgr locks so that deadlock
can be detected and recovered from.  This also forces the page-zero lock
to be an lmgr lock, because as we'll see below it is held while
attempting to acquire a bucket lock, and so it could also participate in
a deadlock.

Processes must obtain read (share) buffer context lock on any hash index
page while reading it, and write (exclusive) lock while modifying it.
To prevent deadlock we enforce these coding rules: no buffer lock may be
held long term (across index AM calls), nor may any buffer lock be held
while waiting for an lmgr lock, nor may more than one buffer lock be held
at a time by any one process.  (The third restriction is probably
stronger than necessary, but it makes the proof of no deadlock obvious.)
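To make the buffer-lock rules concrete, here is a minimal sketch of how
one hash page would be visited under them, using the ordinary
buffer-manager calls.  This is purely illustrative, not part of the
patch; the function name and the elided page examination are invented:

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/*
 * Visit one hash page under the coding rules: the buffer context lock
 * is held only around the actual page access, and is always dropped
 * before we could block on an lmgr lock or on another buffer, so
 * buffer locks can never participate in a deadlock cycle.
 */
static void
visit_one_hash_page(Relation rel, BlockNumber blkno)
{
	Buffer		buf;
	Page		page;

	buf = ReadBuffer(rel, blkno);
	LockBuffer(buf, BUFFER_LOCK_SHARE);		/* read (share) context lock */
	page = BufferGetPage(buf);

	/* ... examine "page"; copy out anything we will need later ... */

	LockBuffer(buf, BUFFER_LOCK_UNLOCK);	/* dropped before any wait */
	ReleaseBuffer(buf);
}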
Pseudocode algorithms
---------------------

The operations we need to support are: readers scanning the index for
entries of a particular hash code (which by definition are all in the
same bucket); insertion of a new tuple into the correct bucket; enlarging
the hash table by splitting an existing bucket; and garbage collection
(deletion of dead tuples and compaction of buckets).  Bucket splitting is
done at the conclusion of any insertion that leaves the hash table more
full than the target load factor, but it is convenient to consider it as
an independent operation.  Note that we do not have a bucket-merge
operation --- the number of buckets never shrinks.  Insertion, splitting,
and garbage collection may all need access to freelist management, which
keeps track of available overflow pages.

The reader algorithm is:

	share-lock page 0 (to prevent active split)
	read/sharelock meta page
	compute bucket number for target hash key
	release meta page
	share-lock bucket page (to prevent split/compact of this bucket)
	release page 0 share-lock
	-- then, per read request:
	read/sharelock current page of bucket
	step to next page if necessary (no chaining of locks)
	get tuple
	release current page
	-- at scan shutdown:
	release bucket share-lock

By holding the page-zero lock until lock on the target bucket is
obtained, the reader ensures that the target bucket calculation is valid
(otherwise the bucket might be split before the reader arrives at it,
and the target entries might go into the new bucket).
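For concreteness, the setup phase of the reader algorithm might be
rendered in C about like this.  LockPage/UnlockPage and the buffer calls
are the existing lmgr and buffer-manager primitives; hash_key_to_bucket()
and bucket_to_blkno() are placeholder names for the metapage arithmetic,
not real functions:

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "utils/rel.h"

/* placeholder declarations for the elided metapage arithmetic */
extern uint32 hash_key_to_bucket(uint32 hashkey, Buffer metabuf);
extern BlockNumber bucket_to_blkno(uint32 bucket);

static BlockNumber
hash_reader_setup(Relation rel, uint32 hashkey)
{
	Buffer		metabuf;
	BlockNumber bucket_blkno;

	LockPage(rel, 0, ShareLock);			/* prevent active split */

	metabuf = ReadBuffer(rel, 0);			/* metapage is block zero */
	LockBuffer(metabuf, BUFFER_LOCK_SHARE);
	/* compute bucket number (and its block number) for the hash key */
	bucket_blkno = bucket_to_blkno(hash_key_to_bucket(hashkey, metabuf));
	LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
	ReleaseBuffer(metabuf);					/* release meta page */

	/* prevent split/compact of this bucket for the life of the scan */
	LockPage(rel, bucket_blkno, ShareLock);

	UnlockPage(rel, 0, ShareLock);			/* release page 0 share-lock */

	return bucket_blkno;
}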
Re: [HACKERS] Preliminary notes about hash index concurrency (long)
Tom Lane wrote:
> I've been looking at fixing the problem reported a few days ago whereby
> a bucket split in a hash index messes up the state of concurrent scans
> of the index, possibly causing some tuples to be missed by the scans.
> AFAICS the only way to fix this is to prevent such a concurrent split.
> Accordingly, I've been trying to redesign the hash index locking
> mechanisms to make that possible, and while I'm at it eliminate the
> various internal deadlock risks that presently exist in hash indexes.
> Attached are some design notes --- any comments?

Seems you are adding locking similar to what we already do in btree.

I know we have two sets of hash codes --- the one used for hash indexes,
and another used for hash joins and now aggregates and subqueries.  I
assume these changes are for hash indexes.

I know someone reported a problem with the hash indexes (data loss,
serious) --- was that a new 7.4 bug, or something that has existed for a
long time?

When were you considering making these changes?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  [EMAIL PROTECTED]                    |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Re: [HACKERS] Preliminary notes about hash index concurrency (long)
Bruce Momjian [EMAIL PROTECTED] writes:
> Tom Lane wrote:
>> I've been looking at fixing the problem reported a few days ago
>> whereby a bucket split in a hash index messes up the state of
>> concurrent scans of the index, possibly causing some tuples to be
>> missed by the scans.

> Seems you are adding locking similar to what we already do in btree.

Not adding locking --- hash already has locking.  The problem is that
the locking is wrong (both inadequate and deadlock-prone :-().

> I know we have two sets of hash codes --- the one used for hash
> indexes, and another used for hash joins and now aggregates and
> subqueries.

There's only one set of hash computation routines now.  But you are
right that the issues under discussion only affect hash indexes, not
in-memory hash tables.

> I know someone reported a problem with the hash indexes (data loss,
> serious) --- was that a new 7.4 bug or something that has existed for a
> long time?

AFAICT the bug's been there since Berkeley days.

> When were you considering making these changes?

Now.  We have three choices: fix it, ship 7.4 with a known data-loss
bug, or remove hash indexes.  Which do you like?

			regards, tom lane
Re: [HACKERS] Preliminary notes about hash index concurrency (long)
Tom Lane [EMAIL PROTECTED] writes:
> If multiple inserters failed to split, the index might still be
> overfull, but eventually, the index will not be overfull and split
> attempts will stop.  (We could make a successful splitter loop to see
> if the index is still overfull, but I think I prefer distributing the
> split overhead across successive insertions.)

If one backend is executing a query but the client has paused reading
records, is it possible the shared lock on the index bucket would be
held for a long time?  If so, wouldn't it be possible for an arbitrarily
large number of records to be inserted while the lock is held,
eventually causing the bucket to become extremely large?

Is there a maximum size at which the bucket split MUST succeed, or is it
purely a performance issue that the buckets be reasonably balanced?
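For concreteness, I picture the no-wait split attempt in the quoted
design working roughly like the sketch below.  ConditionalLockPage is
lmgr's no-wait page lock; the function name and the elided steps are my
guesses, not the actual patch:

#include "postgres.h"
#include "storage/lmgr.h"
#include "utils/rel.h"

/*
 * Attempt a bucket split without ever waiting for a lock.  If any
 * needed lock is unavailable, give up immediately: the table stays
 * overfull and some later insertion will retry.
 */
static void
hash_try_split(Relation rel, BlockNumber obucket_blkno,
			   BlockNumber nbucket_blkno)
{
	/* no-wait attempt on the hash-code-to-bucket mapping lock */
	if (!ConditionalLockPage(rel, 0, ExclusiveLock))
		return;				/* someone else is splitting; let them */

	/* no-wait attempt on the bucket being split */
	if (!ConditionalLockPage(rel, obucket_blkno, ExclusiveLock))
	{
		/* a scan or insert holds it; back off rather than block */
		UnlockPage(rel, 0, ExclusiveLock);
		return;
	}

	/* nobody can know about the new bucket yet, so this cannot block */
	LockPage(rel, nbucket_blkno, ExclusiveLock);

	/* ... update metapage mapping, move tuples to the new bucket ... */

	UnlockPage(rel, nbucket_blkno, ExclusiveLock);
	UnlockPage(rel, obucket_blkno, ExclusiveLock);
	UnlockPage(rel, 0, ExclusiveLock);
}

--
greg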
Re: [HACKERS] Preliminary notes about hash index concurrency (long)
Tom Lane wrote:
> Bruce Momjian [EMAIL PROTECTED] writes:
>> Tom Lane wrote:
>>> I've been looking at fixing the problem reported a few days ago
>>> whereby a bucket split in a hash index messes up the state of
>>> concurrent scans of the index, possibly causing some tuples to be
>>> missed by the scans.
>>
>> Seems you are adding locking similar to what we already do in btree.
>
> Not adding locking --- hash already has locking.  The problem is that
> the locking is wrong (both inadequate and deadlock-prone :-().

Yea, I meant adding locking sophistication similar to btree. :-)

>> I know we have two sets of hash codes --- the one used for hash
>> indexes, and another used for hash joins and now aggregates and
>> subqueries.
>
> There's only one set of hash computation routines now.  But you are
> right that the issues under discussion only affect hash indexes, not
> in-memory hash tables.

Oh, yes, you merged them, but the index code is somehow separate for
locking issues (per-backend hashes don't have to lock anything).

>> I know someone reported a problem with the hash indexes (data loss,
>> serious) --- was that a new 7.4 bug or something that has existed for
>> a long time?
>
> AFAICT the bug's been there since Berkeley days.

Oh.

>> When were you considering making these changes?
>
> Now.  We have three choices: fix it, ship 7.4 with a known data-loss
> bug, or remove hash indexes.  Which do you like?

Since Berkeley, huh?  Seems like a big change.  I would think the
logical solution would be to fix it in 7.5, but that leaves us shipping
a hash index type that is now known to be buggy.  While it was always
buggy, we didn't know for sure until now.

We could disable hash indexes in 7.4, but then re-enable them in 7.5
with the fix.  That seems kind of strange because the current hash
indexes would be converted to btree, but then they would have to be
moved manually back to hash in 7.5.  Is there a way to convert them to
btree, but still have them dump as HASH in pg_dump, so when they are
moved to 7.5, they move back to hashes?  That might be just too weird.

If all the code changes are only in the hash indexes, and they are known
to be buggy, maybe we should just give it a shot for 7.4, knowing it
probably can't get worse than it already is (but it could).

I know I am just shooting out ideas, but I am sure one of them will
stick with the group.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  [EMAIL PROTECTED]                    |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Re: [HACKERS] Preliminary notes about hash index concurrency (long)
Bruce Momjian [EMAIL PROTECTED] writes:
> If all the code changes are only in the hash indexes, and they are
> known to be buggy, maybe we should just give it a shot for 7.4, knowing
> it probably can't get worse than it already is (but it could).

That's basically my opinion.  It's unlikely to get more broken than it
already is.  I've already found several unrelated bugs while studying
the code :-(.  For starters, it doesn't come close to working correctly
when there's more than one overflow-page-bitmap page.  Since we've not
heard reports of problems, I doubt anyone's exercised it with large
numbers of overflow pages.

			regards, tom lane
Re: [HACKERS] Preliminary notes about hash index concurrency (long)
Tom Lane wrote on Mon, 01.09.2003 at 15:41:
> Bruce Momjian [EMAIL PROTECTED] writes:
>> I know someone reported a problem with the hash indexes (data loss,
>> serious) --- was that a new 7.4 bug or something that has existed for
>> a long time?
>
> AFAICT the bug's been there since Berkeley days.

One could check how BSDDB (http://www.sleepycat.com) handles these
issues.  It is reported to have started as btree/hash index code
extracted from an early version of postgres, so perhaps one could at
least get some ideas there, though their locking / concurrency control
is probably much different.

--
Hannu
Re: [HACKERS] Preliminary notes about hash index concurrency (long)
Greg Stark [EMAIL PROTECTED] writes:
> Tom Lane [EMAIL PROTECTED] writes:
>> If multiple inserters failed to split, the index might still be
>> overfull, but eventually, the index will not be overfull and split
>> attempts will stop.

> If one backend is executing a query but the client has paused reading
> records, is it possible the shared lock on the index bucket would be
> held for a long time?

Yes.

> If so, wouldn't it be possible for an arbitrarily large number of
> records to be inserted while the lock is held, eventually causing the
> bucket to become extremely large?

Yes.

> Is there a maximum size at which the bucket split MUST succeed, or is
> it purely a performance issue that the buckets be reasonably balanced?

AFAICS it's purely a performance issue.  Note also that a hash index
will by definition have sucky performance on large numbers of equal
keys, so anyone who is using a hash index on such a column deserves what
they get.  Now you could possibly have this worst-case scenario even on
a column with well-scattered keys, but it seems improbable.

			regards, tom lane