Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On Wed, Sep 4, 2013 at 8:08 PM, Andres Freund and...@2ndquadrant.com wrote: You've proposed a framework and algorithm for something I'd really, really like to see. I don't think that it can work exactly as you proposed, so I roughly sketched out a solution I can see. I don't want my idea to win, I want an idea to win. I haven't fleshed out my idea to the point where I would consider it ready to implement. You're the patch author here whose plans are laid open to be scrutinized ;). If you think my idea has merit, use and adapt it to reality. If not, find another, better, solution. Even if our path to that goal is confrontational at times, the goal is to find a good solution, not the argument itself. Totally agreed. I haven't argued about INSERT ... DUPLICATE LOCK because the page locking scheme doesn't seem to work out for plain DUPLICATE. No need to think/argue about the fancier version in that case. Check. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On Sat, Aug 31, 2013 at 2:34 PM, Andres Freund and...@2ndquadrant.com wrote: After some thinking I don't think any solution primarily based on holding page level locks across other index operations is going to scale ok. I'd like to chime in with a large +1 for this sentiment and pretty much everything else Andres said further downthread. The operations across which you're proposing to hold buffer locks seem at least an order of magnitude too complex to get away with something like that. Concurrent readers will block in a non-interruptible wait if they try to access a buffer, and that's a situation that will be intolerable if, for example, it can persist across a disk I/O. And I don't see any way to avoid that. One possible alternative to inserting promises into the index pages themselves might be to use some kind of heavyweight lock. The way that SIREAD locks work is not entirely dissimilar to what's needed here, I think. Of course, the performance implications of checking for lots of extra locks during inserts could be pretty bad, so you'd probably need some way of avoiding that in common cases, which I don't know exactly how to do, but maybe there's a way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On Wed, Sep 4, 2013 at 1:26 PM, Robert Haas robertmh...@gmail.com wrote: Concurrent readers will block in a non-interruptible wait if they try to access a buffer, and that's a situation that will be intolerable if, for example, it can persist across a disk I/O. And I don't see any way to avoid that.

Then I have some bad news for you - that's already quite possible. _bt_insertonpg() is called with the very same buffer exclusively locked, and is where we do btree page splits. The first thing that _bt_split() does is this:

    /* Acquire a new page to split into */
    rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);

(Obviously this may ultimately result in the storage manager extending the index relation.) Plus the split is WAL-logged immediately afterwards, which could result in us blocking on someone else's I/O under contention (granted, the XLogInsert scaling patch has now considerably ameliorated that general problem). All the while holding an exclusive lock on the same buffer. Note also that _bt_insertonpg() is called again recursively after a page split. And of course we WAL-log btree index tuple insertion proper all the time.

Let me be clear about something, though: I am not at all dismissive of Andres' concerns. I was concerned about many of the same things before I posted the patch.

I think that Andres and I ought to re-frame this discussion a little bit. Right now, the presumption we seem to be making, perhaps without even realizing it, is that this is about providing functionality equivalent to MySQL's INSERT IGNORE: insert tuples proposed for insertion where possible, otherwise do not. However, Andres and I, not to mention almost every Postgres user, are actually much more interested in something like INSERT...ON DUPLICATE KEY LOCK FOR UPDATE. That's what this mechanism has to support, even if it is *technically* possible to commit just INSERT...ON DUPLICATE KEY IGNORE and punt on these other questions. It was silly of me not to do that up-front.
So I'm going to try to produce a patch that does this as well for my next revision. Maybe this will enable Andres to refute my position that the buffer locking approach to speculative insertion/value locking may actually be acceptable. If that gets us closer to having the feature committed in some form, then I welcome it. I fully expect to be held to this new standard - it would be insane to do anything less. I don't want to throw out an old IGNORE value locking mechanism and invent a whole new one for upserting a little bit down the line. -- Peter Geoghegan
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On 2013-09-04 15:01:57 -0700, Peter Geoghegan wrote: On Wed, Sep 4, 2013 at 1:26 PM, Robert Haas robertmh...@gmail.com wrote: Concurrent readers will block in a non-interruptible wait if they try to access a buffer, and that's a situation that will be intolerable if, for example, it can persist across a disk I/O. And I don't see any way to avoid that. Then I have some bad news for you - that's already quite possible. _bt_insertonpg() is called with the very same buffer exclusively locked, and is where we do btree page splits. The first thing that _bt_split does is this: /* Acquire a new page to split into */ rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE); (Obviously this may ultimately result in the storage manager extending the index relation). I don't think that's an argument for much, TBH. Those operations are far less heavyweight than the ones you're proposing to hold the pages locked over, and there actually is a forward progress guarantee. And it's very hard to avoid locking a page exclusively once you've decided that you need to split the page. You cannot just release the lock while you look for a victim buffer. I think that Andres and I ought to re-frame this discussion a little bit. Right now, the presumption we seem to be making, perhaps without even realizing it, is that this is about providing functionality equivalent to MySQL's INSERT IGNORE; insert tuples proposed for insertion where possible, otherwise do not. However, Andres and I, not to mention almost every Postgres user, are actually much more interested in something like INSERT...ON DUPLICATE KEY LOCK FOR UPDATE. Yes, the promises approach gets more advantageous if you think about UPSERT, because most of the work will be paid off when the UPDATE occurs. Maybe this will enable Andres to refute my position that the buffer locking approach to speculative insertion/value locking may actually be acceptable. Sorry to be harsh here, but I don't think I need to do that.
I've explained most of the reasons I see that that approach won't work out, and so far I don't see those refuted. And to me those issues seem to be fatal for the approach. If you find a solution to the problems noted above - great. So far it seems neither Robert nor I see how that is possible, but that obviously doesn't mean it's impossible that you find a way. But why should I argue further until you prove me wrong (newer patch or explaining changed algorithms)? If you don't think my arguments are valid, well, I've brought up those I see as relevant and that's it. Can't do much further. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On Wed, Sep 4, 2013 at 3:39 PM, Andres Freund and...@2ndquadrant.com wrote: Sorry to be harsh here, but I don't think I need to do that. I've explained most of the reasons I see that that approach won't work out, and so far I don't see those refuted. And to me those issues seem to be fatal for the approach. If you find a solution to the problems noted above - great. So far it seems neither Robert nor I see how that is possible, but that obviously doesn't mean it's impossible that you find a way. It seems you've misunderstood. My position was that I don't think it's terribly useful to have a discussion about approaches to value locking without considering how that needs to fit in with row locking too. So it makes sense as an immediate goal to introduce that into the patch. Any scheme is going to be constrained by having to think about the interplay between value locking and row locking going forward. For example, in my scheme, I couldn't block on locking the row if that meant that buffer locks would be held indefinitely. There are also deadlock hazards there for either scheme that must be carefully considered. What possible objection could you have? Suppose it was the case that I was dead set on using buffer locking like this, because I'm stubborn or whatever. I've just made life harder for myself, while probably not putting the same degree of burden on alternative proposals. Maybe I am stubborn, but I don't think I'm stubborn about the basic approach taken in this particular patch. I've merely been pointing out, as I feel is my role as a participant in the community's adversarial system of reaching agreement, the problems that exist with your proposal, and some inconsistencies in your objections to mine. Obviously the basic approach will remain the most difficult and probably most controversial part of this. Even if I threw my hands up and immediately accepted everything you said, that would still be true.
We need to get all of the constraints in place sooner rather than later. But why should I argue further until you prove me wrong (newer patch or explaining changed algorithms)? I didn't ask you to. You shouldn't. If you don't think my arguments are valid, well, I've brought up those I see as relevant and that's it. Can't do much further. Uh, I just said that I thought your arguments were totally valid. I couldn't have been clearer about that. Actually, I'm pretty surprised that you haven't been the one insisting that I add a row locking component from quite early on, for exactly these reasons. -- Peter Geoghegan
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
Hi, We seem to be miscommunicating a bit. You've proposed a framework and algorithm for something I'd really, really like to see. I don't think that it can work exactly as you proposed, so I roughly sketched out a solution I can see. I don't want my idea to win, I want an idea to win. I haven't fleshed out my idea to the point where I would consider it ready to implement. You're the patch author here whose plans are laid open to be scrutinized ;). If you think my idea has merit, use and adapt it to reality. If not, find another, better, solution. Even if our path to that goal is confrontational at times, the goal is to find a good solution, not the argument itself. I haven't argued about INSERT ... DUPLICATE LOCK because the page locking scheme doesn't seem to work out for plain DUPLICATE. No need to think/argue about the fancier version in that case. Regards, Andres Please excuse brevity and formatting - I am writing this on my mobile phone. Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On Wed, Sep 4, 2013 at 5:08 PM, Andres Freund and...@2ndquadrant.com wrote: I don't want my idea to win, I want an idea to win. I know. I want the same thing. You're the patch author here whose plans are laid open to be scrutinized ;). If you think my idea has merit, use and adapt it to reality. If not, find another, better, solution. Sure. Even if our path to that goal is confrontational at times, the goal is to find a good solution, not the argument itself. Agreed. I haven't argued about INSERT ... DUPLICATE LOCK because the page locking scheme doesn't seem to work out for plain DUPLICATE. No need to think/argue about the fancier version in that case. I see where you're coming from, but my point is precisely that adding a row locking component *isn't* fancier. I've come to realize that it's an integral part of the patch, and that my previous omission of row locking - and the subsequent defence of that decision I made in passing - was ridiculous. In a world where IGNORE/not locking is a feature we support, it can only exist as an adjunct to ON DUPLICATE KEY LOCK - certainly not the other way around. The tail cannot be allowed to wag the dog. In posting the patch with a row locking component, I'll only be asking you to consider that aspect separately. You may find that seeing the problems I encounter and how I handle them will make you (or others) re-assess your thoughts on value locking in a direction that nobody expects right now. Equally, I myself may reassess things. Now, I don't guarantee that that's the case, but it certainly seems very possible. And so even if I were to concede right now that the buffer locking approach is not workable, I feel it would be a little premature to seriously get down to talking about the alternatives in detail. -- Peter Geoghegan
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On 05/09/13 08:26, Robert Haas wrote: On Sat, Aug 31, 2013 at 2:34 PM, Andres Freund and...@2ndquadrant.com wrote: After some thinking I don't think any solution primarily based on holding page level locks across other index operations is going to scale ok. I'd like to chime in with a large +1 for this sentiment and pretty much everything else Andres said further downthread. The operations across which you're proposing to hold buffer locks seem at least an order of magnitude too complex to get away with something like that. Concurrent readers will block in a non-interruptible wait if they try to access a buffer, and that's a situation that will be intolerable if, for example, it can persist across a disk I/O. And I don't see any way to avoid that. One possible alternative to inserting promises into the index pages themselves might be to use some kind of heavyweight lock. The way that SIREAD locks work is not entirely dissimilar to what's needed here, I think. Of course, the performance implications of checking for lots of extra locks during inserts could be pretty bad, so you'd probably need some way of avoiding that in common cases, which I don't know exactly how to do, but maybe there's a way. How about an 'expensive bit' (of course, renamed to sound more professional and to better indicate what it does!) - if the bit is set, then do the expensive processing. This should have minimal impact in the common case, so extensive checking would only be required when lots of locks need to be checked. I strongly suspect that the situation is way more complicated than I imply above - but possibly a more sophisticated version of the above might help?
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On 09/03/2013 06:18 AM, Peter Geoghegan wrote: On Mon, Sep 2, 2013 at 6:25 AM, Craig Ringer cr...@2ndquadrant.com wrote: It'll be yet another way for people to get upsert wrong, of course. They'll use a wCTE with RETURNING REJECTS to do an UPDATE of the rejects w/o locking the table against writes first. Documenting this pitfall should be enough, though. My preferred solution is to actually provide a variant to lock the rows implicated in the would-be unique constraint violation. Obviously that's a harder problem to solve. That'd certainly be the ideal, but as you say is far from simple, and IIRC prior discussions here have suggested the SSI / predicate locking stuff won't help. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On Tue, Sep 3, 2013 at 12:52 AM, Craig Ringer cr...@2ndquadrant.com wrote: That'd certainly be the ideal, but as you say is far from simple, and IIRC prior discussions here have suggested the SSI / predicate locking stuff won't help. I think that by far the strongest argument for what Andres suggested (that has, incidentally, been suggested by a number of others), which is to insert what he termed promise index tuples before a heap tuple, is that there is no sensible way to lock the tuples while holding an exclusive buffer lock on an index. It could take basically any amount of time to acquire those locks (see GetTupleForTrigger() to get a rough idea of how that ought to work), and it would be totally unacceptable if that meant that lots of people blocked on the page lock that an upserter held while it tried to acquire locks on tuples (we could only unlock the buffer lock/locked value when the rows were locked). So while I think what I've done here is probably workable for the ON DUPLICATE KEY IGNORE case, perhaps I'm being short-sighted in focusing on that. I'm surprised Andres didn't call me out on this himself, actually. -- Peter Geoghegan
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On Tue, Sep 3, 2013 at 2:11 AM, Peter Geoghegan p...@heroku.com wrote: and it would be totally unacceptable if that meant that lots of people blocked on the page lock that an upserter held while it tried to acquire locks on tuples By which I mean: it seems shaky that I'd then be assuming that to lock all of a row's unique index values was to lock the row itself, and so that there was no need to worry about blocking on acquiring an actual row lock while holding all those exclusive/write buffer locks. -- Peter Geoghegan
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On 2013-09-02 21:59:42 -0700, Peter Geoghegan wrote: I can't really agree with this part. First off, a heap_insert() certainly isn't a lightweight operation. There are several things that can take extended time:
* toasting of rows (writing a gigabyte worth of data...)
* extension of the heap/toast table
* writeout of a dirty victim buffer if the page isn't in s_b
* reading in the target page if the page isn't in s_b
* waiting for an exclusive lock on the target pages
These are all valid concerns. I'll need to better assess just how bad this can get (hopefully with help from others). But some of this stuff can surely be done separately for my case. For example, surely we could make TOASTing occur after index insertion proper, when locks are released, without too much work. I don't think those are that easy to solve, but I'd like to be surprised. E.g. the toasting cannot really happen after the main table insert, since the inserted row needs to refer to the toast id. And you don't even know the final size of a row without doing the toasting, since how much it will get compressed depends on the amount of free space on the current target buffer. I'd expect that with the current implementation strategy, if you e.g. made pgbench use INSERT ... ON DUPLICATE IGNORE without ever actually hitting duplicates, you'd see a very, very noticeable hit in scalability. I wouldn't expect a single session to get much slower, but trying a dozen or two should show a very noticeable dip. I do not accept the premise of your questions regarding the patch's performance, which appears to be that it's fair to expect the same level of performance of INSERT ... ON DUPLICATE IGNORE as it is to expect of regular INSERT. No scheme is going to get that, including your scheme, which necessitates WAL-logging everything related to index tuple insertion twice for each unique index.
We certainly didn't require SSI to perform as well as the current REPEATABLE READ, and certainly not as well as READ COMMITTED, and for good reasons. I think it's ok that INSERT .. IGNORE itself is noticeably slower than a normal INSERT. What I am concerned about is it negatively impacting overall scalability. I'm not going to bother benchmarking the patch myself until I give it another whack. There is some really obvious low-hanging performance fruit. Obviously by this I mean doing as little as possible when holding the exclusive lock. Sounds fair. I am pretty sure it would be better. We know that nbtree works pretty damn well under concurrency. And the promise logic wouldn't change anything about that. It would have multiple extra steps. Including WAL-logging of the initial insertion, and clean-up for the IGNORE case. And WAL-logging of the cleanup. Instead of doing exactly nothing extra in the IGNORE case, which, under my proposal, has an additional duration of the exclusive lock being held of zero (instead of the ereport() call we just immediately release the lock and report the duplicate). The IGNORE case matters too, and there is going to be considerably more exclusive locking for that case with your proposal, before we even talk about other overhead. What I mean is that none of the individual proposed steps requires changing the locking semantics/granularity. I.e. we will never hold locks on more pages at once than we do for a normal insert. Yes, one page will get locked twice, but for shorter than a normal insert (all you need to do is to rewrite the target of the item), and without having any other pages locked. Furthermore, I am pretty sure that locking an index page *and* related heap pages exclusively at the same time will be a deadlock nightmare, even in the case of only having a single unique index. It seems like what you're saying here is: I don't know, this extended index locking thing seems pretty sketchy to me.
Which is a perfectly reasonable thing to say in the absence of better analysis from either of us - it is sketchy. Yes, that's basically what I am saying. I'd really like some input from others on this. Perhaps you could write up an email/thread just about the locking semantics after the next version of the patch (since that presumably will change the semantics)? I have taken a certain amount of reassurance from the fact that I was not able to observe any deadlocks when pushing the patch hard - perhaps that's naive. I mean hammering a table with inserts using pgbench for 3 hours, with 64 INSERT...DUPLICATE KEY IGNORE sessions. PK values were over a range of values from 115 million. This was with shared_buffers set to 512MB and assertions enabled. It stood up without any deadlocks or other manifestations of bugs. Granted, I did that before fixing the bug you've highlighted (but after you pointed it out), and so I haven't tested the multi-index case in the same way. I can readily believe that there are no deadlocks with rather straightforward schemas/queries.
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On Fri, 2013-08-30 at 15:09 -0700, Peter Geoghegan wrote: The attached WIP patch implements this for Postgres, with a few notable differences: Your patch breaks the ECPG regression tests: ../../preproc/ecpg --regression -I./../../include -o insupd.c -I. insupd.pgc insupd.pgc:22: ERROR: syntax error at or near "into" make[3]: *** [insupd.c] Error 3
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On Sun, Sep 1, 2013 at 11:38 PM, Peter Eisentraut pete...@gmx.net wrote: ../../preproc/ecpg --regression -I./../../include -o insupd.c -I. insupd.pgc insupd.pgc:22: ERROR: syntax error at or near "into" make[3]: *** [insupd.c] Error 3 I'll work on fixing it then. Thanks. -- Peter Geoghegan
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On 08/31/2013 06:40 AM, Josh Berkus wrote: 3) RETURNING is expanded - RETURNING REJECTS * is now possible where that makes sense. Oh, nifty! OK, now I can *really* use this feature. Absolutely; especially combined with COPY to a staging TEMPORARY or UNLOGGED table. It'll be yet another way for people to get upsert wrong, of course. They'll use a wCTE with RETURNING REJECTS to do an UPDATE of the rejects w/o locking the table against writes first. Documenting this pitfall should be enough, though. Speaking of upsert, I'm starting to think that to solve the upsert problem without forcing full-table locking on people we need a lock type that conflicts only with DELETE/INSERT and a way to prevent UPDATEs from changing the key. Kind of like the table level inverse of FOR KEY UPDATE - a way to say you can change rows, just cannot add or remove from the set of keys. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On Mon, Sep 2, 2013 at 6:25 AM, Craig Ringer cr...@2ndquadrant.com wrote: It'll be yet another way for people to get upsert wrong, of course. They'll use a wCTE with RETURNING REJECTS to do an UPDATE of the rejects w/o locking the table against writes first. Documenting this pitfall should be enough, though. My preferred solution is to actually provide a variant to lock the rows implicated in the would-be unique constraint violation. Obviously that's a harder problem to solve. -- Peter Geoghegan
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On Sun, Sep 1, 2013 at 4:06 AM, Andres Freund and...@2ndquadrant.com wrote: We're looking for the first duplicate. So it would probably be okay for the IGNORE case to not bother retrying and re-locking if the other transaction committed (which, at least very broadly speaking, is the likely outcome). Hm. Could you explain this a bit more? Do you mean that we can stop trying to insert after the first violation (yes, sure if so)? Yeah, that's all I mean. Nothing controversial there. I can't really agree with this part. First off, a heap_insert() certainly isn't a lightweight operation. There are several things that can take extended time:
* toasting of rows (writing a gigabyte worth of data...)
* extension of the heap/toast table
* writeout of a dirty victim buffer if the page isn't in s_b
* reading in the target page if the page isn't in s_b
* waiting for an exclusive lock on the target pages
These are all valid concerns. I'll need to better assess just how bad this can get (hopefully with help from others). But some of this stuff can surely be done separately for my case. For example, surely we could make TOASTing occur after index insertion proper, when locks are released, without too much work. Secondly, you seem to be saying that such a lock would only block other sessions from finishing off tuple insertions - but it will block normal index searches as well? I have a hard time accepting the fact that one session using INSERT ... KEY IGNORE will block lookup of a range of values from the index in *other* sessions. I'm sorry that I created the impression that I was denying an impact on read queries as compared to regular insertion. Of course, regular insertion also blocks read queries for very small intervals, and there could be quite a bit of work to be done with that exclusive lock held already. I am only proposing increasing that period for this case, and usually by not terribly much. I'd expect that with the current implementation strategy, if you e.g. made pgbench use INSERT ... ON DUPLICATE IGNORE without ever actually hitting duplicates, you'd see a very, very noticeable hit in scalability. I wouldn't expect a single session to get much slower, but trying a dozen or two should show a very noticeable dip. I do not accept the premise of your questions regarding the patch's performance, which appears to be that it's fair to expect the same level of performance of INSERT ... ON DUPLICATE IGNORE as it is to expect of regular INSERT. No scheme is going to get that, including your scheme, which necessitates WAL-logging everything related to index tuple insertion twice for each unique index. We certainly didn't require SSI to perform as well as the current REPEATABLE READ, and certainly not as well as READ COMMITTED, and for good reasons. That said, I'm not dismissive of these concerns. As I said already, by all means, let us scrutinise the performance of the patch. I'm not going to bother benchmarking the patch myself until I give it another whack. There is some really obvious low-hanging performance fruit. Obviously by this I mean doing as little as possible when holding the exclusive lock. I think we could optimize away the second search with some cleverness. If we keep the pin on the page from the first search, we know that the promise will have to be either on that page or - after a page split - potentially on pages to the right, which we should be able to find by repeatedly stepping right, similar to the logic of doing the _bt_moveright() in _bt_doinsert(). That may well be possible, but we'd still have to do a binary search on the page. And under your proposal, that will happen twice - each time with an exclusive buffer lock held. Since our intent would be to update in-place, the entire binary search operation would need a full write/exclusive lock the second time around as well.
You might even end up doing more work with an exclusive/write lock held than under my proposal, before you even factor in the other increased costs (granted, that's not that likely, but you're still doing much more with exclusive locks held than in the regular insertion case). What you describe here is the kind of trick that could be very useful for another, existing common case -- blocking on an xid in the event of a possible duplicate (as you know, it's okay to hold pins pretty much indefinitely) -- and yet has remained unimplemented for many years. I am pretty sure it would be better. We know that nbtree works pretty damn well under concurrency. And the promise logic wouldn't change anything about that. It would have multiple extra steps. Including WAL-logging of the initial insertion, and clean-up for the IGNORE case. And WAL-logging of the cleanup. Instead of doing exactly nothing extra in the IGNORE case, which, under my proposal, has an additional duration of the exclusive lock being held of zero (instead of the ereport() call we just immediately release the lock and report the duplicate).
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On Sat, Aug 31, 2013 at 7:53 PM, Peter Geoghegan p...@heroku.com wrote: With a table with many unique indexes and many reasons to decide to reject a tuple proposed for insertion by INSERT...ON DUPLICATE KEY IGNORE, it isn't hard to imagine them all becoming heavily bloated very quickly. There may be no index tuple directly corresponding to some heap tuple where you might expect that, due to transactions aborting (as with traditional unique constraint violations) or because of HOT, but, prior to now, never the other way around (heap tuples are created first and physically killed last). That is an assumption that has existed since the Berkeley days, and it is something that may be treated as an invariant in some places. Autovacuum is one such place, I'd say. Isn't it the case that the autovacuum launcher daemon might not hear about all of this index-only bloat, even if it was extremely serious? Isn't the threshold it follows only in respect of heap tuples (even if the statistics collector does collect index tuple stats separately)? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On 2013-08-31 19:38:49 -0700, Peter Geoghegan wrote: On Sat, Aug 31, 2013 at 11:34 AM, Andres Freund and...@2ndquadrant.com wrote: Won't S2 in this case exclusively lock a page in foo_a_key (the one where 2 will reside on) for 3600s, while it's waiting for S1 to finish while getting the speculative insertion into foo_b_key? Yes. The way the patch currently handles that case is unacceptable. It needs to be more concerned about holding an exclusive lock on earlier-locked indexes when locking the second and subsequent indexes iff it blocks on another transaction in the course of locking the second and subsequent indexes. I think that the fix here may be to do something analogous to what the code already does: release all buffer locks, wait for the other transaction to commit/abort, and try again. The difference being that we try again across all unique indexes rather than just this one. I have other obligations to take care of over the next couple of days, so I don't really have time to code that up and satisfy myself that it's robust; I'd prefer to have something more concrete to back this thought up, but that's my immediate reaction for what it's worth. That could probably be made to work. I still have some concerns with the fundamentals of your concept, which I voice later in the mail... We're looking for the first duplicate. So it would probably be okay for the IGNORE case to not bother retrying and re-locking if the other transaction committed (which, at least very broadly speaking, is the likely outcome). Hm. Could you explain this a bit more? Do you mean that we can stop trying to insert after the first violation (yes, sure if so)? After some thinking I don't think any solution primarily based on holding page level locks across other index operations is going to scale ok. Just to be clear: I'm certainly not suggesting that it's okay to have exclusive buffer locks held for any amount of time longer than instantaneous. 
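The release-everything-and-retry approach described above can be sketched in miniature. This is an illustrative toy in plain Python, not PostgreSQL C; every name here (ToyIndex, speculative_insert, and so on) is invented, and the "locks" are simple flags. The point it demonstrates is only the protocol: on blocking partway through locking the unique indexes, release *all* locks before waiting, then retry across all unique indexes from scratch.

```python
# Toy model of the retry protocol: lock all unique indexes, and if any of
# them reports an in-flight conflicting inserter, release every lock,
# wait with no locks held, and retry the whole sequence.

class ToyIndex:
    def __init__(self):
        self.entries = {}   # key -> 'committed', or the xid of an in-flight inserter
        self.locked = False

    def lock(self):
        assert not self.locked
        self.locked = True

    def unlock(self):
        self.locked = False

    def check(self, key):
        """Returns 'free', 'duplicate', or the xid we would have to wait on."""
        value = self.entries.get(key)
        if value is None:
            return 'free'
        return 'duplicate' if value == 'committed' else value

def speculative_insert(indexes, keys, wait_for):
    """wait_for(xid) is called with *no* locks held; it should make the
    blocking transaction's outcome visible in the indexes."""
    while True:
        blocker = None
        outcome = 'inserted'
        for idx in indexes:
            idx.lock()
        try:
            for idx, key in zip(indexes, keys):
                state = idx.check(key)
                if state == 'duplicate':
                    outcome = 'duplicate'   # IGNORE case: just release and report
                    break
                if state != 'free':
                    blocker = state         # someone else is mid-insertion
                    break
            if blocker is None and outcome == 'inserted':
                for idx, key in zip(indexes, keys):
                    idx.entries[key] = 'committed'
        finally:
            for idx in indexes:
                idx.unlock()                # crucially: never wait while holding locks
        if blocker is None:
            return outcome
        wait_for(blocker)                   # then retry all indexes from scratch
```

A session that blocks on xid 101 simply drops its locks, waits, and then re-checks every unique index, finding the duplicate (or a free slot) on the retry.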
Furthermore, it isn't necessary to do anything different to what we currently do with non-unique btree indexes during speculative insertion - I don't treat them any differently, and they only get locked in the usual way at the usual time, after heap insertion. Much of the time, there'll only be a primary key index to lock, and you'll have increased the window of the exclusive write-lock (compared to the existing code when doing a regular insertion) by only a very small amount - the duration of heap tuple insertion (the baseline here is the duration of finding an insertion location on the page for the index tuple, tuple insertion + possible page split). We're talking about a write-lock on a btree leaf page, which is only going to block other sessions from *finishing off* tuple insertion, and then only for a minority of such insertions much of the time. I can't really agree with this part. First off, a heap_insert() certainly isn't a lightweight operation. There are several things that can take extended time:

* toasting of rows (writing a gigabyte worth of data...)
* extension of the heap/toast table
* writeout of a dirty victim buffer if the page isn't in s_b
* reading in the target page if the page isn't in s_b
* waiting for an exclusive lock on the target pages

Secondly, you seem to be saying that such a lock would only block other sessions from finishing off tuple insertions - but it will block normal index searches as well? I have a hard time accepting the fact that one session using INSERT ... KEY IGNORE will block lookup of a range of values from the index in *other* sessions. The question must be asked: if you don't think this will scale okay, what are you comparing it to? The most useful basis of comparison is likely to be other systems. 
Asking about uniqueness across many unique indexes, and getting them all to commit to their unanimous position for a moment in time, is inherently involved - I don't see a way to do it other than to lock values, and I don't see any granularity at which that's possible other than the btree page granularity. I'd expect that with the current implementation strategy if you e.g. made pgbench use INSERT .. ON DUPLICATE IGNORE without ever actually hitting duplicates, you'd see a very, very noticeable hit in scalability. I wouldn't expect a single session to get much slower, but trying a dozen or two should show a very noticeable dip. What I had in mind when playing around with this is the following trick: Just as in your patch, split index insertion into two phases and perform one of them before inserting the heap tuple. In what you called speculative insertion, don't stop at locking the page, but actually insert an index tuple in the index. As we haven't yet inserted the heap tuple we obviously can't insert an actual tuple. Instead set a flag on the item and reuse the item pointer to store the xid of the inserting process. I coined those items insertion promises... I did consider that sort
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On 2013-08-31 23:02:46 -0700, Peter Geoghegan wrote: On Sat, Aug 31, 2013 at 7:53 PM, Peter Geoghegan p...@heroku.com wrote: With a table with many unique indexes and many reasons to decide to reject a tuple proposed for insertion by INSERT...ON DUPLICATE KEY IGNORE, it isn't hard to imagine them all becoming heavily bloated very quickly. There may be no index tuple directly corresponding to some heap tuple where you might expect that, due to transactions aborting (as with traditional unique constraint violations) or because of HOT, but, prior to now, never the other way around (heap tuples are created first and physically killed last). That is an assumption that has existed since the Berkeley days, and it is something that may be treated as an invariant in some places. Autovacuum is one such place, I'd say. Isn't it the case that the autovacuum launcher daemon might not hear about all of this index-only bloat, even if it was extremely serious? Isn't the threshold it follows only in respect of heap tuples (even if the statistics collector does collect index tuple stats separately)? I don't think the amount of bloat should get that big, since we should immediately kill the promises when we've detected that we conflict on a following unique index. The reason vacuum needs to care at all is only that we might have crashed after inserting the promise but before deleting it and thus haven't cleaned up. That itself isn't too bad since that should happen infrequently, but if we then get an xid wraparound later we could be waiting on the wrong transaction. Sure, there would need to be some changes to that code. I think we can trigger that fairly easily by just triggering relation vacuums via relfrozenxid and enforcing an index bulkdelete on full-table vacuums even if there are no dead tuples. 
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
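The cleanup obligation described above - reaping promises orphaned by a crashed backend before xid wraparound makes their stored xids ambiguous - can be sketched as a toy pass. This is illustrative Python, not the nbtree bulkdelete API; the item representation (`('promise', xid)` vs. `('tuple', heap_tid)`) is an invented stand-in.

```python
# Toy sketch of the bulk-delete pass: any promise whose inserting xid is
# no longer running must have been orphaned by a crash, so vacuum reaps
# it. Doing this before relfrozenxid advances past the stored xid is what
# avoids waiting on the "wrong" (recycled) transaction after wraparound.

def reap_orphaned_promises(index_items, running_xids):
    """index_items maps key -> ('promise', xid) or ('tuple', heap_tid).
    Returns the keys whose orphaned promises were removed."""
    reaped = []
    for key, (kind, payload) in list(index_items.items()):
        if kind == 'promise' and payload not in running_xids:
            del index_items[key]        # orphaned by a crashed backend
            reaped.append(key)
    return reaped
```

Note that promises belonging to still-running transactions, and real index tuples, are untouched; only crash leftovers go away.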
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
Hi, On 2013-08-30 19:38:24 -0700, Peter Geoghegan wrote: On Fri, Aug 30, 2013 at 5:47 PM, Andres Freund and...@2ndquadrant.com wrote: While I ponder on it some more, could you explain whether I understood something correctly? Consider the following scenario:

CREATE TABLE foo(a int UNIQUE, b int UNIQUE);
S1: INSERT INTO foo VALUES(0, 0);
S1: BEGIN;
S1: INSERT INTO foo VALUES(1, 1);
S1: SELECT pg_sleep(3600);
S2: BEGIN;
S2: INSERT INTO foo VALUES(2, 1);
S3: SELECT * FROM foo WHERE a = 0;

Won't S2 in this case exclusively lock a page in foo_a_key (the one where 2 will reside on) for 3600s, while it's waiting for S1 to finish while getting the speculative insertion into foo_b_key? Yes. The way the patch currently handles that case is unacceptable. It needs to be more concerned about holding an exclusive lock on earlier-locked indexes when locking the second and subsequent indexes iff it blocks on another transaction in the course of locking the second and subsequent indexes. After some thinking I don't think any solution primarily based on holding page level locks across other index operations is going to scale ok. What I had in mind when playing around with this is the following trick: Just as in your patch, split index insertion into two phases and perform one of them before inserting the heap tuple. In what you called speculative insertion, don't stop at locking the page, but actually insert an index tuple in the index. As we haven't yet inserted the heap tuple we obviously can't insert an actual tuple. Instead set a flag on the item and reuse the item pointer to store the xid of the inserting process. I coined those items insertion promises... Index searches will ignore promises they see, but concurrent index insertions will notice it's a promise and wait for the xid (or not, if it has aborted). That fits pretty well into the existing btree code. 
If the insertion of an index promise worked for all unique indexes and the heap tuple has been inserted, convert those promises into actual item pointers pointing to the heap tuple. That probably requires a second search and some cuteness to find the correct pointer because of concurrent splits, but that's not too bad (possibly you can do away with that if we continue to hold a pin). The fact that now xids can be in the index creates some slight complications because it now needs to be vacuumed - but that's relatively easy to fit into the bulkdelete API, and I don't think the contortions to the vacuum code will be too bad. Imo that solution is fairly elegant because it doesn't cause any heap bloat, and it causes the same amount of index bloat that insert-into-heap-first solutions cause. It also has pretty good concurrency behaviour because you never need to retain a page lock for longer than a normal index insertion. It's certainly a bit more complex than your solution tho... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
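The insertion-promise scheme in the message above can be modelled in miniature. This is an illustrative Python toy, not nbtree code; `PromiseIndex` and its methods are invented names, and `wait_for(xid)` stands in for blocking on a transaction, returning True if it committed.

```python
# Toy model of insertion promises: a promise occupies the key's slot in
# the unique index and stores the inserting xid in place of a heap item
# pointer. Searches ignore promises; concurrent unique insertions wait
# on the promising xid; after heap insertion the promise is converted
# into a real item pointer.

class PromiseIndex:
    def __init__(self):
        self.items = {}   # key -> ('promise', xid) or ('tuple', heap_tid)

    def search(self, key):
        """Plain index searches ignore promises entirely."""
        item = self.items.get(key)
        return item[1] if item and item[0] == 'tuple' else None

    def insert_promise(self, key, xid, wait_for):
        """Unique insertion: wait on any existing promise's xid first."""
        while True:
            item = self.items.get(key)
            if item is None:
                self.items[key] = ('promise', xid)
                return 'promised'
            if item[0] == 'promise':
                if wait_for(item[1]):       # other inserter committed:
                    return 'duplicate'      # its row wins
                del self.items[key]         # aborted: promise is dead, retry
                continue
            return 'duplicate'              # a real committed index tuple

    def fulfil(self, key, heap_tid):
        """After heap insertion, convert the promise to a real item pointer."""
        assert self.items[key][0] == 'promise'
        self.items[key] = ('tuple', heap_tid)
```

The `search` method captures the property debated downthread: ordinary searches act as if the promise were not there at all, which is exactly what Peter questions for dirty snapshots.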
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On Sat, Aug 31, 2013 at 11:34 AM, Andres Freund and...@2ndquadrant.com wrote: Won't S2 in this case exclusively lock a page in foo_a_key (the one where 2 will reside on) for 3600s, while it's waiting for S1 to finish while getting the speculative insertion into foo_b_key? Yes. The way the patch currently handles that case is unacceptable. It needs to be more concerned about holding an exclusive lock on earlier-locked indexes when locking the second and subsequent indexes iff it blocks on another transaction in the course of locking the second and subsequent indexes. I think that the fix here may be to do something analogous to what the code already does: release all buffer locks, wait for the other transaction to commit/abort, and try again. The difference being that we try again across all unique indexes rather than just this one. I have other obligations to take care of over the next couple of days, so I don't really have time to code that up and satisfy myself that it's robust; I'd prefer to have something more concrete to back this thought up, but that's my immediate reaction for what it's worth. We're looking for the first duplicate. So it would probably be okay for the IGNORE case to not bother retrying and re-locking if the other transaction committed (which, at least very broadly speaking, is the likely outcome). After some thinking I don't think any solution primarily based on holding page level locks across other index operations is going to scale ok. Just to be clear: I'm certainly not suggesting that it's okay to have exclusive buffer locks held for any amount of time longer than instantaneous. Furthermore, it isn't necessary to do anything different to what we currently do with non-unique btree indexes during speculative insertion - I don't treat them any differently, and they only get locked in the usual way at the usual time, after heap insertion. 
Much of the time, there'll only be a primary key index to lock, and you'll have increased the window of the exclusive write-lock (compared to the existing code when doing a regular insertion) by only a very small amount - the duration of heap tuple insertion (the baseline here is the duration of finding an insertion location on the page for the index tuple, tuple insertion + possible page split). We're talking about a write-lock on a btree leaf page, which is only going to block other sessions from *finishing off* tuple insertion, and then only for a minority of such insertions much of the time. It's probably worth doing as much work up-front as possible (in terms of asking questions about whether we even want to lock the indexes and so on, within ExecLockIndexTuples()), so as to decrease the window that those locks are held. That's just an obvious optimization. As users add unique indexes to tables, things do of course get worse and worse. However, since in order for any of this to matter, there has to be lots of concurrent activity in clustered ranges of values, it's pretty likely that you'd conclude that tuple insertion should not go ahead early on, and not end up doing too much exclusive/write locking per tuple. The question must be asked: if you don't think this will scale okay, what are you comparing it to? The most useful basis of comparison is likely to be other systems. What I had in mind when playing around with this is the following trick: Just as in your patch split index insertion into two phases and perform one of them before inserting the heap tuple. 
In what you called speculative insertion, don't stop at locking the page, but actually insert an index tuple in the index. As we haven't yet inserted the heap tuple we obviously can't insert an actual tuple. Instead set a flag on the item and reuse the item pointer to store the xid of the inserting process. I coined those items insertion promises... I did consider that sort of approach from an early stage - Simon suggested it to me privately early last year. Actually, I even pointed out in the 2012 developer meeting that there were unused IndexTuple flag bits available for such purposes. Index searches will ignore promises they see, but concurrent index insertions will notice it's a promise and wait for the xid (or not, if it has aborted). If inserters must block, they'll have to restart - I'm sure you'll agree that they can't sit on even a shared/read buffer lock indefinitely. So, have you really saved anything? Especially since you'll have to search at least twice (more shared locks on *root* pages) where my scheme only needs to search once. Also, are you really sure that index searches should have license to ignore these promises? What about dirty snapshots?
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On Sat, Aug 31, 2013 at 7:38 PM, Peter Geoghegan p...@heroku.com wrote: Imo that solution is fairly elegant because it doesn't cause any heap bloat, and it causes the same amount of index bloat that insert-into-heap-first solutions cause. I don't think that's a reasonable comparison. Bloating indexes with dead *duplicate* tuples is, in general, worse than doing the same with the heap. If you end up with a heavily bloated index with lots of duplicates, that will affect performance worse than non-duplicate index tuples would (see [1] for example). That was one reason that HOT helped as much as it did, I believe. Bear in mind also that one unique index could hardly ever have real duplicates, but it would still get broken promise/bloat tuples because another, later index said it had a duplicate. All of the indexes get bloated (quite possibly) just because of a duplicate in one. With a table with many unique indexes and many reasons to decide to reject a tuple proposed for insertion by INSERT...ON DUPLICATE KEY IGNORE, it isn't hard to imagine them all becoming heavily bloated very quickly. -- Peter Geoghegan
[HACKERS] INSERT...ON DUPLICATE KEY IGNORE
For many years now, MySQL has had a feature called INSERT IGNORE [1]; SQLite has INSERT ON CONFLICT IGNORE [2]; SQL Server has an option called IGNORE_DUP_KEY and Oracle has a hint called IGNORE_ROW_ON_DUPKEY_INDEX (they acknowledge that it's a bit odd that a hint changes the semantics of a DML statement - of course, MERGE has always been able to do something approximately equivalent). Each of these features has its respective system simply not insert certain tuples that would result in a duplicate key violation, while potentially proceeding with the insertion of other, non-violating tuples. The attached WIP patch implements this for Postgres, with a few notable differences:

1) The patch is only interested in unique index violations (specifically, violations of amcanunique access method unique indexes); it will not do anything with NOT NULL constraint violations, as the MySQL feature does, for example. It also doesn't care about exclusion constraints, even though it's currently possible to defer exclusion constraint enforcement in a way that is analogous to how it's done for btree indexes.

2) The clause that I've proposed is, as you'll have already gathered, spelt slightly differently to any of these systems. I'm not particularly attached to how I've spelt it. I've spelt it this way so as to avoid suggesting that it's fully compatible with the MySQL feature. I don't think we want weird NULLness behavior, but I may be mistaken. If we want that additional behavior, we probably want it as an optional adjunct, or perhaps an entirely independent clause to what I've specified here.

3) RETURNING is expanded - RETURNING REJECTS * is now possible where that makes sense. It is possible for clients to interrogate the wire protocol to see the number of rows affected (this can be done with a PQcmdTuples() call by libpq clients, for example), and see how that compares with the number of tuples they tried to insert. 
In addition, a client can use RETURNING to see what rows were successfully inserted. Here is a simple example session that shows this usage:

postgres=# create table tab(id int4 primary key, val text);
CREATE TABLE
postgres=# insert into tab(id, val) values(1, 'val'), (3, 'val'), (4, 'val');
INSERT 0 3
postgres=# insert into tab(id, val) values(1, 'nosert'), (2, 'nosert'), (3, 'nosert')
           on duplicate key ignore returning id;
 id
----
  2
(1 row)

INSERT 0 1
postgres=# select xmin, * from tab;
  xmin  | id | val
--------+----+--------
 580843 |  1 | val
 580843 |  3 | val
 580843 |  4 | val
 580844 |  2 | nosert
(4 rows)

If this all happened in two separate, concurrent sessions, the transaction with xid 580844 might well have blocked on the outcome of 580843 in the usual way (ultimately inserting or doing nothing as appropriate for each tuple), just as you'd expect. Note, however, that the xid of the noserter is only seen here for the tuple that it actually successfully inserted - in effect, the noserter session does not lock tuples that it did not originally insert. That's consistent with MySQL's INSERT IGNORE, for example. However, some people might prefer it if this did share lock rows that prevented tuples from being inserted, or if that was at least an option. That would be something quite a bit closer to a fully-fledged upsert - I imagine we'd have to lock the old tuple using EvalPlanQual, and if a submission did that correctly then I think we'd be well on our way to solving all the problems. RETURNING REJECTS is a way of getting the inverse of the affected tuples of an ordinary INSERT...RETURNING statement - rather than returning only rows successfully inserted, we return only rows that were not successfully inserted (i.e. tuples where either a BEFORE trigger returned NULL or we didn't proceed with insertion). This is only meaningful in the context of INSERT...ON DUPLICATE KEY IGNORE. 
Sure, clients could figure this out for themselves using vanilla RETURNING *, but I think that this addition is justified by the fact that it's generally expected that the number of rejects will be relatively small. People are going to want to use this feature in an ad-hoc manner within a psql session, and supporting this will particularly help there. It might be nice to find a way to indicate the exact constraint violated per row returned, but I preferred to wait and see if people thought it was a good idea generally.

Use-cases
=========

The use-cases for this feature are fairly obvious. It can be used as a way of merging tuples easily, if clients can live with the present limitations. In addition to that, the feature may be of value to clients of the logical change-set generation infrastructure (such as the proposed BDR contrib module [3]), if and when that infrastructure is finally committed. Currently, the BDR prototype simply balks at INSERT conflicts. This is because the amount of subtransactions/xids generated to do anything else is considered prohibitive. If Postgres is ever going to solve the kinds of problems
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On 08/30/2013 03:09 PM, Peter Geoghegan wrote: The attached WIP patch implements this for Postgres, with a few notable differences: Thank you for addressing this. If nobody is going to hack out MERGE, we could really use at least this feature. 3) RETURNING is expanded - RETURNING REJECTS * is now possible where that makes sense. Oh, nifty! OK, now I can *really* use this feature. This patch is a spin-off from a broader effort to implement INSERT...ON DUPLICATE KEY UPDATE (upsert). During the 2012 developer Yeah, I was wondering when we'd get to that. Obviously there will be users clamoring for it ... Unlike some other systems like DB2, we have always allowed BEFORE ROW triggers to execute arbitrary SQL. I've frequently thought this was a bit of a wart (e.g. [10]), and certainly not supportive of sensible, idiomatic trigger use, but there isn't much we can do about it at this stage. Right now, BEFORE ROW triggers will still fire when the new code decides to not do the insert. It certainly wouldn't be acceptable to allow before triggers to run *after* the first phase of speculative insertion, because they might try and access an index with an exclusively locked buffer, resulting in backend self-deadlock. Besides, we cannot really know *what* to lock until after the before triggers fire. That leaves us with two options, as far as I can tell: 1. Just have users live with those semantics. This is what the patch does presently, and anyone who is using triggers in what I consider to be a sensible way doesn't have to care. For everyone else, it's a gotcha that they have to deal with, to be noted prominently. +1. We already allow BEFORE triggers to violate referential integrity. I don't think that allowing them to behave oddly around INSERT ... IGNORE is a problem compared to that. We just need to document it, so that users know that their BEFORE code will fire even if the INSERT is being ignored, and that a BEFORE trigger can cause an INSERT ... IGNORE to error out. 
I have done no performance testing to date. Reviewers will want to pay attention to the performance implications, particularly in the regular insert case; it's conceivable that I've regressed things, though I don't specifically suspect that I have. Yeah, we'll also want to document the performance overhead for the bulk loading case. I know I'll want to use this syntax as a primitive form of MERGE, and I'll want to know what it costs me. Does this work with multiple VALUES rows? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On Fri, Aug 30, 2013 at 3:40 PM, Josh Berkus j...@agliodbs.com wrote: Does this work with multiple VALUES rows? Yes. -- Peter Geoghegan
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
Hi, This is awesome. On 2013-08-30 15:09:59 -0700, Peter Geoghegan wrote: 1) The patch is only interested in unique index violations (specifically, violations of amcanunique access method unique indexes); it will not do anything with NULL constraint violations, as the MySQL feature does, for example. It also doesn't care about exclusion constraints, even though it's currently possible to defer exclusion constraint enforcement in a way that is analogous to how it's done for btree indexes. All that seems sane to me. I very, very much do not want it to deal with NOT NULL violations. [syntax] Haven't thought about syntax at all so far, that's food for another day. Use-cases = ... In addition to that, the feature may be of value to clients of the logical change-set generation infrastructure (such as the proposed BDR contrib module [3]), if and when that infrastructure is finally committed. Currently, the BDR prototype simply balks at INSERT conflicts. This is because the amount of subtransactions/xids generated to do anything else is considered prohibitive. If Postgres is ever going to solve the kinds of problems that BDR proposes to solve well (except for the relatively narrow use-case where balking at INSERT conflicts is acceptable), it needs something like what I've proposed here. The only alternative is probably to replay a transaction without using subtransactions, and if that fails, remember that fact and wrap every single change in a subxact. This is really going to be painful for large transactions. Exactly. In completely unscientific tests it reduces speed to less than 1/10 of the version without subxacts. Without a single conflict, that is. Andres privately said some weeks back (prior to my sharing of this information, and even prior to writing most of the code posted here) that he'd like to see an INSERT...ON DUPLICATE KEY LOCK. 
I suppose that he must have meant that he'd like to see shared locks obtained on rows that blocked a noserter from actually inserting some of their proposed tuples, for the purposes of conflict resolution (which, as I've said, this doesn't do). Perhaps he'd like to comment on these issues here. Ok, so what we would like to do is basically to follow a protocol (simplified) like:

1) replay INSERT, using ON DUPLICATE IGNORE
2) if INSERT succeeded, be happy, no conflict (another INSERT with the same key might conflict though)
3) if the INSERT failed, lock tuple for UPDATE
4) if the tuple got deleted by another session, goto 1
5) check whether local tuple or the already inserted tuple wins conflict resolution
6) UPDATE or skip accordingly

If there's a variant of INSERT ... ON DUPLICATE that gets a FOR UPDATE lock on the existing row this gets simpler because there's no chance anymore we need to loop or look for other versions of the row. Makes sense?

Mechanism
=========

[...] From a high level, the patch works by adding something that is tentatively internally referred to as speculative insertion in lower-level code comments. The basic approach is:

* Within ExecInsert, when inserting into a regular heap relation from a tuple table slot (after BEFORE triggers fire but before inserting the heap tuple), there is an extra function call to a new function, ExecLockIndexTuples().

* For each applicable amcanunique unique index, ExecLockIndexTuples() calls a new am method, amlock. For each unique index, we end up acquiring a write buffer lock in a similar fashion to _bt_doinsert. When we return from the new amlock function, we report whether or not there was a duplicate without having inserted any index tuples or having otherwise done something that requires a heap tuple to be present. 
I like to call this point the point of no return, where individual unique indexes commit themselves to insertion (which is to say, the point after which a unique constraint violation becomes impossible but before actual index tuple insertion in _bt_doinsert [9]). Another duplicate key cannot be inserted by anyone else until that write lock is released, which it won't be until insertion or IGNORE, for the same reason why that's currently true once control reaches the point of no return. A pagesplit might need to happen before insertion proper of that index tuple, for example, and that's okay. * If we have total consensus among all unique indexes, we proceed with a heap insertion of the tuple formed from the tuple table slot ExecInsert() was called in respect of. Yes, this means that potentially multiple exclusive buffer locks are held for the duration of the heap tuple insertion. If not, we release the locks early and return NULL from ExecInsert() before heap insertion ever happens (and certainly before index tuple insertion ever happens). * If and when we get to ExecInsertIndexTuples(), and ultimately to _bt_doinsert, the new _bt_doinsert knows that for any unique indexes it sees, that the process has
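The consensus-then-insert flow described above (amlock every unique index, then heap insertion only on unanimous agreement, with the locks still held) might be modelled as follows. This is an illustrative Python toy under invented names (exec_insert, ToyUniqueIndex, etc.); the real amlock/ExecLockIndexTuples machinery works on buffer locks in C.

```python
# Toy model of speculative insertion: acquire the exclusive lock each
# unique index needs, and only if *all* indexes report no duplicate
# proceed with heap insertion while still holding the locks, so no
# competing duplicate can slip in between the check and the insert.

class ToyLock:
    def __init__(self): self.held = True
    def release(self): self.held = False

class ToyUniqueIndex:
    def __init__(self): self.tuples = {}
    def amlock(self, key): return ToyLock()            # "point of no return" lock
    def is_duplicate(self, key): return key in self.tuples
    def insert_tuple(self, key, tid): self.tuples[key] = tid

class ToyHeap:
    def __init__(self): self.rows = []
    def insert(self, row):
        self.rows.append(row)
        return len(self.rows) - 1                      # a stand-in for the TID

def exec_insert(unique_indexes, keys, heap, row):
    """Returns the new heap TID, or None if the insert was IGNOREd."""
    locks = [idx.amlock(key) for idx, key in zip(unique_indexes, keys)]
    try:
        if any(idx.is_duplicate(key) for idx, key in zip(unique_indexes, keys)):
            return None                 # IGNORE: no heap tuple, hence no bloat
        tid = heap.insert(row)          # locks still held across heap insertion
        for idx, key in zip(unique_indexes, keys):
            idx.insert_tuple(key, tid)  # index insertion cannot now fail on unique
        return tid
    finally:
        for lock in locks:
            lock.release()
```

The IGNORE path illustrates the property Peter emphasizes: a rejected row never touches the heap or any index, so there is nothing to clean up afterwards.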
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On Fri, Aug 30, 2013 at 5:47 PM, Andres Freund and...@2ndquadrant.com wrote: This is awesome. Thanks. All that seems sane to me. I very, very much do not want it to deal with NOT NULL violations. Sure. But there's nothing stopping us from doing that as a totally orthogonal thing. Not that I personally find it to be terribly compelling or anything. It's an easy addition, but I'm certainly not going to try and mandate it as integral to what I've done here if that doesn't suit you. Ok, so what we would like to do is basically to follow a protocol (simplified) like:

1) replay INSERT, using ON DUPLICATE IGNORE
2) if INSERT succeeded, be happy, no conflict (another INSERT with the same key might conflict though)
3) if the INSERT failed, lock tuple for UPDATE
4) if the tuple got deleted by another session, goto 1
5) check whether local tuple or the already inserted tuple wins conflict resolution
6) UPDATE or skip accordingly

Right, I thought that's what you meant. I'll have to ponder it further. However, I don't think locking the old tuple(s) is mandatory to make this a useful feature - as I've noted, MySQL doesn't do that. That said, I really want to support that, perhaps as another DUPLICATE KEY variant. As I've noted, if we had that, we almost wouldn't need upsert - loosely speaking it would be mere syntactic sugar on top of what you require. If there's a variant of INSERT ... ON DUPLICATE that gets a FOR UPDATE lock on the existing row this gets simpler because there's no chance anymore we need to loop or look for other versions of the row. Makes sense? Perfect sense. Puh. It took me some time to even start to understand what you're doing here... While I ponder on it some more, could you explain whether I understood something correctly? 
Consider the following scenario:

CREATE TABLE foo(a int UNIQUE, b int UNIQUE);
S1: INSERT INTO foo VALUES(0, 0);
S1: BEGIN;
S1: INSERT INTO foo VALUES(1, 1);
S1: SELECT pg_sleep(3600);
S2: BEGIN;
S2: INSERT INTO foo VALUES(2, 1);
S3: SELECT * FROM foo WHERE a = 0;

Won't S2 in this case exclusively lock a page in foo_a_key (the one where 2 will reside on) for 3600s, while it's waiting for S1 to finish while getting the speculative insertion into foo_b_key? Yes. The way the patch currently handles that case is unacceptable. It needs to be more concerned about holding an exclusive lock on earlier-locked indexes when locking the second and subsequent indexes iff it blocks on another transaction in the course of locking the second and subsequent indexes. ISTM we could use something like a new lock level to make that work more sensibly, basically something like LW_FOR_UPDATE which does not conflict with _SHARED but conflicts with _EXCLUSIVE. I'll ponder it further. There are probably a number of options. -- Peter Geoghegan
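Andres's 1)-6) replay protocol, quoted above, can be sketched as a small driver loop. This is illustrative Python; all four callables (`insert_ignore`, `lock_for_update`, `resolve`) are invented stand-ins for the real replication machinery, and the returned string is just the action taken.

```python
# Toy model of the BDR conflict protocol: try INSERT ... ON DUPLICATE
# IGNORE; on duplicate, lock the conflicting row FOR UPDATE; if it was
# deleted in the meantime, go back to step 1; otherwise run conflict
# resolution and UPDATE or skip.

def replay_insert(local_row, insert_ignore, lock_for_update, resolve):
    """insert_ignore(row)    -> True if inserted, False on duplicate key
    lock_for_update(row)     -> the existing conflicting row, or None if
                                it got deleted by another session
    resolve(local, existing) -> True if the incoming change wins"""
    while True:
        if insert_ignore(local_row):           # steps 1+2: no conflict
            return 'inserted'
        existing = lock_for_update(local_row)  # step 3: lock the conflict
        if existing is None:                   # step 4: it vanished; retry
            continue
        if resolve(local_row, existing):       # step 5: conflict resolution
            return 'updated'                   # step 6: apply incoming change
        return 'skipped'                       # step 6: local row wins
```

As Andres notes, a FOR UPDATE-locking variant of the INSERT would collapse steps 1-4 into one statement and remove the retry loop entirely.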