Re: [HACKERS] dynamic shared memory
On Thu, Aug 29, 2013 at 8:12 PM, Jim Nasby j...@nasby.net wrote:
> On 8/13/13 8:09 PM, Robert Haas wrote:
>> is removed, the segment automatically goes away (we could allow for
>> server-lifespan segments as well with only trivial changes, but I'm
>> not sure whether there are compelling use cases for that).
>
> To clarify... you're talking something that would intentionally survive
> postmaster restart? I don't see use for that either...

No, I meant something that would live as long as the postmaster and die
when it dies.

> Ignorant question... is ResourceOwner related to memory contexts? If
> not, would memory contexts be a better way to handle memory segment
> cleanup?

Nope. :-)

>> There are quite a few problems that this patch does not solve. First,
>
> It also doesn't provide any mechanism for notifying backends of a new
> segment. Arguably that's beyond the scope of dsm.c, but ISTM that it'd
> be useful to have a standard method or three of doing that; perhaps
> just some convenience functions wrapping the methods mentioned in
> comments.

I don't see that as being generally useful. Backends need to know more
than "there's a new segment", and in fact most backends won't care about
most new segments. A background worker needs to know about the new
segment *that it should attach*, but we have bgw_main_arg. If we end up
using this facility for any system-wide purposes, I imagine we'll do
that by storing the segment ID in the main shared memory segment
someplace.

> Thanks to you and the rest of the folks at EnterpriseDB... dynamic
> shared memory is something we've needed forever! :)

Thanks.

> Other comments...
>
> + * If the state file is empty or the contents are garbled, it probably means
> + * that the operating system rebooted before the data written by the previous
> + * postmaster made it to disk. In that case, we can just ignore it; any shared
> + * memory from before the reboot should be gone anyway.
> I'm a bit concerned about this; I know it was possible in older
> versions for the global shared memory context to be left behind after a
> crash and needing to clean it up by hand. Dynamic shared mem
> potentially multiplies that by 100 or more. I think it'd be worth
> changing dsm_write_state_file so it always writes a new file and then
> does an atomic mv (or something similar).

I agree that the possibilities for leftover shared memory segments are
multiplied with this new facility, and I've done my best to address
that. However, I don't agree that writing the state file in a different
way would improve anything.

> + * If some other backend exited uncleanly, it might have corrupted the
> + * control segment while it was dying. In that case, we warn and ignore
> + * the contents of the control segment. This may end up leaving behind
> + * stray shared memory segments, but there's not much we can do about
> + * that if the metadata is gone.
>
> Similar concern... in this case, would it be possible to always write
> updates to an un-used slot and then atomically update a pointer? This
> would be more work than what I suggested above, so maybe just a TODO
> for now...
>
> Though... is there anything a dying backend could do that would corrupt
> the control segment to the point that it would screw up segments
> allocated by other backends and not related to the dead backend? Like
> marking a slot as not used when it is still in use and isn't associated
> with the dead backend?

Sure. A messed-up backend can clobber the control segment just as it
can clobber anything else in shared memory. There's really no way
around that problem. If the control segment has been overwritten by a
memory stomp, we can't use it to clean up. There's no way around that
problem except to not use a control segment, which wouldn't be better.

> Should this (in dsm_attach)
>
> + * If you're hitting this error, you probably want to use attempt to be
>
> + * If you're hitting this error, you probably want to attempt to
>
> ?

Good point.
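For reference, the write-a-new-file-then-atomic-rename pattern Jim is
suggesting can be sketched in standalone C. This is only an
illustration (the file name, payload, and function name are made up, and
it is not how dsm.c actually writes its state file); the property it
relies on is that rename(2) atomically replaces the target on POSIX
filesystems, so a crash leaves either the old file or the new one, never
a torn mixture:

```c
#include <stdio.h>

/*
 * Write `contents` to a temporary file and then atomically rename it
 * over `path`.  A reader never sees a partially-written state file:
 * rename(2) replaces the target in a single step on POSIX systems.
 * Returns 0 on success, -1 on failure.
 */
static int
write_state_file_atomically(const char *path, const char *contents)
{
    char tmppath[1024];
    FILE *fp;

    snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);

    fp = fopen(tmppath, "w");
    if (fp == NULL)
        return -1;
    if (fputs(contents, fp) == EOF || fflush(fp) != 0)
    {
        fclose(fp);
        remove(tmppath);
        return -1;
    }
    /* A real implementation would fsync() here before renaming. */
    fclose(fp);

    /* Atomically replace the old state file with the new one. */
    if (rename(tmppath, path) != 0)
    {
        remove(tmppath);
        return -1;
    }
    return 0;
}
```

Whether this buys anything for dsm.c is the point under debate above;
the sketch just shows the mechanism being proposed.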
> Should dsm_impl_op sanity check the arguments after op? I didn't
> notice checks in the type-specific code but I also didn't read all of
> it... are we just depending on the OS to sanity-check?

Sanity-check for what?

> Also, does the GUC code enforce that the GUC must always be something
> that's supported? If not then the error in dsm_impl_op should be more
> user-friendly.

Yes.

> I basically stopped reading after dsm_impl_op... the rest of the stuff
> was rather over my head. :-)

Thanks for your interest!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] dynamic shared memory
On Fri, Aug 30, 2013 at 11:45 AM, Andres Freund and...@2ndquadrant.com wrote:
>> The way I've designed it, no. If what we expect to be the control
>> segment doesn't exist or doesn't conform to our expectations, we just
>> assume that it's not really the control segment after all - e.g.
>> someone rebooted, clearing all the segments, and then an unrelated
>> process (malicious, perhaps, or just a completely different cluster)
>> reused the same name. This is similar to what we do for the main
>> shared memory segment.
>
> The case I am mostly wondering about is some process crashing and
> overwriting random memory. We need to be pretty sure that we'll never
> fail partially through cleaning up old segments because they are
> corrupted or because we died halfway through our last cleanup attempt.

Right. I had those considerations in mind and I believe I have nailed
the hatch shut pretty tight. The cleanup code is designed never to die
with an error. Of course it might, but it would have to be something
like an out-of-memory failure or similar that isn't really what we're
concerned about here. You are welcome to look for holes, but these
issues are where most of my brainpower went during development.

>> That's true, but that decision has not been uncontroversial - e.g. the
>> NetBSD guys don't like it, because they have a big performance
>> difference between those two types of memory. We have to balance the
>> possible harm of one more setting against the benefit of letting
>> people do what they want without needing to recompile or modify code.
>
> But then, it made them fix the issue afaik :P

Pah. :-)

>> You can look at it while the server's running.
>
> That's what debuggers are for.

Tough crowd. I like it. YMMV.

I would never advocate deliberately trying to circumvent a
carefully-considered OS-level policy decision about resource
utilization, but I don't think that's the dynamic here.
>> I think if we insist on predetermining the dynamic shared memory
>> implementation based on the OS, we'll just be inconveniencing people
>> needlessly, or flat-out making things not work. [...]
>
> But using file-backed memory will *suck* performancewise. Why should we
> ever want to offer that to a user? That's what I was arguing about
> primarily.

I see. There might be additional writeback traffic, but it might not be
that bad in common cases. After all, the data's pretty hot.

>> If we're SURE that a Linux user will prefer posix to sysv or mmap or
>> none in 100% of cases, and that a NetBSD user will always prefer sysv
>> over mmap or none in 100% of cases, then, OK, sure, let's bake it in.
>> But I'm not that sure.
>
> I think posix shmem will be preferred to sysv shmem if present, in just
> about any relevant case. I don't know of any system with lower limits
> on posix shmem than on sysv.

OK, how about this: SysV doesn't allow extending segments, but mmap
does. The thing here is that you're saying remove mmap and keep sysv,
but Noah suggested to me that we remove sysv and keep mmap. This
suggests to me that the picture is not so black and white as you think
it is.

>> I shared your opinion that preferred_address is never going to be
>> reliable, although FWIW Noah thinks it can be made reliable with a
>> large-enough hammer.
>
> I think we need to have the arguments for that on list then. Those are
> pretty damn fundamental design decisions. I for one cannot see how you
> even remotely could make that work a) on windows (check the troubles we
> have to go through to get s_b consistently placed, and that's directly
> after startup) b) 32bit systems.

Noah?

>> But even if it isn't reliable, there doesn't seem to be all that much
>> value in forbidding access to that part of the OS-provided API. In the
>> world where it's not reliable, it may still be convenient to map
>> things at the same address when you can, so that pointers can be used.
>> Of course you'd have to have some fallback strategy for when you don't
>> get the same mapping, and maybe that's painful enough that there's no
>> point after all. Or maybe it's worth having one code path for
>> relativized pointers and another for non-relativized pointers.
>
> It seems likely to me that will end up with untested code in that case.
> Or even unsupported platforms.

Maybe. I think for the amount of code we're talking about here, it's
not worth getting excited about.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
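The "relativized pointers" fallback mentioned in this exchange can be
sketched in standalone C: store offsets from the segment base instead of
raw addresses, so the data structure works no matter where each backend
happens to map the segment. The names here are purely illustrative, not
any dsm.c API; the sketch reserves offset 0 to represent NULL, which
works in practice because real segments start with a header, so no valid
object lives at offset 0:

```c
#include <stddef.h>

/* An offset from the segment's base address; 0 is reserved for NULL. */
typedef size_t relptr;

/* Convert a relative pointer back to a real address for this mapping. */
static void *
relptr_access(char *base, relptr r)
{
    return (r == 0) ? NULL : (void *) (base + r);
}

/* Convert a real address within the segment into a relative pointer. */
static relptr
relptr_store(char *base, void *p)
{
    return (p == NULL) ? 0 : (relptr) ((char *) p - base);
}
```

The cost being debated above is exactly the extra arithmetic (and the
second code path) on every access, versus mapping the segment at the
same address everywhere and using plain pointers.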
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
Hi,

On 2013-08-30 19:38:24 -0700, Peter Geoghegan wrote:
> On Fri, Aug 30, 2013 at 5:47 PM, Andres Freund and...@2ndquadrant.com wrote:
>> While I ponder on it some more, could you explain whether I understood
>> something correctly? Consider the following scenario:
>>
>> CREATE TABLE foo(a int UNIQUE, b int UNIQUE);
>>
>> S1: INSERT INTO foo VALUES(0, 0);
>> S1: BEGIN;
>> S1: INSERT INTO foo VALUES(1, 1);
>> S1: SELECT pg_sleep(3600);
>> S2: BEGIN;
>> S2: INSERT INTO foo VALUES(2, 1);
>> S3: SELECT * FROM foo WHERE a = 0;
>>
>> Won't S2 in this case exclusively lock a page in foo_a_key (the one
>> where 2 will reside on) for 3600s, while it's waiting for S1 to finish
>> while getting the speculative insertion into foo_b_key?
>
> Yes. The way the patch currently handles that case is unacceptable. It
> needs to be more concerned about holding an exclusive lock on
> earlier-locked indexes when locking the second and subsequent indexes
> iff it blocks on another transaction in the course of locking the
> second and subsequent indexes.

After some thinking I don't think any solution primarily based on
holding page-level locks across other index operations is going to
scale ok.

What I had in mind when playing around with this is the following
trick: just as in your patch, split index insertion into two phases and
perform one of them before inserting the heap tuple. In what you called
"speculative insertion" don't stop at locking the page, but actually
insert an index tuple in the index. As we haven't yet inserted the heap
tuple we obviously can't insert an actual tuple. Instead set a flag on
the item and reuse the item pointer to store the xid of the inserting
process. I coined those items "insertion promises"...

Index searches will ignore promises they see, but concurrent index
insertions will notice it's a promise and wait for the xid (or not, if
it has aborted). That fits pretty well into the existing btree code.
If the insertion of an index promise worked for all unique indexes and
the heap tuple has been inserted, convert those promises into actual
item pointers pointing to the heap tuple. That probably requires a
second search and some cuteness to find the correct pointer because of
concurrent splits, but that's not too bad (possibly you can do away
with that if we continue to hold a pin).

The fact that now xids can be in the index creates some slight
complications because it now needs to be vacuumed - but that's
relatively easy to fit into the bulkdelete API and I don't think the
contortions to the vacuum code will be too bad.

Imo that solution is fairly elegant because it doesn't cause any heap
bloat, and it causes the same amount of index bloat that
insert-into-heap-first type solutions cause. It also has pretty good
concurrency behaviour because you never need to retain a page lock for
longer than a normal index insertion. It's certainly a bit more complex
than your solution tho...

Greetings,

Andres Freund

-- 
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
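The item-pointer reuse Andres describes can be sketched in standalone C.
This is a hypothetical encoding, not actual nbtree code: an index item
pointer is 6 bytes (a 4-byte block number stored as two 16-bit halves,
plus a 2-byte offset number), so a 32-bit xid fits in the block-number
field if a reserved offset value flags the item as a promise rather
than a real heap pointer:

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

/*
 * Mirrors the 6-byte on-disk shape of an ItemPointerData: block number
 * split into two 16-bit halves, plus an offset number.  The names and
 * the promise flag value are illustrative only.
 */
typedef struct PromiseItemPointer
{
    uint16_t bi_hi;     /* high 16 bits of block number (or of xid) */
    uint16_t bi_lo;     /* low 16 bits of block number (or of xid) */
    uint16_t offset;    /* tuple offset; PROMISE_OFFSET marks a promise */
} PromiseItemPointer;

#define PROMISE_OFFSET 0xFFFF   /* reserved, never a valid tuple offset */

/* Turn the item pointer into a promise carrying the inserter's xid. */
static void
itemptr_set_promise(PromiseItemPointer *ip, TransactionId xid)
{
    ip->bi_hi = (uint16_t) (xid >> 16);
    ip->bi_lo = (uint16_t) (xid & 0xFFFF);
    ip->offset = PROMISE_OFFSET;
}

static bool
itemptr_is_promise(const PromiseItemPointer *ip)
{
    return ip->offset == PROMISE_OFFSET;
}

/* Recover the xid a concurrent inserter should wait on. */
static TransactionId
itemptr_get_promise_xid(const PromiseItemPointer *ip)
{
    return ((TransactionId) ip->bi_hi << 16) | ip->bi_lo;
}
```

A scan that finds the promise flag set would skip the item; a
conflicting inserter would extract the xid and wait on it, as described
above.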
Re: [HACKERS] [v9.4] row level security
KaiGai,

* Kohei KaiGai (kai...@kaigai.gr.jp) wrote:
> The point we shouldn't forget is information leakage via covert-channel
> has less grade of threat than leakage via main-channel, because of much
> less bandwidth.

While true, this argument can't be used to simply ignore any and all
covert channels. There are covert channels which are *not* much less
bandwidth, and the "Filtered Rows" one is one of those - it's simply too
big to ignore. There are likely others which are equally large, and
until we clean those up our RLS implementation will be questioned by
our users. This does not mean that we need to clean up all covert
channels, and things which are clearly intractable don't need to be
addressed (eg: the unique constraint situation; we can't both guarantee
uniqueness and hide the existence of an entry).

> Security community also concludes it is not avoidable nature as long as
> human can observe system behavior and estimate something,

This particular case is actually beyond 'estimation'.

> Even if attacker has enough knowledge, the ratio they can estimate
> hidden values is very limited because of much less bandwidth of covert
> channels.

You really need to back away from this argument in this case. The
example shown isn't based on estimation and provides quite high
bandwidth because it can be run by a computer. This argument can't be
used for every situation where information is leaked.

> However, in general, it is impossible to eliminate anything in spite of
> less degree of threats because of smaller bandwidth. So, I don't think
> this is an issue to spent much efforts to solve.

I don't see a lot of complexity required to fix this specific case. A
great deal of effort will be involved in going through the rest of the
code and various options and working to eliminate other similar cases,
but that's a reality we need to deal with. Hopefully the cases found
will not require a lot of additional code to deal with.
We will need to be mindful of new code which adds leaks or changes
behavior such that RLS doesn't function properly (hopefully, sufficient
regression tests will help to address that as well).

Thanks,

Stephen
Re: [HACKERS] CREATE FUNCTION .. SET vs. pg_dump
On 08/18/2013 05:40 PM, Tom Lane wrote:
> Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:
>> While working on upgrading the database of the search system on
>> postgresql.org to 9.2 I noticed that the dumps that pg_dump generates
>> on that system are actually invalid and cannot be reloaded without
>> being hacked on manually...
>>
>> CREATE TEXT SEARCH CONFIGURATION pg (
>>     PARSER = pg_catalog.default );
>>
>> CREATE FUNCTION test() RETURNS INTEGER
>>     LANGUAGE sql
>>     SET default_text_search_config TO 'public.pg'
>>     AS $$ SELECT 1; $$;
>>
>> once you dump that you will end up with an invalid dump because the
>> function will be dumped before the actual text search configuration is
>> (re)created.
>
> I don't think it will work to try to fix this by reordering the dump;
> it's too easy to imagine scenarios where that would lead to circular
> ordering constraints. What seems like a more workable answer is for
> CREATE FUNCTION to not attempt to validate SET clauses when
> check_function_bodies is off, or at least not throw a hard error when
> the validation fails. (I see for instance that if you try ALTER ROLE
> joe SET default_text_search_config TO nonesuch; you just get a notice
> and not an error.) However, I don't recall if there are any places
> where we assume the SET info was pre-validated by CREATE/ALTER
> FUNCTION.

any further insights into that issue? - seems a bit silly to have an
open bug that actually prevents us from taking (restorable) backups of
the search system on our own website...

Stefan
Re: [HACKERS] CREATE FUNCTION .. SET vs. pg_dump
* Stefan Kaltenbrunner (ste...@kaltenbrunner.cc) wrote:
> On 08/18/2013 05:40 PM, Tom Lane wrote:
>> Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:
>>> While working on upgrading the database of the search system on
>>> postgresql.org to 9.2 I noticed that the dumps that pg_dump generates
>>> on that system are actually invalid and cannot be reloaded without
>>> being hacked on manually...
>>>
>>> CREATE TEXT SEARCH CONFIGURATION pg (
>>>     PARSER = pg_catalog.default );
>>>
>>> CREATE FUNCTION test() RETURNS INTEGER
>>>     LANGUAGE sql
>>>     SET default_text_search_config TO 'public.pg'
>>>     AS $$ SELECT 1; $$;
>>>
>>> once you dump that you will end up with an invalid dump because the
>>> function will be dumped before the actual text search configuration
>>> is (re)created.
>>
>> I don't think it will work to try to fix this by reordering the dump;
>> it's too easy to imagine scenarios where that would lead to circular
>> ordering constraints. What seems like a more workable answer is for
>> CREATE FUNCTION to not attempt to validate SET clauses when
>> check_function_bodies is off, or at least not throw a hard error when
>> the validation fails. (I see for instance that if you try ALTER ROLE
>> joe SET default_text_search_config TO nonesuch; you just get a notice
>> and not an error.) However, I don't recall if there are any places
>> where we assume the SET info was pre-validated by CREATE/ALTER
>> FUNCTION.
>
> any further insights into that issue? - seems a bit silly to have an
> open bug that actually prevents us from taking (restorable) backups of
> the search system on our own website...

It would seem that a simple solution would be to add an elevel argument
to ProcessGUCArray and then call it with NOTICE in the case that
check_function_bodies is false. None of the contrib modules call
ProcessGUCArray, but should we worry that some external module does?

This doesn't address Tom's concern that we may trust in the SET to
ensure that the value stored is valid.
That seems like it'd be pretty odd given how we typically handle GUCs,
but I've not done a comprehensive review to be sure.

Like Stefan, I'd really like to see this fixed, and sooner rather than
later, so we can continue the process of upgrading our systems to 9.2.

Thanks,

Stephen
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On Sat, Aug 31, 2013 at 11:34 AM, Andres Freund and...@2ndquadrant.com wrote:
>>> Won't S2 in this case exclusively lock a page in foo_a_key (the one
>>> where 2 will reside on) for 3600s, while it's waiting for S1 to
>>> finish while getting the speculative insertion into foo_b_key?
>>
>> Yes. The way the patch currently handles that case is unacceptable. It
>> needs to be more concerned about holding an exclusive lock on
>> earlier-locked indexes when locking the second and subsequent indexes
>> iff it blocks on another transaction in the course of locking the
>> second and subsequent indexes.

I think that the fix here may be to do something analogous to what the
code already does: release all buffer locks, wait for the other
transaction to commit/abort, and try again. The difference being that
we try again across all unique indexes rather than just this one. I
have other obligations to take care of over the next couple of days, so
I don't really have time to code that up and satisfy myself that it's
robust; I'd prefer to have something more concrete to back this thought
up, but that's my immediate reaction, for what it's worth.

We're looking for the first duplicate. So it would probably be okay for
the IGNORE case to not bother retrying and re-locking if the other
transaction committed (which, at least very broadly speaking, is the
likely outcome).

> After some thinking I don't think any solution primarily based on
> holding page level locks across other index operations is going to
> scale ok.

Just to be clear: I'm certainly not suggesting that it's okay to have
exclusive buffer locks held for any amount of time longer than
instantaneous. Furthermore, it isn't necessary to do anything different
to what we currently do with non-unique btree indexes during
speculative insertion - I don't treat them any differently, and they
only get locked in the usual way at the usual time, after heap
insertion.
Much of the time, there'll only be a primary key index to lock, and
you'll have increased the window of the exclusive write-lock (compared
to the existing code when doing a regular insertion) by only a very
small amount - the duration of heap tuple insertion (the baseline here
is the duration of finding an insertion location on the page for the
index tuple, tuple insertion + possible page split). We're talking
about a write-lock on a btree leaf page, which is only going to block
other sessions from *finishing off* tuple insertion, and then only for
a minority of such insertions much of the time.

It's probably worth doing as much work up-front as possible (in terms
of asking questions about if we even want to lock the indexes and so
on, within ExecLockIndexTuples()), so as to decrease the window that
those locks are held. That's just an obvious optimization.

As users add unique indexes to tables, things do of course get worse
and worse. However, since in order for any of this to matter, there has
to be lots of concurrent activity in clustered ranges of values, it's
pretty likely that you'd conclude that tuple insertion should not go
ahead early on, and not end up doing too much exclusive/write locking
per tuple.

The question must be asked: if you don't think this will scale okay,
what are you comparing it to? The most useful basis of comparison is
likely to be other systems. Asking about uniqueness across many unique
indexes, and getting them all to commit to their unanimous position for
a moment in time, is inherently involved - I don't see a way to do it
other than to lock values, and I don't see any granularity at which
that's possible other than the btree page granularity.

> What I had in mind when playing around with this is the following
> trick: just as in your patch, split index insertion into two phases
> and perform one of them before inserting the heap tuple.
> In what you called "speculative insertion" don't stop at locking the
> page, but actually insert an index tuple in the index. As we haven't
> yet inserted the heap tuple we obviously can't insert an actual tuple.
> Instead set a flag on the item and reuse the item pointer to store the
> xid of the inserting process. I coined those items "insertion
> promises"...

I did consider that sort of approach from an early stage - Simon
suggested it to me privately early last year. Actually, I even pointed
out in the 2012 developer meeting that there were unused IndexTuple
flag bits available for such purposes.

> Index searches will ignore promises they see, but concurrent index
> insertions will notice it's a promise and wait for the xid (or not, if
> it has aborted).

If inserters must block, they'll have to restart - I'm sure you'll
agree that they can't sit on even a shared/read buffer lock
indefinitely. So have you really saved anything? Especially since
you'll have to search at least twice (more shared locks on *root*
pages) where my scheme only needs to search once. Also, are you really
sure that index searches should have license to ignore these promises?
What about dirty snapshots?
Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE
On Sat, Aug 31, 2013 at 7:38 PM, Peter Geoghegan p...@heroku.com wrote:
> Imo that solution is fairly elegant because it doesn't cause any heap
> bloat, and it causes the same amount of index bloat that
> insert-into-heap-first type solutions cause.

I don't think that's a reasonable comparison. Bloating indexes with
dead *duplicate* tuples is, in general, worse than doing the same with
the heap. If you end up with a heavily bloated index with lots of
duplicates, that will adversely affect performance worse than with
non-duplicate index tuples (see [1] for example). That was one reason
that HOT helped as much as it did, I believe.

Bear in mind also that one unique index could hardly ever have real
duplicates, but it would still get broken-promise/bloat tuples because
another, later index said it had a duplicate. All of the indexes get
bloated (quite possibly) just because of a duplicate in one. With a
table with many unique indexes and many reasons to decide to reject a
tuple proposed for insertion by INSERT...ON DUPLICATE KEY IGNORE, it
isn't hard to imagine them all becoming heavily bloated very quickly.

-- 
Peter Geoghegan