Re: [HACKERS] dynamic shared memory

2013-08-31 Thread Robert Haas
On Thu, Aug 29, 2013 at 8:12 PM, Jim Nasby j...@nasby.net wrote:
 On 8/13/13 8:09 PM, Robert Haas wrote:
 is removed, the segment automatically goes away (we could allow for
 server-lifespan segments as well with only trivial changes, but I'm
 not sure whether there are compelling use cases for that).

 To clarify... you're talking something that would intentionally survive
 postmaster restart? I don't see use for that either...

No, I meant something that would live as long as the postmaster and
die when it dies.

 Ignorant question... is ResourceOwner related to memory contexts? If not,
 would memory contexts be a better way to handle memory segment cleanup?

Nope.  :-)

 There are quite a few problems that this patch does not solve.  First,

 It also doesn't provide any mechanism for notifying backends of a new
 segment. Arguably that's beyond the scope of dsm.c, but ISTM that it'd be
 useful to have a standard method or three of doing that; perhaps just some
 convenience functions wrapping the methods mentioned in comments.

I don't see that as being generally useful.  Backends need to know
more than just that there's a new segment, and in fact most backends
won't care about most new segments.  A background worker needs to know about
the new segment *that it should attach*, but we have bgw_main_arg.  If
we end up using this facility for any system-wide purposes, I imagine
we'll do that by storing the segment ID in the main shared memory
segment someplace.
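
To illustrate the bgw_main_arg point a bit (a rough sketch only -- the
dsm_* calls and bgworker fields used here are assumptions for
illustration and may not match what the patch actually exposes):

/*
 * Sketch: hand a dynamic shared memory segment to a background worker by
 * passing its handle through bgw_main_arg.  The dsm_* function names and
 * signatures are assumptions for illustration, not the committed API.
 */
#include "postgres.h"
#include "postmaster/bgworker.h"
#include "storage/dsm.h"

/* Worker entry point: the handle comes back out of the Datum. */
static void
demo_worker_main(Datum main_arg)
{
    dsm_handle   handle = DatumGetUInt32(main_arg);     /* assumed handle type */
    dsm_segment *seg = dsm_attach(handle);              /* assumed signature */
    void        *base = dsm_segment_address(seg);

    /* ... do the actual work against "base" here ... */
    (void) base;
    dsm_detach(seg);
}

/* Registering backend: create the segment and pass its handle along. */
static void
demo_start_worker(void)
{
    dsm_segment *seg = dsm_create(8192);                /* assumed signature */
    BackgroundWorker worker;

    memset(&worker, 0, sizeof(worker));
    snprintf(worker.bgw_name, BGW_MAXLEN, "dsm demo worker");
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
    worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
    worker.bgw_restart_time = BGW_NEVER_RESTART;
    worker.bgw_main = demo_worker_main;
    worker.bgw_main_arg = UInt32GetDatum(dsm_segment_handle(seg));

    RegisterDynamicBackgroundWorker(&worker, NULL);
}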

 Thanks to you and the rest of the folks at EnterpriseDB... dynamic shared
 memory is something we've needed forever! :)

Thanks.

 Other comments...

 + * If the state file is empty or the contents are garbled, it probably means
 + * that the operating system rebooted before the data written by the previous
 + * postmaster made it to disk.  In that case, we can just ignore it; any shared
 + * memory from before the reboot should be gone anyway.

 I'm a bit concerned about this; I know it was possible in older versions for
 the global shared memory context to be left behind after a crash, needing to
 be cleaned up by hand. Dynamic shared mem potentially multiplies that by
 100 or more. I think it'd be worth changing dsm_write_state_file so it
 always writes a new file and then does an atomic mv (or something similar).

I agree that the possibilities for leftover shared memory segments are
multiplied with this new facility, and I've done my best to address
that.  However, I don't agree that writing the state file in a
different way would improve anything.
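
For reference, what Jim is describing is the standard
write-a-temp-file-then-rename pattern; a generic POSIX sketch (file
names invented, and this is not dsm.c's actual code):

/*
 * Generic sketch of the write-temp-file-then-rename pattern; the file names
 * and layout are invented for illustration and are not dsm.c's actual code.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int
write_state_file_atomically(const char *path, const char *contents)
{
    char    tmppath[1024];
    int     fd;
    size_t  len = strlen(contents);

    snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);

    fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* Write the new contents and force them out to disk. */
    if (write(fd, contents, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        unlink(tmppath);
        return -1;
    }
    if (close(fd) != 0)
    {
        unlink(tmppath);
        return -1;
    }

    /*
     * Atomically replace the old file with the new one.  (A fully durable
     * version would also fsync the containing directory afterwards.)
     */
    if (rename(tmppath, path) != 0)
    {
        unlink(tmppath);
        return -1;
    }
    return 0;
}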

 +* If some other backend exited uncleanly, it might have corrupted the
 +* control segment while it was dying.  In that case, we warn and ignore
 +* the contents of the control segment.  This may end up leaving behind
 +* stray shared memory segments, but there's not much we can do about
 +* that if the metadata is gone.

 Similar concern... in this case, would it be possible to always write
 updates to an un-used slot and then atomically update a pointer? This would
 be more work than what I suggested above, so maybe just a TODO for now...

 Though... is there anything a dying backend could do that would corrupt the
 control segment to the point that it would screw up segments allocated by
 other backends and not related to the dead backend? Like marking a slot as
 not used when it is still in use and isn't associated with the dead backend?

Sure.  A messed-up backend can clobber the control segment just as it
can clobber anything else in shared memory.  There's really no way
around that problem.  If the control segment has been overwritten by a
memory stomp, we can't use it to clean up.  There's no way around that
problem except to not use the control segment at all, which wouldn't be better.

 Should this (in dsm_attach)

 +* If you're hitting this error, you probably want to use attempt to

 be

 +* If you're hitting this error, you probably want to attempt to

 ?

Good point.

 Should dsm_impl_op sanity check the arguments after op? I didn't notice
 checks in the type-specific code but I also didn't read all of it... are we
 just depending on the OS to sanity-check?

Sanity-check for what?

 Also, does the GUC code enforce that the GUC must always be something that's
 supported? If not then the error in dsm_impl_op should be more
 user-friendly.

Yes.

 I basically stopped reading after dsm_impl_op... the rest of the stuff was
 rather over my head.

:-)

Thanks for your interest!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] dynamic shared memory

2013-08-31 Thread Robert Haas
On Fri, Aug 30, 2013 at 11:45 AM, Andres Freund and...@2ndquadrant.com wrote:
 The way I've designed it, no.  If what we expect to be the control
 segment doesn't exist or doesn't conform to our expectations, we just
 assume that it's not really the control segment after all - e.g.
 someone rebooted, clearing all the segments, and then an unrelated
 process (malicious, perhaps, or just a completely different cluster)
 reused the same name.  This is similar to what we do for the main
 shared memory segment.

 The case I am mostly wondering about is some process crashing and
 overwriting random memory. We need to be pretty sure that we'll never
 fail partially through cleaning up old segments because they are
 corrupted or because we died halfway through our last cleanup attempt.

Right.  I had those considerations in mind and I believe I have nailed
the hatch shut pretty tight.  The cleanup code is designed never to
die with an error.  Of course it might, but it would have to be
something like an out of memory failure or similar that isn't really
what we're concerned about here.  You are welcome to look for holes,
but these issues are where most of my brainpower went during
development.
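
The kind of sanity checking described above boils down to a
magic-number-and-version check on the control segment header before
trusting anything in it.  A purely conceptual sketch -- the struct
layout and names are invented, not the patch's actual format:

/*
 * Conceptual sketch of validating a purported control segment: check a magic
 * number, version, and self-reported size before trusting anything in it.
 * The struct layout is invented for illustration.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_CONTROL_MAGIC      0x50C0FFEE
#define DEMO_CONTROL_VERSION    1

typedef struct demo_control_header
{
    uint32_t magic;
    uint32_t version;
    uint64_t segment_size;          /* size the creator believed it had */
    uint32_t nitems;                /* number of tracked segments */
} demo_control_header;

static bool
control_segment_looks_valid(const void *mapped, size_t mapped_size)
{
    demo_control_header hdr;

    if (mapped_size < sizeof(hdr))
        return false;
    memcpy(&hdr, mapped, sizeof(hdr));

    /* If any of this doesn't match, assume it's not really ours. */
    return hdr.magic == DEMO_CONTROL_MAGIC &&
           hdr.version == DEMO_CONTROL_VERSION &&
           hdr.segment_size == (uint64_t) mapped_size;
}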

 That's true, but that decision has not been uncontroversial - e.g. the
 NetBSD guys don't like it, because they have a big performance
 difference between those two types of memory.  We have to balance the
 possible harm of one more setting against the benefit of letting
 people do what they want without needing to recompile or modify code.

 But then, it made them fix the issue afaik :P

Pah.  :-)

 You can look at it while the server's running.

 That's what debuggers are for.

Tough crowd.  I like it.  YMMV.

 I would never advocate deliberately trying to circumvent a
 carefully-considered OS-level policy decision about resource
 utilization, but I don't think that's the dynamic here.  I think if we
 insist on predetermining the dynamic shared memory implementation
 based on the OS, we'll just be inconveniencing people needlessly, or
 flat-out making things not work. [...]

 But using file-backed memory will *suck* performancewise. Why should we
 ever want to offer that to a user? That's what I was arguing about
 primarily.

I see.  There might be additional writeback traffic, but it might not
be that bad in common cases.  After all, the data's pretty hot.

 If we're SURE
 that a Linux user will prefer posix to sysv or mmap or none in
 100% of cases, and that a NetBSD user will always prefer sysv over
 mmap or none in 100% of cases, then, OK, sure, let's bake it in.
 But I'm not that sure.

 I think posix shmem will be preferred to sysv shmem if present, in just
 about any relevant case. I don't know of any system with lower limits on
 posix shmem than on sysv.

OK, how about this: SysV doesn't allow extending segments, but
mmap does.  The thing here is that you're saying "remove mmap and keep
sysv", but Noah suggested to me that we remove sysv and keep mmap.
This suggests to me that the picture is not as black and white as you
think it is.
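
To make the extending point concrete: with POSIX shared memory (or any
file-backed mapping) you can ftruncate() the object to a new size and
map it again, which a SysV segment created with shmget() does not
allow.  A rough sketch, with error handling omitted and names invented:

/*
 * Rough sketch of growing a POSIX shared memory segment: ftruncate() the
 * object to the new size and re-map it.  SysV segments created with shmget()
 * have a fixed size and offer no equivalent.  Error handling omitted for
 * brevity; names invented.
 */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

static void *
grow_posix_segment(const char *name, void *oldaddr, size_t oldsize, size_t newsize)
{
    int   fd = shm_open(name, O_RDWR, 0600);
    void *newaddr;

    ftruncate(fd, (off_t) newsize);     /* grow the underlying object */

    munmap(oldaddr, oldsize);           /* drop the old, smaller mapping */
    newaddr = mmap(NULL, newsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    close(fd);
    return newaddr;                     /* MAP_FAILED on error */
}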

 I shared your opinion that preferred_address is never going to be
 reliable, although FWIW Noah thinks it can be made reliable with a
 large-enough hammer.

 I think we need to have the arguments for that on list then. Those are
 pretty damn fundamental design decisions.
 I for one cannot see how you even remotely could make that work a) on
 windows (check the troubles we have to go through to get s_b
 consistently placed, and that's directly after startup) b) 32bit systems.

Noah?

 But even if it isn't reliable, there doesn't seem to be all that much
 value in forbidding access to that part of the OS-provided API.  In
 the world where it's not reliable, it may still be convenient to map
 things at the same address when you can, so that pointers can't be
 used.  Of course you'd have to have some fallback strategy for when
 you don't get the same mapping, and maybe that's painful enough that
 there's no point after all.  Or maybe it's worth having one code path
 for relativized pointers and another for non-relativized pointers.

 It seems likely to me that will end up with untested code in that
 case. Or even unsupported platforms.

Maybe.  I think for the amount of code we're talking about here, it's
not worth getting excited about.
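
For what it's worth, "relativized pointers" here just means storing
offsets from the segment base instead of raw addresses and converting
on access, so the mapping address no longer matters.  A toy sketch, not
taken from the patch:

/*
 * Toy sketch of "relativized" pointers: store offsets from the segment base
 * rather than absolute addresses, so the segment can be mapped at different
 * addresses in different processes.  Offset 0 doubles as NULL, so the first
 * byte of the segment is never handed out in this toy version.
 */
#include <stddef.h>
#include <stdint.h>

typedef uint64_t relptr;                /* byte offset within the segment */

static inline relptr
relptr_store(void *base, void *ptr)
{
    return ptr ? (relptr) ((char *) ptr - (char *) base) : 0;
}

static inline void *
relptr_access(void *base, relptr off)
{
    return off ? (void *) ((char *) base + off) : NULL;
}

/* Example: a linked-list node that is safe to share across mappings. */
typedef struct shared_node
{
    int     value;
    relptr  next;                       /* offset of the next node, not a pointer */
} shared_node;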

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE

2013-08-31 Thread Andres Freund
Hi,

On 2013-08-30 19:38:24 -0700, Peter Geoghegan wrote:
 On Fri, Aug 30, 2013 at 5:47 PM, Andres Freund and...@2ndquadrant.com wrote:
  While I ponder on it some more, could you explain whether I understood
  something correctly? Consider the following scenario:
 
  CREATE TABLE foo(a int UNIQUE, b int UNIQUE);
 
  S1: INSERT INTO foo(0, 0);
  S1: BEGIN;
  S1: INSERT INTO foo(1, 1);
  S1: SELECT pg_sleep(3600);
  S2: BEGIN;
  S2: INSERT INTO foo(2, 1);
  S3: SELECT * FROM foo WHERE a = 0;
 
  Won't S2 in this case exclusively lock a page in foo_a_key (the one
  where 2 will reside on) for 3600s, while it's waiting for S1 to finish
  while getting the speculative insertion into foo_b_key?
 
 Yes. The way the patch currently handles that case is unacceptable. It
 needs to be more concerned about holding an exclusive lock on
 earlier-locked indexes when locking the second and subsequent indexes
 iff it blocks on another transaction in the course of locking the
 second and subsequent indexes.

After some thinking I don't think any solution primarily based on
holding page level locks across other index operations is going to scale
ok.

What I had in mind when playing around with this is the following trick:
just as in your patch, split index insertion into two phases and perform
one of them before inserting the heap tuple. In what you called
"speculative insertion", don't stop at locking the page, but actually
insert an index tuple into the index. As we haven't yet inserted the heap
tuple, we obviously can't insert an actual tuple. Instead, set a flag on
the item and reuse the item pointer to store the xid of the
inserting process. I coined those items "insertion promises"...

Index searches will ignore promises they see, but concurrent index
insertions will notice it's a promise and wait for the xid (or not, if
it has aborted). That fits pretty well into the existing btree code.  If
the insertion of an index promise worked for all unique indexes and the
heap tuple has been inserted, convert those promises into actual item
pointers pointing to the heap tuple. That probably requires a second
search and some cuteness to find the correct pointer because of
concurrent splits, but that's not too bad (possibly you can do away with
that if we continue to hold a pin).
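
To make that a little more concrete, here is a purely conceptual sketch
of what a promise item would carry compared to a normal index item.
These structs are illustrative only and do not match the actual nbtree
item format:

/*
 * Conceptual sketch only: an index item is either a normal entry pointing at
 * a heap tuple, or a "promise" carrying the inserter's xid in the space the
 * heap pointer would normally occupy.  These structs are illustrative and do
 * not match PostgreSQL's actual on-disk btree format.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t demo_xid;

typedef struct demo_heap_pointer
{
    uint32_t block;                 /* heap block number */
    uint16_t offset;                /* line pointer within the block */
} demo_heap_pointer;

typedef struct demo_index_item
{
    bool is_promise;                /* set during speculative insertion */
    union
    {
        demo_heap_pointer htid;     /* normal item: points at the heap tuple */
        demo_xid inserter_xid;      /* promise: wait on (or ignore) this xid */
    } payload;
} demo_index_item;

/*
 * For a concurrent unique insertion that found an equal-keyed item: a real
 * tuple triggers the normal uniqueness check, while a promise blocks only
 * while the promising transaction is still in progress.  Plain index
 * searches simply skip promises.
 */
static bool
conflicting_item_blocks_insertion(const demo_index_item *item, bool xid_still_running)
{
    if (!item->is_promise)
        return true;
    return xid_still_running;
}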

The fact that now xids can be in the index creates some slight
complications because it now needs to be vacuumed - but that's
relatively easy to fit into the bulkdelete api and I don't think the
contortions to the vacuum code will be too bad.

Imo that solution is fairly elegant because it doesn't cause any heap
bloat, and it causes the same amount of index bloat that
insert-into-heap-first type solutions cause. It also has pretty
good concurrency behaviour because you never need to retain a page lock
for longer than a normal index insertion.
It's certainly a bit more complex than your solution tho...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [v9.4] row level security

2013-08-31 Thread Stephen Frost
KaiGai,

* Kohei KaiGai (kai...@kaigai.gr.jp) wrote:
 The point we shouldn't forget is information leakage via covert-channel
 has less grade of threat than leakage via main-channel, because of
 much less bandwidth.

While true, this argument can't be used to simply ignore any and all
covert channels.  There are covert channels which are *not* much lower
bandwidth, and the Filtered Rows one is one of those - it's simply too
big to ignore.  There are likely others which are equally large, and
until we clean those up our RLS implementation will be questioned by
our users.

This does not mean that we need to clean up all covert channels;
things which are clearly intractable don't need to be addressed (eg:
the unique constraint situation, where we can't both guarantee
uniqueness and hide the existence of an entry).

 The security community also concludes this is not avoidable by nature, as
 long as humans can observe system behavior and estimate something from it,

This particular case is actually beyond 'estimation'. 

 Even if attacker has enough knowledge, the ratio they can estimate
 hidden values is very limited because of much less bandwidth of
 covert channels.

You really need to back away from this argument in this case.  The
example shown isn't based on estimation and provides quite high
bandwidth because it can be run by a computer.  This argument can't
be used for every situation where information is leaked.

 However, in general, it is impossible to eliminate these entirely, and the
 smaller bandwidth means a lesser degree of threat. So, I don't think this
 is an issue worth spending much effort to solve.

I don't see a lot of complexity required to fix this specific case.  A
great deal of effort will be involved in going through the rest of the
code and various options and working to eliminate other similar cases,
but that's a reality we need to deal with.  Hopefully the cases found
will not require a lot of additional code to deal with.  We will need to
be mindful of new code which adds leaks or changes behavior such that
RLS doesn't function properly (hopefully, sufficient regression tests
will help to address that as well).

Thanks,

Stephen




Re: [HACKERS] CREATE FUNCTION .. SET vs. pg_dump

2013-08-31 Thread Stefan Kaltenbrunner
On 08/18/2013 05:40 PM, Tom Lane wrote:
 Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:
 While working on upgrading the database of the search system on
 postgresql.org to 9.2 I noticed that the dumps that pg_dump generates on
 that system are actually invalid and cannot be reloaded without being
 hacked on manually...
 
 CREATE TEXT SEARCH CONFIGURATION pg (
 PARSER = pg_catalog.default );
 
 CREATE FUNCTION test() RETURNS INTEGER
 LANGUAGE sql SET default_text_search_config TO 'public.pg' AS $$
 SELECT 1;
 $$;
 
 once you dump that you will end up with an invalid dump because the
 function will be dumped before the actual text search configuration is
 (re)created.
 
 I don't think it will work to try to fix this by reordering the dump;
 it's too easy to imagine scenarios where that would lead to circular
 ordering constraints.  What seems like a more workable answer is for
 CREATE FUNCTION to not attempt to validate SET clauses when
 check_function_bodies is off, or at least not throw a hard error when
 the validation fails.  (I see for instance that if you try
 ALTER ROLE joe SET default_text_search_config TO nonesuch;
 you just get a notice and not an error.)
 
 However, I don't recall if there are any places where we assume the
 SET info was pre-validated by CREATE/ALTER FUNCTION.

any further insights into that issue? - seems a bit silly to have an
open bug that actually prevents us from taking (restorable) backups of
the search system on our own website...



Stefan




Re: [HACKERS] CREATE FUNCTION .. SET vs. pg_dump

2013-08-31 Thread Stephen Frost
* Stefan Kaltenbrunner (ste...@kaltenbrunner.cc) wrote:
 On 08/18/2013 05:40 PM, Tom Lane wrote:
  Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:
  While working on upgrading the database of the search system on
  postgresql.org to 9.2 I noticed that the dumps that pg_dump generates on
  that system are actually invalid and cannot be reloaded without being
  hacked on manually...
  
  CREATE TEXT SEARCH CONFIGURATION pg (
  PARSER = pg_catalog.default );
  
  CREATE FUNCTION test() RETURNS INTEGER
  LANGUAGE sql SET default_text_search_config TO 'public.pg' AS $$
  SELECT 1;
  $$;
  
  once you dump that you will end up with an invalid dump because the
  function will be dumped before the actual text search configuration is
  (re)created.
  
  I don't think it will work to try to fix this by reordering the dump;
  it's too easy to imagine scenarios where that would lead to circular
  ordering constraints.  What seems like a more workable answer is for
  CREATE FUNCTION to not attempt to validate SET clauses when
  check_function_bodies is off, or at least not throw a hard error when
  the validation fails.  (I see for instance that if you try
  ALTER ROLE joe SET default_text_search_config TO nonesuch;
  you just get a notice and not an error.)
  
  However, I don't recall if there are any places where we assume the
  SET info was pre-validated by CREATE/ALTER FUNCTION.
 
 any further insights into that issue? - seems a bit silly to have an
 open bug that actually prevents us from taking (restorable) backups of
 the search system on our own website...

It would seem that a simple solution would be to add an elevel argument
to ProcessGUCArray and then call it with NOTICE in the case that
check_function_bodies is true.  None of the contrib modules call
ProcessGUCArray, but should we worry that some external module does?

This doesn't address Tom's concern that we may trust in the SET to
ensure that the value stored is valid.  That seems like it'd be pretty
odd given how we typically handle GUCs, but I've not done a
comprehensive review to be sure.

Like Stefan, I'd really like to see this fixed, and sooner rather than
later, so we can continue the process of upgrading our systems to 9.2.

Thanks,

Stephen




Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE

2013-08-31 Thread Peter Geoghegan
On Sat, Aug 31, 2013 at 11:34 AM, Andres Freund and...@2ndquadrant.com wrote:
  Won't S2 in this case exclusively lock a page in foo_a_key (the one
  where 2 will reside on) for 3600s, while it's waiting for S1 to finish
  while getting the speculative insertion into foo_b_key?

 Yes. The way the patch currently handles that case is unacceptable. It
 needs to be more concerned about holding an exclusive lock on
 earlier-locked indexes when locking the second and subsequent indexes
 iff it blocks on another transaction in the course of locking the
 second and subsequent indexes.

I think that the fix here may be to do something analogous to what the
code already does: release all buffer locks, wait for the other
transaction to commit/abort, and try again. The difference being that
we try again across all unique indexes rather than just this one. I
have other obligations to take care of over the next couple of days,
so I don't really have time to code that up and satisfy myself that
it's robust; I'd prefer to have something more concrete to back this
thought up, but that's my immediate reaction for what it's worth.

We're looking for the first duplicate. So it would probably be okay
for the IGNORE case to not bother retrying and re-locking if the other
transaction committed (which, at least very broadly speaking, is the
likely outcome).

 After some thinking I don't think any solution primarily based on
 holding page level locks across other index operations is going to scale
 ok.

Just to be clear: I'm certainly not suggesting that it's okay to have
exclusive buffer locks held for any amount of time longer than
instantaneous. Furthermore, it isn't necessary to do anything
different to what we currently do with non-unique btree indexes during
speculative insertion - I don't treat them any differently, and they
only get locked in the usual way at the usual time, after heap
insertion. Much of the time, there'll only be a primary key index to
lock, and you'll have increased the window of the exclusive write-lock
(compared to the existing code when doing a regular insertion) by only
a very small amount - the duration of heap tuple insertion (the
baseline here is the duration of finding an insertion location on page
for index tuple, tuple insertion + possible page split). We're talking
about a write-lock on a btree leaf page, which is only going to block
other sessions from *finishing off* tuple insertion, and then only for
a minority of such insertions much of the time.

It's probably worth doing as much work up-front as possible (in terms
of asking questions about if we even want to lock the indexes and so
on, within ExecLockIndexTuples()), so as to decrease the window that
those locks are held. That's just an obvious optimization.

As users add unique indexes to tables, things do of course get worse
and worse. However, since in order for any of this to matter, there
has to be lots of concurrent activity in clustered ranges of values,
it's pretty likely that you'd conclude that tuple insertion should not
go ahead early on, and not end up doing too much exclusive/write
locking per tuple.

The question must be asked: if you don't think this will scale okay,
what are you comparing it to? The most useful basis of comparison is
likely to be other systems. Asking about uniqueness across many unique
indexes, and getting them all to commit to their unanimous position
for a moment in time is inherently involved - I don't see a way to do
it other than lock values, and I don't see any granularity at which
that's possible other than the btree page granularity.

 What I had in mind when playing around with this is the following trick:
 just as in your patch, split index insertion into two phases and perform
 one of them before inserting the heap tuple. In what you called
 "speculative insertion", don't stop at locking the page, but actually
 insert an index tuple into the index. As we haven't yet inserted the heap
 tuple, we obviously can't insert an actual tuple. Instead, set a flag on
 the item and reuse the item pointer to store the xid of the
 inserting process. I coined those items "insertion promises"...

I did consider that sort of approach from an early stage - Simon
suggested it to me privately early last year. Actually, I even pointed
out in the 2012 developer meeting that there were unused IndexTuple
flag bits available for such purposes.

 Index searches will ignore promises they see, but concurrent index
 insertions will notice it's a promise and wait for the xid (or not, if
 it has aborted).

If inserters must block, they'll have to restart - I'm sure you'll
agree that they can't sit on even a shared/read buffer lock
indefinitely. So, have you really saved anything? Especially since
you'll have to search at least twice (more shared locks on *root*
pages) where my scheme only needs to search once. Also, are you really
sure that index searches should have license to ignore these promises?
What about dirty snapshots?


Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE

2013-08-31 Thread Peter Geoghegan
On Sat, Aug 31, 2013 at 7:38 PM, Peter Geoghegan p...@heroku.com wrote:
 Imo that solution is fairly elegant because it doesn't cause any heap
 bloat, and it causes the same amount of index bloat that
 insert-into-heap-first type solutions cause.

 I don't think that's a reasonable comparison. Bloating indexes with
 dead *duplicate* tuples is, in general, worse than doing the same with
 the heap. If you end up with a heavily bloated index with lots of
 duplicates, that will adversely affect performance worse than with
 non-duplicate index tuples (see [1] for example). That was one reason
 that HOT helped as much as it did, I believe.

Bear in mind also that one unique index could hardly ever have real
duplicates, but it would still get broken promise/bloat tuples
because another, later index said it had a duplicate. All of the
indexes get bloated (quite possibly) just because of a duplicate in
one. With a table with many unique indexes and many reasons to decide
to reject a tuple proposed for insertion by INSERT...ON DUPLICATE KEY
IGNORE, it isn't hard to imagine them all becoming heavily bloated
very quickly.

-- 
Peter Geoghegan

