Re: [HACKERS] dynamic shared memory

2013-08-31 Thread Robert Haas
On Thu, Aug 29, 2013 at 8:12 PM, Jim Nasby j...@nasby.net wrote:
 On 8/13/13 8:09 PM, Robert Haas wrote:
 is removed, the segment automatically goes away (we could allow for
 server-lifespan segments as well with only trivial changes, but I'm
 not sure whether there are compelling use cases for that).

 To clarify... you're talking something that would intentionally survive
 postmaster restart? I don't see use for that either...

No, I meant something that would live as long as the postmaster and
die when it dies.

 Ignorant question... is ResourceOwner related to memory contexts? If not,
 would memory contexts be a better way to handle memory segment cleanup?

Nope.  :-)

 There are quite a few problems that this patch does not solve.  First,

 It also doesn't provide any mechanism for notifying backends of a new
 segment. Arguably that's beyond the scope of dsm.c, but ISTM that it'd be
 useful to have a standard method or three of doing that; perhaps just some
 convenience functions wrapping the methods mentioned in comments.

I don't see that as being generally useful.  Backends need to know
more than just that there's a new segment, and in fact most backends
won't care about most new segments.  A background worker needs to know about
the new segment *that it should attach*, but we have bgw_main_arg.  If
we end up using this facility for any system-wide purposes, I imagine
we'll do that by storing the segment ID in the main shared memory
segment someplace.
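
To illustrate the bgw_main_arg point a bit (a rough sketch only -- the
dsm_* calls and bgworker fields used here are assumptions for
illustration and may not match what the patch actually exposes):

/*
 * Sketch: hand a dynamic shared memory segment to a background worker by
 * passing its handle through bgw_main_arg.  The dsm_* function names and
 * signatures are assumptions for illustration, not the committed API.
 */
#include "postgres.h"
#include "postmaster/bgworker.h"
#include "storage/dsm.h"

/* Worker entry point: the handle comes back out of the Datum. */
static void
demo_worker_main(Datum main_arg)
{
    dsm_handle   handle = DatumGetUInt32(main_arg);     /* assumed handle type */
    dsm_segment *seg = dsm_attach(handle);              /* assumed signature */
    void        *base = dsm_segment_address(seg);

    /* ... do the actual work against "base" here ... */
    (void) base;
    dsm_detach(seg);
}

/* Registering backend: create the segment and pass its handle along. */
static void
demo_start_worker(void)
{
    dsm_segment *seg = dsm_create(8192);                /* assumed signature */
    BackgroundWorker worker;

    memset(&worker, 0, sizeof(worker));
    snprintf(worker.bgw_name, BGW_MAXLEN, "dsm demo worker");
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
    worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
    worker.bgw_restart_time = BGW_NEVER_RESTART;
    worker.bgw_main = demo_worker_main;
    worker.bgw_main_arg = UInt32GetDatum(dsm_segment_handle(seg));

    RegisterDynamicBackgroundWorker(&worker, NULL);
}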

 Thanks to you and the rest of the folks at EnterpriseDB... dynamic shared
 memory is something we've needed forever! :)

Thanks.

 Other comments...

 + * If the state file is empty or the contents are garbled, it probably means
 + * that the operating system rebooted before the data written by the previous
 + * postmaster made it to disk.  In that case, we can just ignore it; any shared
 + * memory from before the reboot should be gone anyway.

 I'm a bit concerned about this; I know it was possible in older versions for
 the global shared memory context to be left behind after a crash, needing to
 be cleaned up by hand. Dynamic shared mem potentially multiplies that by
 100 or more. I think it'd be worth changing dsm_write_state_file so it
 always writes a new file and then does an atomic mv (or something similar).

I agree that the possibilities for leftover shared memory segments are
multiplied with this new facility, and I've done my best to address
that.  However, I don't agree that writing the state file in a
different way would improve anything.
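
For reference, what Jim is describing is the standard
write-a-temp-file-then-rename pattern; a generic POSIX sketch (file
names invented, and this is not dsm.c's actual code):

/*
 * Generic sketch of the write-temp-file-then-rename pattern; the file names
 * and layout are invented for illustration and are not dsm.c's actual code.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int
write_state_file_atomically(const char *path, const char *contents)
{
    char    tmppath[1024];
    int     fd;
    size_t  len = strlen(contents);

    snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);

    fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* Write the new contents and force them out to disk. */
    if (write(fd, contents, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        unlink(tmppath);
        return -1;
    }
    if (close(fd) != 0)
    {
        unlink(tmppath);
        return -1;
    }

    /*
     * Atomically replace the old file with the new one.  (A fully durable
     * version would also fsync the containing directory afterwards.)
     */
    if (rename(tmppath, path) != 0)
    {
        unlink(tmppath);
        return -1;
    }
    return 0;
}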

 +* If some other backend exited uncleanly, it might have corrupted the
 +* control segment while it was dying.  In that case, we warn and ignore
 +* the contents of the control segment.  This may end up leaving behind
 +* stray shared memory segments, but there's not much we can do about
 +* that if the metadata is gone.

 Similar concern... in this case, would it be possible to always write
 updates to an un-used slot and then atomically update a pointer? This would
 be more work than what I suggested above, so maybe just a TODO for now...

 Though... is there anything a dying backend could do that would corrupt the
 control segment to the point that it would screw up segments allocated by
 other backends and not related to the dead backend? Like marking a slot as
 not used when it is still in use and isn't associated with the dead backend?

Sure.  A messed-up backend can clobber the control segment just as it
can clobber anything else in shared memory.  There's really no way
around that problem.  If the control segment has been overwritten by a
memory stomp, we can't use it to clean up.  There's no way around that
problem except to not use the control segment at all, which wouldn't be better.

 Should this (in dsm_attach)

 +* If you're hitting this error, you probably want to use attempt to

 be

 +* If you're hitting this error, you probably want to attempt to

 ?

Good point.

 Should dsm_impl_op sanity check the arguments after op? I didn't notice
 checks in the type-specific code but I also didn't read all of it... are we
 just depending on the OS to sanity-check?

Sanity-check for what?

 Also, does the GUC code enforce that the GUC must always be something that's
 supported? If not then the error in dsm_impl_op should be more
 user-friendly.

Yes.

 I basically stopped reading after dsm_impl_op... the rest of the stuff was
 rather over my head.

:-)

Thanks for your interest!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] dynamic shared memory

2013-08-31 Thread Robert Haas
On Fri, Aug 30, 2013 at 11:45 AM, Andres Freund and...@2ndquadrant.com wrote:
 The way I've designed it, no.  If what we expect to be the control
 segment doesn't exist or doesn't conform to our expectations, we just
 assume that it's not really the control segment after all - e.g.
 someone rebooted, clearing all the segments, and then an unrelated
 process (malicious, perhaps, or just a completely different cluster)
 reused the same name.  This is similar to what we do for the main
 shared memory segment.

 The case I am mostly wondering about is some process crashing and
 overwriting random memory. We need to be pretty sure that we'll never
 fail partially through cleaning up old segments because they are
 corrupted or because we died halfway through our last cleanup attempt.

Right.  I had those considerations in mind and I believe I have nailed
the hatch shut pretty tight.  The cleanup code is designed never to
die with an error.  Of course it might, but it would have to be
something like an out of memory failure or similar that isn't really
what we're concerned about here.  You are welcome to look for holes,
but these issues are where most of my brainpower went during
development.
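
The kind of sanity checking described above boils down to a
magic-number-and-version check on the control segment header before
trusting anything in it.  A purely conceptual sketch -- the struct
layout and names are invented, not the patch's actual format:

/*
 * Conceptual sketch of validating a purported control segment: check a magic
 * number, version, and self-reported size before trusting anything in it.
 * The struct layout is invented for illustration.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_CONTROL_MAGIC      0x50C0FFEE
#define DEMO_CONTROL_VERSION    1

typedef struct demo_control_header
{
    uint32_t magic;
    uint32_t version;
    uint64_t segment_size;          /* size the creator believed it had */
    uint32_t nitems;                /* number of tracked segments */
} demo_control_header;

static bool
control_segment_looks_valid(const void *mapped, size_t mapped_size)
{
    demo_control_header hdr;

    if (mapped_size < sizeof(hdr))
        return false;
    memcpy(&hdr, mapped, sizeof(hdr));

    /* If any of this doesn't match, assume it's not really ours. */
    return hdr.magic == DEMO_CONTROL_MAGIC &&
           hdr.version == DEMO_CONTROL_VERSION &&
           hdr.segment_size == (uint64_t) mapped_size;
}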

 That's true, but that decision has not been uncontroversial - e.g. the
 NetBSD guys don't like it, because they have a big performance
 difference between those two types of memory.  We have to balance the
 possible harm of one more setting against the benefit of letting
 people do what they want without needing to recompile or modify code.

 But then, it made them fix the issue afaik :P

Pah.  :-)

 You can look at it while the server's running.

 That's what debuggers are for.

Tough crowd.  I like it.  YMMV.

 I would never advocate deliberately trying to circumvent a
 carefully-considered OS-level policy decision about resource
 utilization, but I don't think that's the dynamic here.  I think if we
 insist on predetermining the dynamic shared memory implementation
 based on the OS, we'll just be inconveniencing people needlessly, or
 flat-out making things not work. [...]

 But using file-backed memory will *suck* performancewise. Why should we
 ever want to offer that to a user? That's what I was arguing about
 primarily.

I see.  There might be additional writeback traffic, but it might not
be that bad in common cases.  After all, the data's pretty hot.

 If we're SURE
 that a Linux user will prefer posix to sysv or mmap or none in
 100% of cases, and that a NetBSD user will always prefer sysv over
 mmap or none in 100% of cases, then, OK, sure, let's bake it in.
 But I'm not that sure.

 I think posix shmem will be preferred to sysv shmem if present, in just
 about any relevant case. I don't know of any system with lower limits on
 posix shmem than on sysv.

OK, how about this: SysV doesn't allow extending segments, but
mmap does.  The thing here is that you're saying "remove mmap and keep
sysv", but Noah suggested to me that we remove sysv and keep mmap.
This suggests to me that the picture is not as black and white as you
think it is.
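
To make the extending point concrete: with POSIX shared memory (or any
file-backed mapping) you can ftruncate() the object to a new size and
map it again, which a SysV segment created with shmget() does not
allow.  A rough sketch, with error handling omitted and names invented:

/*
 * Rough sketch of growing a POSIX shared memory segment: ftruncate() the
 * object to the new size and re-map it.  SysV segments created with shmget()
 * have a fixed size and offer no equivalent.  Error handling omitted for
 * brevity; names invented.
 */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

static void *
grow_posix_segment(const char *name, void *oldaddr, size_t oldsize, size_t newsize)
{
    int   fd = shm_open(name, O_RDWR, 0600);
    void *newaddr;

    ftruncate(fd, (off_t) newsize);     /* grow the underlying object */

    munmap(oldaddr, oldsize);           /* drop the old, smaller mapping */
    newaddr = mmap(NULL, newsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    close(fd);
    return newaddr;                     /* MAP_FAILED on error */
}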

 I shared your opinion that preferred_address is never going to be
 reliable, although FWIW Noah thinks it can be made reliable with a
 large-enough hammer.

 I think we need to have the arguments for that on list then. Those are
 pretty damn fundamental design decisions.
 I for one cannot see how you even remotely could make that work a) on
 windows (check the troubles we have to go through to get s_b
 consistently placed, and that's directly after startup) b) 32bit systems.

Noah?

 But even if it isn't reliable, there doesn't seem to be all that much
 value in forbidding access to that part of the OS-provided API.  In
 the world where it's not reliable, it may still be convenient to map
 things at the same address when you can, so that pointers can't be
 used.  Of course you'd have to have some fallback strategy for when
 you don't get the same mapping, and maybe that's painful enough that
 there's no point after all.  Or maybe it's worth having one code path
 for relativized pointers and another for non-relativized pointers.

 It seems likely to me that will end up with untested code in that
 case. Or even unsupported platforms.

Maybe.  I think for the amount of code we're talking about here, it's
not worth getting excited about.
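
For what it's worth, "relativized pointers" here just means storing
offsets from the segment base instead of raw addresses and converting
on access, so the mapping address no longer matters.  A toy sketch, not
taken from the patch:

/*
 * Toy sketch of "relativized" pointers: store offsets from the segment base
 * rather than absolute addresses, so the segment can be mapped at different
 * addresses in different processes.  Offset 0 doubles as NULL, so the first
 * byte of the segment is never handed out in this toy version.
 */
#include <stddef.h>
#include <stdint.h>

typedef uint64_t relptr;                /* byte offset within the segment */

static inline relptr
relptr_store(void *base, void *ptr)
{
    return ptr ? (relptr) ((char *) ptr - (char *) base) : 0;
}

static inline void *
relptr_access(void *base, relptr off)
{
    return off ? (void *) ((char *) base + off) : NULL;
}

/* Example: a linked-list node that is safe to share across mappings. */
typedef struct shared_node
{
    int     value;
    relptr  next;                       /* offset of the next node, not a pointer */
} shared_node;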

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE

2013-08-31 Thread Andres Freund
Hi,

On 2013-08-30 19:38:24 -0700, Peter Geoghegan wrote:
 On Fri, Aug 30, 2013 at 5:47 PM, Andres Freund and...@2ndquadrant.com wrote:
  While I ponder on it some more, could you explain whether I understood
  something correctly? Consider the following scenario:
 
  CREATE TABLE foo(a int UNIQUE, b int UNIQUE);
 
  S1: INSERT INTO foo(0, 0);
  S1: BEGIN;
  S1: INSERT INTO foo(1, 1);
  S1: SELECT pg_sleep(3600);
  S2: BEGIN;
  S2: INSERT INTO foo(2, 1);
  S3: SELECT * FROM foo WHERE a = 0;
 
  Won't S2 in this case exclusively lock a page in foo_a_key (the one
  where 2 will reside on) for 3600s, while it's waiting for S1 to finish
  while getting the speculative insertion into foo_b_key?
 
 Yes. The way the patch currently handles that case is unacceptable. It
 needs to be more concerned about holding an exclusive lock on
 earlier-locked indexes when locking the second and subsequent indexes
 iff it blocks on another transaction in the course of locking the
 second and subsequent indexes.

After some thinking I don't think any solution primarily based on
holding page level locks across other index operations is going to scale
ok.

What I had in mind when playing around with this is the following trick:
just as in your patch, split index insertion into two phases and perform
one of them before inserting the heap tuple. In what you called
"speculative insertion", don't stop at locking the page, but actually
insert an index tuple into the index. As we haven't yet inserted the heap
tuple, we obviously can't insert an actual tuple. Instead, set a flag on
the item and reuse the item pointer to store the xid of the
inserting process. I coined those items "insertion promises"...

Index searches will ignore promises they see, but concurrent index
insertions will notice it's a promise and wait for the xid (or not, if
it has aborted). That fits pretty well into the existing btree code.  If
the insertion of an index promise worked for all unique indexes and the
heap tuple has been inserted, convert those promises into actual item
pointers pointing to the heap tuple. That probably requires a second
search and some cuteness to find the correct pointer because of
concurrent splits, but that's not too bad (possibly you can do away with
that if we continue to hold a pin).
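
To make that a little more concrete, here is a purely conceptual sketch
of what a promise item would carry compared to a normal index item.
These structs are illustrative only and do not match the actual nbtree
item format:

/*
 * Conceptual sketch only: an index item is either a normal entry pointing at
 * a heap tuple, or a "promise" carrying the inserter's xid in the space the
 * heap pointer would normally occupy.  These structs are illustrative and do
 * not match PostgreSQL's actual on-disk btree format.
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t demo_xid;

typedef struct demo_heap_pointer
{
    uint32_t block;                 /* heap block number */
    uint16_t offset;                /* line pointer within the block */
} demo_heap_pointer;

typedef struct demo_index_item
{
    bool is_promise;                /* set during speculative insertion */
    union
    {
        demo_heap_pointer htid;     /* normal item: points at the heap tuple */
        demo_xid inserter_xid;      /* promise: wait on (or ignore) this xid */
    } payload;
} demo_index_item;

/*
 * For a concurrent unique insertion that found an equal-keyed item: a real
 * tuple triggers the normal uniqueness check, while a promise blocks only
 * while the promising transaction is still in progress.  Plain index
 * searches simply skip promises.
 */
static bool
conflicting_item_blocks_insertion(const demo_index_item *item, bool xid_still_running)
{
    if (!item->is_promise)
        return true;
    return xid_still_running;
}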

The fact that now xids can be in the index creates some slight
complications because it now needs to be vacuumed - but that's
relatively easy to fit into the bulkdelete api and I don't think the
contortions to the vacuum code will be too bad.

Imo that solution is fairly elegant because it doesn't cause any heap
bloat, and it causes the same amount of index bloat that
insert-into-heap-first type solutions cause. It also has pretty
good concurrency behaviour because you never need to retain a page lock
for longer than a normal index insertion.
It's certainly a bit more complex than your solution tho...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [v9.4] row level security

2013-08-31 Thread Stephen Frost
KaiGai,

* Kohei KaiGai (kai...@kaigai.gr.jp) wrote:
 The point we shouldn't forget is information leakage via covert-channel
 has less grade of threat than leakage via main-channel, because of
 much less bandwidth.

While true, this argument can't be used to simply ignore any and all
covert channels.  There are covert channels which are *not* much lower
bandwidth, and the Filtered Rows one is one of those - it's simply too
big to ignore.  There are likely others which are equally large, and
until we clean those up our RLS implementation will be questioned by
our users.

This does not mean that we need to clean up all covert channels;
things which are clearly intractable don't need to be addressed (eg:
the unique constraint situation, where we can't both guarantee
uniqueness and hide the existence of an entry).

 The security community also concludes this is not avoidable by nature, as
 long as humans can observe system behavior and estimate something from it,

This particular case is actually beyond 'estimation'. 

 Even if attacker has enough knowledge, the ratio they can estimate
 hidden values is very limited because of much less bandwidth of
 covert channels.

You really need to back away from this argument in this case.  The
example shown isn't based on estimation and provides quite high
bandwidth because it can be run by a computer.  This argument can't
be used for every situation where information is leaked.

 However, in general, it is impossible to eliminate these entirely, and the
 smaller bandwidth means a lesser degree of threat. So, I don't think this
 is an issue worth spending much effort to solve.

I don't see a lot of complexity required to fix this specific case.  A
great deal of effort will be involved in going through the rest of the
code and various options and working to eliminate other similar cases,
but that's a reality we need to deal with.  Hopefully the cases found
will not require a lot of additional code to deal with.  We will need to
be mindful of new code which adds leaks or changes behavior such that
RLS doesn't function properly (hopefully, sufficient regression tests
will help to address that as well).

Thanks,

Stephen




Re: [HACKERS] CREATE FUNCTION .. SET vs. pg_dump

2013-08-31 Thread Stefan Kaltenbrunner
On 08/18/2013 05:40 PM, Tom Lane wrote:
 Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:
 While working on upgrading the database of the search system on
 postgresql.org to 9.2 I noticed that the dumps that pg_dump generates on
 that system are actually invalid and cannot be reloaded without being
 hacked on manually...
 
 CREATE TEXT SEARCH CONFIGURATION pg (
 PARSER = pg_catalog.default );
 
 CREATE FUNCTION test() RETURNS INTEGER
 LANGUAGE sql SET default_text_search_config TO 'public.pg' AS $$
 SELECT 1;
 $$;
 
 once you dump that you will end up with an invalid dump because the
 function will be dumped before the actual text search configuration is
 (re)created.
 
 I don't think it will work to try to fix this by reordering the dump;
 it's too easy to imagine scenarios where that would lead to circular
 ordering constraints.  What seems like a more workable answer is for
 CREATE FUNCTION to not attempt to validate SET clauses when
 check_function_bodies is off, or at least not throw a hard error when
 the validation fails.  (I see for instance that if you try
 ALTER ROLE joe SET default_text_search_config TO nonesuch;
 you just get a notice and not an error.)
 
 However, I don't recall if there are any places where we assume the
 SET info was pre-validated by CREATE/ALTER FUNCTION.

any further insights into that issue? - seems a bit silly to have an
open bug that actually prevents us from taking (restorable) backups of
the search system on our own website...



Stefan




Re: [HACKERS] CREATE FUNCTION .. SET vs. pg_dump

2013-08-31 Thread Stephen Frost
* Stefan Kaltenbrunner (ste...@kaltenbrunner.cc) wrote:
 On 08/18/2013 05:40 PM, Tom Lane wrote:
  Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:
  While working on upgrading the database of the search system on
  postgresql.org to 9.2 I noticed that the dumps that pg_dump generates on
  that system are actually invalid and cannot be reloaded without being
  hacked on manually...
  
  CREATE TEXT SEARCH CONFIGURATION pg (
  PARSER = pg_catalog.default );
  
  CREATE FUNCTION test() RETURNS INTEGER
  LANGUAGE sql SET default_text_search_config TO 'public.pg' AS $$
  SELECT 1;
  $$;
  
  once you dump that you will end up with an invalid dump because the
  function will be dumped before the actual text search configuration is
  (re)created.
  
  I don't think it will work to try to fix this by reordering the dump;
  it's too easy to imagine scenarios where that would lead to circular
  ordering constraints.  What seems like a more workable answer is for
  CREATE FUNCTION to not attempt to validate SET clauses when
  check_function_bodies is off, or at least not throw a hard error when
  the validation fails.  (I see for instance that if you try
  ALTER ROLE joe SET default_text_search_config TO nonesuch;
  you just get a notice and not an error.)
  
  However, I don't recall if there are any places where we assume the
  SET info was pre-validated by CREATE/ALTER FUNCTION.
 
 any further insights into that issue? - seems a bit silly to have an
 open bug that actually prevents us from taking (restorable) backups of
 the search system on our own website...

It would seem that a simple solution would be to add an elevel argument
to ProcessGUCArray and then call it with NOTICE in the case that
check_function_bodies is true.  None of the contrib modules call
ProcessGUCArray, but should we worry that some external module does?

This doesn't address Tom's concern that we may trust in the SET to
ensure that the value stored is valid.  That seems like it'd be pretty
odd given how we typically handle GUCs, but I've not done a
comprehensive review to be sure.

Like Stefan, I'd really like to see this fixed, and sooner rather than
later, so we can continue the process of upgrading our systems to 9.2.

Thanks,

Stephen




Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE

2013-08-31 Thread Peter Geoghegan
On Sat, Aug 31, 2013 at 11:34 AM, Andres Freund and...@2ndquadrant.com wrote:
  Won't S2 in this case exclusively lock a page in foo_a_key (the one
  where 2 will reside on) for 3600s, while it's waiting for S1 to finish
  while getting the speculative insertion into foo_b_key?

 Yes. The way the patch currently handles that case is unacceptable. It
 needs to be more concerned about holding an exclusive lock on
 earlier-locked indexes when locking the second and subsequent indexes
 iff it blocks on another transaction in the course of locking the
 second and subsequent indexes.

I think that the fix here may be to do something analogous to what the
code already does: release all buffer locks, wait for the other
transaction to commit/abort, and try again. The difference being that
we try again across all unique indexes rather than just this one. I
have other obligations to take care of over the next couple of days,
so I don't really have time to code that up and satisfy myself that
it's robust; I'd prefer to have something more concrete to back this
thought up, but that's my immediate reaction for what it's worth.

We're looking for the first duplicate. So it would probably be okay
for the IGNORE case to not bother retrying and re-locking if the other
transaction committed (which, at least very broadly speaking, is the
likely outcome).

 After some thinking I don't think any solution primarily based on
 holding page level locks across other index operations is going to scale
 ok.

Just to be clear: I'm certainly not suggesting that it's okay to have
exclusive buffer locks held for any amount of time longer than
instantaneous. Furthermore, it isn't necessary to do anything
different to what we currently do with non-unique btree indexes during
speculative insertion - I don't treat them any differently, and they
only get locked in the usual way at the usual time, after heap
insertion. Much of the time, there'll only be a primary key index to
lock, and you'll have increased the window of the exclusive write-lock
(compared to the existing code when doing a regular insertion) by only
a very small amount - the duration of heap tuple insertion (the
baseline here is the duration of finding an insertion location on page
for index tuple, tuple insertion + possible page split). We're talking
about a write-lock on a btree leaf page, which is only going to block
other sessions from *finishing off* tuple insertion, and then only for
a minority of such insertions much of the time.

It's probably worth doing as much work up-front as possible (in terms
of asking questions about if we even want to lock the indexes and so
on, within ExecLockIndexTuples()), so as to decrease the window that
those locks are held. That's just an obvious optimization.

As users add unique indexes to tables, things do of course get worse
and worse. However, since in order for any of this to matter, there
has to be lots of concurrent activity in clustered ranges of values,
it's pretty likely that you'd conclude that tuple insertion should not
go ahead early on, and not end up doing too much exclusive/write
locking per tuple.

The question must be asked: if you don't think this will scale okay,
what are you comparing it to? The most useful basis of comparison is
likely to be other systems. Asking about uniqueness across many unique
indexes, and getting them all to commit to their unanimous position
for a moment in time is inherently involved - I don't see a way to do
it other than lock values, and I don't see any granularity at which
that's possible other than the btree page granularity.

 What I had in mind when playing around with this is the following trick:
 just as in your patch, split index insertion into two phases and perform
 one of them before inserting the heap tuple. In what you called
 "speculative insertion", don't stop at locking the page, but actually
 insert an index tuple into the index. As we haven't yet inserted the heap
 tuple, we obviously can't insert an actual tuple. Instead, set a flag on
 the item and reuse the item pointer to store the xid of the
 inserting process. I coined those items "insertion promises"...

I did consider that sort of approach from an early stage - Simon
suggested it to me privately early last year. Actually, I even pointed
out in the 2012 developer meeting that there were unused IndexTuple
flag bits available for such purposes.

 Index searches will ignore promises they see, but concurrent index
 insertions will notice it's a promise and wait for the xid (or not, if
 it has aborted).

If inserters must block, they'll have to restart - I'm sure you'll
agree that they can't sit on even a shared/read buffer lock
indefinitely. So, have you really saved anything? Especially since
you'll have to search at least twice (more shared locks on *root*
pages) where my scheme only needs to search once. Also, are you really
sure that index searches should have license to ignore these promises?
What about dirty snapshots?


Re: [HACKERS] INSERT...ON DUPLICATE KEY IGNORE

2013-08-31 Thread Peter Geoghegan
On Sat, Aug 31, 2013 at 7:38 PM, Peter Geoghegan p...@heroku.com wrote:
 Imo that solution is fairly elegant because it doesn't cause any heap
 bloat, and it causes the same amount of index bloat that
 insert-into-heap-first type solutions cause.

 I don't think that's a reasonable comparison. Bloating indexes with
 dead *duplicate* tuples is, in general, worse than doing the same with
 the heap. If you end up with a heavily bloated index with lots of
 duplicates, that will adversely affect performance worse than with
 non-duplicate index tuples (see [1] for example). That was one reason
 that HOT helped as much as it did, I believe.

Bear in mind also that one unique index could hardly ever have real
duplicates, but it would still get broken promise/bloat tuples
because another, later index said it had a duplicate. All of the
indexes get bloated (quite possibly) just because of a duplicate in
one. With a table with many unique indexes and many reasons to decide
to reject a tuple proposed for insertion by INSERT...ON DUPLICATE KEY
IGNORE, it isn't hard to imagine them all becoming heavily bloated
very quickly.

-- 
Peter Geoghegan

