Re: [HACKERS] Minmax indexes

2013-09-27 Thread Amit Kapila
On Thu, Sep 26, 2013 at 1:46 AM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Amit Kapila wrote:
 On Sun, Sep 15, 2013 at 5:44 AM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:



   On Windows, patch gives below compilation errors:
   src\backend\access\minmax\mmtuple.c(96): error C2057: expected
 constant expression

 I have fixed all these compile errors (fix attached).  Thanks for
 reporting them.  I'll post a new version shortly.

   Thanks for fixing it. Over the last few days I have spent some time
reading about minmax-equivalent indexes in other databases (Netezza
and Oracle) and going through parts of your proposal. It's a fairly
big patch and needs much more time, but I would like to share the
findings and thoughts I have so far.

First, about the interface and use case: as far as I could tell,
other databases provide this index automatically rather than through a
separate CREATE INDEX command, perhaps because such an index is mainly
useful when the data is ordered, or distributed in a way that benefits
repeatedly executed queries. You have proposed it as a command, which
means the user needs to take care of it; I find that okay for a first
version, and later we could perhaps add optimisations so that it gets
created automatically.
For the page range, if I read correctly, you currently use a #define;
do you want to expose it to the user in some way, such as a GUC, or
keep it internal and assign the right value based on the performance
of different queries?
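
As an illustration of what a compile-time page-range setting implies, here is a
minimal standalone C sketch; the constant and function names are invented for
this example and are not taken from the patch.

#include <stdint.h>

typedef uint32_t BlockNumber;           /* stands in for PostgreSQL's BlockNumber */

#define PAGES_PER_RANGE 128             /* hypothetical compile-time default */

static BlockNumber
heap_block_to_range(BlockNumber heapBlk)
{
    /* range 0 summarizes blocks 0..127, range 1 blocks 128..255, and so on */
    return heapBlk / PAGES_PER_RANGE;
}

Whether this value is a #define, a GUC, or a per-index option only changes
where PAGES_PER_RANGE above comes from; the block-to-range mapping itself stays
the same.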

Operations on this index seem to be very fast: Oracle has it as an
in-memory structure, and I read that in Netezza write operations don't
carry any significant overhead for zone maps compared to other
indexes, so shouldn't we consider leaving it un-WAL-logged?
OTOH, because these structures get created automatically in those
databases that might be okay there, but if we provide it as a command,
the user might be surprised not to find it after a server restart.

Few Questions and observations:
1.
+ When a new heap tuple is inserted in a summarized page range, it is possible
+ to compare the existing index tuple with the new heap tuple.  If the heap
+ tuple is outside the minimum/maximum boundaries given by the index tuple for
+ any indexed column (or if the new heap tuple contains null values but the
+ index tuple indicates there are no nulls), it is necessary to create a new
+ index tuple with the new values.  To do this, a new index tuple is inserted,
+ and the reverse range map is updated to point to it.  The old index tuple is
+ left in place, for later garbage collection.


Is there a reason why we can't update the value in place rather than
inserting a new index tuple? As I understand it, for other indexes
like btree we do this because we might need to roll back, but here,
even if a rollback happens after updating the min or max value, it
will not cause any harm (no tuple loss).

2.
+ If the reverse range map points to an invalid TID, the corresponding page
+ range is not summarized.

3.
It might be better to mention when the range map will point to an
invalid TID; it's not explained in your proposal, yet it is used there
to explain some other things.

4.
"Range reverse map" is good terminology, but wouldn't "range translation
map" be better? I don't mind either way; it's just a thought that came
to mind while understanding the concept of the range reverse map.

5.
/*
 * As above, except that instead of scanning the complete heap, only the given
 * range is scanned.  Scan to end-of-rel can be signalled by passing
 * InvalidBlockNumber as end block number.
 */
double
IndexBuildHeapRangeScan(Relation heapRelation,
Relation indexRelation,
IndexInfo *indexInfo,
bool allow_sync,
BlockNumber start_blockno,
BlockNumber numblocks,
IndexBuildCallback callback,
void *callback_state)

In the comment you mention an end block number; which parameter does
it refer to? I could see only start_blockno and numblocks.

6.
Currently you are passing 0 as the start block and InvalidBlockNumber
as the number of blocks; what is the logic behind that?
return IndexBuildHeapRangeScan(heapRelation, indexRelation,
  indexInfo, allow_sync,
  0, InvalidBlockNumber,
  callback, callback_state);

7.
In mmbuildCallback, a tuple is added to the minmax index only if it
falls within the current page range; otherwise this can waste a lot of
the scan if the page range is large (1280 pages, as you mentioned in
one of your mails). Why can't we include it at the end of the scan?
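
To make the question concrete, here is a hedged, self-contained C sketch of the
accumulate-and-flush idea being asked about; none of these names come from the
patch, and plain int values stand in for Datums. The callback widens the
running min/max for the current range and emits one summary tuple only when the
heap scan crosses into the next page range.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t BlockNumber;

#define PAGES_PER_RANGE 128

/* stand-in for writing one summary index tuple for a page range */
static void
emit_summary_tuple(BlockNumber range, int min_val, int max_val)
{
    printf("range %u: min=%d max=%d\n", (unsigned) range, min_val, max_val);
}

typedef struct
{
    BlockNumber cur_range;      /* range currently being accumulated */
    int         min_val;
    int         max_val;
    bool        empty;          /* nothing accumulated yet */
} RangeSummaryState;

/* called once per indexed heap value, in heap block order */
static void
summary_build_callback(BlockNumber heapBlk, int value, RangeSummaryState *state)
{
    BlockNumber range = heapBlk / PAGES_PER_RANGE;

    if (!state->empty && range != state->cur_range)
    {
        /* crossed a range boundary: emit the finished summary */
        emit_summary_tuple(state->cur_range, state->min_val, state->max_val);
        state->empty = true;
    }

    if (state->empty)
    {
        state->cur_range = range;
        state->min_val = state->max_val = value;
        state->empty = false;
    }
    else
    {
        if (value < state->min_val)
            state->min_val = value;
        if (value > state->max_val)
            state->max_val = value;
    }
}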


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minmax indexes

2013-09-27 Thread Amit Kapila
On Fri, Sep 27, 2013 at 11:49 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Thu, Sep 26, 2013 at 1:46 AM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
 Amit Kapila wrote:
 On Sun, Sep 15, 2013 at 5:44 AM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:



   On Windows, patch gives below compilation errors:
   src\backend\access\minmax\mmtuple.c(96): error C2057: expected
 constant expression

 I have fixed all these compile errors (fix attached).  Thanks for
 reporting them.  I'll post a new version shortly.

Thanks for fixing it. Over the last few days I have spent some time
 reading about minmax-equivalent indexes in other databases (Netezza
 and Oracle) and going through parts of your proposal. It's a fairly
 big patch and needs much more time, but I would like to share the
 findings and thoughts I have so far.

 First, about the interface and use case: as far as I could tell,
 other databases provide this index automatically rather than through a
 separate CREATE INDEX command, perhaps because such an index is mainly
 useful when the data is ordered, or distributed in a way that benefits
 repeatedly executed queries. You have proposed it as a command, which
 means the user needs to take care of it; I find that okay for a first
 version, and later we could perhaps add optimisations so that it gets
 created automatically.
 For the page range, if I read correctly, you currently use a #define;
 do you want to expose it to the user in some way, such as a GUC, or
 keep it internal and assign the right value based on the performance
 of different queries?

 Operations on this index seem to be very fast: Oracle has it as an
 in-memory structure, and I read that in Netezza write operations don't
 carry any significant overhead for zone maps compared to other
 indexes, so shouldn't we consider leaving it un-WAL-logged?
 OTOH, because these structures get created automatically in those
 databases that might be okay there, but if we provide it as a command,
 the user might be surprised not to find it after a server restart.

 Few Questions and observations:
 1.
 + When a new heap tuple is inserted in a summarized page range, it is possible
 + to compare the existing index tuple with the new heap tuple.  If the heap
 + tuple is outside the minimum/maximum boundaries given by the index tuple for
 + any indexed column (or if the new heap tuple contains null values but the
 + index tuple indicates there are no nulls), it is necessary to create a new
 + index tuple with the new values.  To do this, a new index tuple is inserted,
 + and the reverse range map is updated to point to it.  The old index tuple is
 + left in place, for later garbage collection.


 Is there a reason why we can't update the value in place rather than
 inserting a new index tuple? As I understand it, for other indexes
 like btree we do this because we might need to roll back, but here,
 even if a rollback happens after updating the min or max value, it
 will not cause any harm (no tuple loss).

 2.
 + If the reverse range map points to an invalid TID, the corresponding page
 + range is not summarized.

 3.
 It might be better to mention when the range map will point to an
 invalid TID; it's not explained in your proposal, yet it is used there
 to explain some other things.

 4.
 "Range reverse map" is good terminology, but wouldn't "range translation
 map" be better? I don't mind either way; it's just a thought that came
 to mind while understanding the concept of the range reverse map.

 5.
 /*
  * As above, except that instead of scanning the complete heap, only the given
  * range is scanned.  Scan to end-of-rel can be signalled by passing
  * InvalidBlockNumber as end block number.
  */
 double
 IndexBuildHeapRangeScan(Relation heapRelation,
 Relation indexRelation,
 IndexInfo *indexInfo,
 bool allow_sync,
 BlockNumber start_blockno,
 BlockNumber numblocks,
 IndexBuildCallback callback,
 void *callback_state)

 In the comment you mention an end block number; which parameter does
 it refer to? I could see only start_blockno and numblocks.

 6.
 Currently you are passing 0 as the start block and InvalidBlockNumber
 as the number of blocks; what is the logic behind that?
 return IndexBuildHeapRangeScan(heapRelation, indexRelation,
   indexInfo, allow_sync,
   0, InvalidBlockNumber,
   callback, callback_state);

I got it; I think here it means to scan all the pages.

 7.
 In mmbuildCallback, a tuple is added to the minmax index only if it
 falls within the current page range; otherwise this can waste a lot of
 the scan if the page range is large (1280 pages, as you mentioned in
 one of your mails). Why can't we include it at the end of the scan?


 With Regards,
 Amit Kapila.
 EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] FW: REVIEW: Allow formatting in log_line_prefix

2013-09-27 Thread David Rowley
On Fri, Sep 27, 2013 at 10:28 AM, Robert Haas robertmh...@gmail.com wrote:

 On Thu, Sep 26, 2013 at 5:18 PM, Robert Haas robertmh...@gmail.com
 wrote:
  On Thu, Sep 26, 2013 at 3:55 PM, David Rowley dgrowle...@gmail.com
 wrote:
  I think I must have forgot to save it before I emailed it...
 
  Here's the correct version.
 
  Ah ha.  Looks better.
 
  I'm working on getting this committed now.  Aside from some stylistic
  things, I've found one serious functional bug, which is that you need
   to test padding != 0, not padding > 0, when deciding which path to
  take.  No need to send a new patch, I've already fixed it in my local
  copy...


Oops, negative padding. My bad. I was focusing too much on the benchmarks I
think.


 Committed now.  Let me know if I broke anything.


Great, thanks for committing!
Thank you Albe for your review too!

Regards

David.


 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company



Re: [HACKERS] pgbench filler columns

2013-09-27 Thread Pavan Deolasee
On Thu, Sep 26, 2013 at 7:20 PM, Noah Misch n...@leadboat.com wrote:

 On Thu, Sep 26, 2013 at 03:23:30PM +0530, Pavan Deolasee wrote:

  Should we just fix the comment and say its applicable for all tables
 except
  accounts ?

 Please do.


How about something like this? Patch attached.

Thanks,
Pavan


-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


pgbench_filler_column_notes.patch
Description: Binary data



Re: [HACKERS] Some interesting news about Linux 3.12 OOM

2013-09-27 Thread Daniel Farina
On Wed, Sep 25, 2013 at 8:00 AM, Greg Stark st...@mit.edu wrote:

 On Wed, Sep 25, 2013 at 12:15 AM, Daniel Farina dan...@heroku.com wrote:

 Enable the memcg OOM killer only for user faults, where it's really the
 only option available.


 Is this really a big deal? I would expect most faults to be user faults.

 It's certainly a big deal that we need to ensure we can handle ENOMEM from
 syscalls and library functions we weren't expecting to return it. But I
 don't expect it to actually reduce the OOM killing sprees by much.

Hmm, I see what you mean.  I have been reading through the mechanism:
I got too excited about 'allocations by system calls', because I
thought that might mean brk and friends, except that's not much of an
allocation at all, just a reservation.  I think.

There is some interesting stuff coming in along with these patches in
bringing the user-space memcg OOM handlers up to snuff that may make
it profitable to issue SIGTERM to backends when a safety margin is
crossed (too bad the error messages will be confusing in that case).
I was rather hoping that a regular ENOMEM could be injected by this
mechanism the next time a syscall is touched (unknown), but I'm not
confident if this is made easier or not, one way or another.  One
could imagine the kernel injecting such a fault when the amount of
memory being consumed starts to look hairy, but I surmise part of the
impetus for userspace handling of that is to avoid getting into that
particular heuristics game.

Anyway, I did do some extensive study of cgroups and memcg's
implementation in particular and found it not really practical for
Postgres use unless one was happy with lots and lots of database
restarts, and this work still gives me some hope to try again, even if
smaller modifications still seem necessary.




Re: [HACKERS] Wait free LW_SHARED acquisition

2013-09-27 Thread Heikki Linnakangas

On 27.09.2013 01:55, Andres Freund wrote:

Hello,

We have had several customers running postgres on bigger machines report
problems on busy systems. Most recently one where a fully cached
workload completely stalled in s_lock()s due to the *shared* lwlock
acquisition in BufferAlloc() around the buffer partition lock.

Increasing the padding to a full cacheline helps making the partitioning
of the partition space actually effective (before it's essentially
halved on a read-mostly workload), but that still leaves one with very
hot spinlocks.

So the goal is to have LWLockAcquire(LW_SHARED) never block unless
somebody else holds an exclusive lock. To produce enough appetite for
reading the rest of the long mail, here's some numbers on a
pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620

master + padding: tps = 146904.451764
master + padding + lwlock: tps = 590445.927065


How does that compare with simply increasing NUM_BUFFER_PARTITIONS?

- Heikki




Re: [HACKERS] Wait free LW_SHARED acquisition

2013-09-27 Thread Andres Freund
Hi,

On 2013-09-27 10:14:46 +0300, Heikki Linnakangas wrote:
 On 27.09.2013 01:55, Andres Freund wrote:
 We have had several customers running postgres on bigger machines report
 problems on busy systems. Most recently one where a fully cached
 workload completely stalled in s_lock()s due to the *shared* lwlock
 acquisition in BufferAlloc() around the buffer partition lock.
 
 Increasing the padding to a full cacheline helps making the partitioning
 of the partition space actually effective (before it's essentially
 halved on a read-mostly workload), but that still leaves one with very
 hot spinlocks.
 
 So the goal is to have LWLockAcquire(LW_SHARED) never block unless
 somebody else holds an exclusive lock. To produce enough appetite for
 reading the rest of the long mail, here's some numbers on a
 pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620
 
 master + padding: tps = 146904.451764
 master + padding + lwlock: tps = 590445.927065
 
 How does that compare with simply increasing NUM_BUFFER_PARTITIONS?

Heaps better. In the case causing this investigation, lots of the pages
with hot spinlocks were simply the same ones over and over again;
partitioning the lockspace won't help much there.
That's not exactly an uncommon scenario, since often enough there's a
small amount of data hit very frequently and lots more that is accessed
only infrequently. E.g. recently inserted data and such tends to be very hot.
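
For readers following along, a rough sketch (not copied from the tree) of why
this is so: the buffer tag's hash value picks one of NUM_BUFFER_PARTITIONS
mapping locks, so a handful of hot pages keep mapping to the same few locks no
matter how many partitions exist.

#include <stdint.h>

#define NUM_BUFFER_PARTITIONS 16

static unsigned int
partition_for(uint32_t buftag_hash)
{
    /* the same hot page always hashes to the same partition lock */
    return buftag_hash % NUM_BUFFER_PARTITIONS;
}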

I can run a test on the 4 socket machine if it's unused, but on my 2
socket workstation, at least in our simulation of the original
workload, the improvements were marginal after increasing the
padding to a full cacheline.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Wait free LW_SHARED acquisition

2013-09-27 Thread Andres Freund
On 2013-09-27 09:21:05 +0200, Andres Freund wrote:
  So the goal is to have LWLockAcquire(LW_SHARED) never block unless
  somebody else holds an exclusive lock. To produce enough appetite for
  reading the rest of the long mail, here's some numbers on a
  pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620
  
  master + padding: tps = 146904.451764
  master + padding + lwlock: tps = 590445.927065
  
  How does that compare with simply increasing NUM_BUFFER_PARTITIONS?
 
 Heaps better. In the case causing this investigation lots of the pages
 with hot spinlocks were simply the same ones over and over again,
 partitioning the lockspace won't help much there.
 That's not exactly an uncommon scenario since often enough there's a
 small amount of data hit very frequently and lots more that is accessed
 only infrequently. E.g. recently inserted data and such tends to be very hot.
 
 I can run a test on the 4 socket machine if it's unused, but on my 2
 socket workstation, at least in our simulation of the original
 workload, the improvements were marginal after increasing the
 padding to a full cacheline.

Ok, was free:

padding + 16 partitions:
tps = 147884.648416

padding + 32 partitions:
tps = 141777.841125

padding + 64 partitions:
tps = 141561.539790

padding + 16 partitions + new lwlocks
tps = 601895.580903 (yeah, still reproduces after some sleep!)


Now, the other numbers were best-of-three and these aren't, but I think
it's pretty clear that you're not going to see the same benefits. I am
not surprised...
The current implementation of lwlocks will frequently block others, both
during acquisition and release of locks. What's even worse, fruitlessly
trying to acquire a spinlock will often prevent releasing it, because
we need the spinlock during release.
With the proposed algorithm, even if we need the spinlock during release
of an lwlock because there are queued PGPROCs, we will acquire that
spinlock only after already having released the lock...
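
For illustration, a minimal C sketch of the shared-acquire fast path being
described, written with GCC __sync builtins; this is a sketch of the idea, not
the patch's code, and the names are invented. The lock state is one word: a
high bit marks an exclusive holder, the low bits count shared holders.

#include <stdbool.h>
#include <stdint.h>

#define EXCLUSIVE_FLAG  ((uint32_t) 1 << 31)

typedef struct { volatile uint32_t state; } lw_lock_sketch;

static bool
lw_try_shared(lw_lock_sketch *lock)
{
    /* optimistic: bump the shared count without taking any spinlock */
    uint32_t old = __sync_fetch_and_add(&lock->state, 1);

    if (old & EXCLUSIVE_FLAG)
    {
        /* an exclusive holder beat us; undo and fall back to queueing */
        __sync_fetch_and_sub(&lock->state, 1);
        return false;
    }
    return true;                /* shared lock acquired without blocking */
}

static void
lw_release_shared(lw_lock_sketch *lock)
{
    __sync_fetch_and_sub(&lock->state, 1);
    /* only now, with the lock already released, would a spinlock be taken
     * to deal with queued waiters */
}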

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Patch for fail-back without fresh backup

2013-09-27 Thread Sawada Masahiko
On Thu, Sep 26, 2013 at 8:54 PM, Pavan Deolasee
pavan.deola...@gmail.com wrote:

 On Thu, Sep 19, 2013 at 4:02 PM, Fujii Masao masao.fu...@gmail.com wrote:



 Hmm... when synchronous_transfer is set to data_flush,
 IMO the intuitive behaviors are

 (1) synchronous_commit = on
 A data flush should wait for the corresponding WAL to be
 flushed in the standby

 (2) synchronous_commit = remote_write
 A data flush should wait for the corresponding WAL to be
 written to OS in the standby.

 (3) synchronous_commit = local
 (4) synchronous_commit = off
 A data flush should wait for the corresponding WAL to be
 written locally in the master.


 I thought synchronous_commit and synchronous_transfer are kind of orthogonal
 to each other. synchronous_commit only controls whether and how to wait for
 the standby only when a transaction commits. synchronous_transfer OTOH tells
 how to interpret the standby listed in synchronous_standbys parameter. If
 set to commit then they are synchronous standbys (like today). If set to
 data_flush, they are asynchronous failback safe standby and if set to
 all then they are synchronous failback safe standbys. Well, its confusing
 :-(

 So IMHO in the current state of things, the synchronous_transfer GUC can not
 be changed at a session/transaction level since all backends, including
 background workers must honor the settings to guarantee failback safety.
 synchronous_commit still works the same way, but is ignored if
 synchronous_transfer is set to data_flush because that effectively tells
 us that the standbys listed under synchronous_standbys are really *async*
 standbys with failback safety.


Thank you for the comment. I think it is a good, simple idea.
In your opinion, if synchronous_transfer is set to 'all' and
synchronous_commit is set to 'on', the master waits for the data flush
even if the user sets synchronous_commit to 'local' or 'off'.
For example, when a user wants a transaction to complete quickly, the
user can't do that. Do we leave such a situation as a constraint?

Regards,

---
Sawada Masahiko




Re: [HACKERS] Wait free LW_SHARED acquisition

2013-09-27 Thread Heikki Linnakangas

On 27.09.2013 10:21, Andres Freund wrote:

Hi,

On 2013-09-27 10:14:46 +0300, Heikki Linnakangas wrote:

On 27.09.2013 01:55, Andres Freund wrote:

We have had several customers running postgres on bigger machines report
problems on busy systems. Most recently one where a fully cached
workload completely stalled in s_lock()s due to the *shared* lwlock
acquisition in BufferAlloc() around the buffer partition lock.

Increasing the padding to a full cacheline helps making the partitioning
of the partition space actually effective (before it's essentially
halved on a read-mostly workload), but that still leaves one with very
hot spinlocks.

So the goal is to have LWLockAcquire(LW_SHARED) never block unless
somebody else holds an exclusive lock. To produce enough appetite for
reading the rest of the long mail, here's some numbers on a
pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620

master + padding: tps = 146904.451764
master + padding + lwlock: tps = 590445.927065


How does that compare with simply increasing NUM_BUFFER_PARTITIONS?


Heaps better. In the case causing this investigation lots of the pages
with hot spinlocks were simply the same ones over and over again,
partitioning the lockspace won't help much there.
That's not exactly an uncommon scenario since often enough there's a
small amount of data hit very frequently and lots more that is accessed
only infrequently. E.g. recently inserted data and such tends to be very hot.


I see. So if only a few buffers are really hot, I'm assuming the problem 
isn't just the buffer partition lock, but also the spinlock on the 
buffer header, and the buffer content lwlock. Yeah, improving LWLocks 
would be a nice wholesale solution to that. I don't see any fundamental 
flaw in your algorithm. Nevertheless, I'm going to throw in a couple of 
other ideas:


* Keep a small 4-5 entry cache of buffer lookups in each backend of most 
recently accessed buffers. Before searching for a buffer in the 
SharedBufHash, check the local cache.


* To pin a buffer, use an atomic fetch-and-add instruction to increase 
the refcount. PinBuffer() also increases usage_count, but you could do 
that without holding a lock; it doesn't need to be accurate.
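
A minimal sketch of that pinning idea, again with GCC builtins and made-up
structure and field names (this is not bufmgr code): the refcount is bumped
with an atomic add, while the usage count is bumped with a plain unlocked
store because it only needs to be approximate.

#include <stdint.h>

#define MAX_USAGE_COUNT 5

typedef struct
{
    volatile uint32_t refcount;     /* pins held by backends */
    volatile uint32_t usage_count;  /* clock-sweep popularity, approximate */
} buf_header_sketch;

static void
pin_buffer(buf_header_sketch *buf)
{
    __sync_fetch_and_add(&buf->refcount, 1);    /* no buffer-header spinlock */

    if (buf->usage_count < MAX_USAGE_COUNT)
        buf->usage_count++;                     /* racy on purpose; accuracy not required */
}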



One problem with your patch is going to be to make it also work without 
the CAS and fetch-and-add instructions. Those are probably present in 
all the architectures we support, but it'll take some effort to get the 
architecture-specific code done. Until it's all done, it would be good 
to be able to fall back to plain spinlocks, which we already have. Also, 
when someone ports PostgreSQL to a new architecture in the future, it 
would be helpful if you wouldn't need to write all the 
architecture-specific code immediately to get it to compile.


Did you benchmark your patch against the compare-and-swap patch I posted 
earlier? 
(http://www.postgresql.org/message-id/519a3587.80...@vmware.com). Just 
on a theoretical level, I would assume your patch to scale better since 
it uses fetch-and-add instead of compare-and-swap for acquiring a shared 
lock. But in practice it might be a wash.


- Heikki




Re: [HACKERS] Wait free LW_SHARED acquisition

2013-09-27 Thread Andres Freund
On 2013-09-27 09:57:07 +0200, Andres Freund wrote:
 On 2013-09-27 09:21:05 +0200, Andres Freund wrote:
   So the goal is to have LWLockAcquire(LW_SHARED) never block unless
   somebody else holds an exclusive lock. To produce enough appetite for
   reading the rest of the long mail, here's some numbers on a
   pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620
   
   master + padding: tps = 146904.451764
   master + padding + lwlock: tps = 590445.927065
   
   How does that compare with simply increasing NUM_BUFFER_PARTITIONS?
  
  Heaps better. In the case causing this investigation lots of the pages
  with hot spinlocks were simply the same ones over and over again,
  partitioning the lockspace won't help much there.
  That's not exactly an uncommon scenario since often enough there's a
  small amount of data hit very frequently and lots more that is accessed
  only infrequently. E.g. recently inserted data and such tends to be very 
  hot.
  
  I can run a test on the 4 socket machine if it's unused, but on my 2
  socket workstation, at least in our simulation of the original
  workload, the improvements were marginal after increasing the
  padding to a full cacheline.
 
 Ok, was free:
 
 padding + 16 partitions:
 tps = 147884.648416
 
 padding + 32 partitions:
 tps = 141777.841125
 
 padding + 64 partitions:
 tps = 141561.539790
 
 padding + 16 partitions + new lwlocks
 tps = 601895.580903 (yeah, still reproduces after some sleep!)

Pgbench numbers for writes on the machine (fsync = off,
synchronous_commit = off):
padding + 16 partitions:
tps = 8903.532163
vs
padding + 16 partitions + new lwlocks
tps = 13324.080491

So, on bigger machines the advantages seem to be there for writes as
well...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Patch for fail-back without fresh backup

2013-09-27 Thread Pavan Deolasee
On Fri, Sep 27, 2013 at 1:28 PM, Sawada Masahiko sawada.m...@gmail.comwrote:


 

 Thank you for the comment. I think it is a good, simple idea.
 In your opinion, if synchronous_transfer is set to 'all' and
 synchronous_commit is set to 'on', the master waits for the data flush
 even if the user sets synchronous_commit to 'local' or 'off'.
 For example, when a user wants a transaction to complete quickly, the
 user can't do that. Do we leave such a situation as a constraint?


No, user can still override the transaction commit point wait. So if

synchronous_transfer is set to all:
 - If synchronous_commit is ON - wait at all points
 - If synchronous_commit is OFF - wait only at buffer flush (and other
related to failback safety) points

synchronous_transfer is set to data_flush:
 - If synchronous_commit is either ON or OFF - do not wait at commit points,
but wait at all other points

synchronous_transfer is set to commit:
 - If synchronous_commit is ON - wait at commit point
 - If synchronous_commit is OFF - do not wait at any point

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


Re: [HACKERS] Wait free LW_SHARED acquisition

2013-09-27 Thread Andres Freund
On 2013-09-27 11:11:56 +0300, Heikki Linnakangas wrote:
 On 27.09.2013 10:21, Andres Freund wrote:
 Heaps better. In the case causing this investigation lots of the pages
 with hot spinlocks were simply the same ones over and over again,
 partitioning the lockspace won't help much there.
 That's not exactly an uncommon scenario since often enough there's a
 small amount of data hit very frequently and lots more that is accessed
 only infrequently. E.g. recently inserted data and such tends to be very hot.

 I see. So if only a few buffers are really hot, I'm assuming the problem
 isn't just the buffer partition lock, but also the spinlock on the buffer
 header, and the buffer content lwlock. Yeah, improving LWLocks would be a
 nice wholesale solution to that. I don't see any fundamental flaw in your
 algorithm. Nevertheless, I'm going to throw in a couple of other ideas:

 * Keep a small 4-5 entry cache of buffer lookups in each backend of most
 recently accessed buffers. Before searching for a buffer in the
 SharedBufHash, check the local cache.

I thought about that as well, but you'd either need to revalidate after
pinning the buffers, or keep them pinned.
I had a very hacky implementation of that, but it just made the buffer
content locks the top profile spots. Similar issue there.

It might be worthwhile to do this nonetheless - a lock xadd certainly
isn't cheap, even though it's cheaper than a full spinlock - but it's not
trivial.

 * To pin a buffer, use an atomic fetch-and-add instruction to increase the
 refcount. PinBuffer() also increases usage_count, but you could do that
 without holding a lock; it doesn't need to be accurate.

Yes, Pin/UnpinBuffer() are the primary contention points after this
patch. I want to tackle them, but that seems like a separate thing to
do.

I think we should be able to get rid of most or even all LockBufHdr()
calls by
a) introducing PinBufferInternal(), which increases pins but not the
usage count, using an atomic increment. That can replace locking the
header in many cases.
b) making PinBuffer() increment the pin and usage count using a single
64bit atomic add if available, and fitting flags in there as well. Then
back off the usage count if it's too high or even wraps around; that
doesn't hurt much, since we're pinned at that moment.

 One problem with your patch is going to be to make it also work without the
 CAS and fetch-and-add instructions. Those are probably present in all the
 architectures we support, but it'll take some effort to get the
 architecture-specific code done. Until it's all done, it would be good to be
 able to fall back to plain spinlocks, which we already have. Also, when
 someone ports PostgreSQL to a new architecture in the future, it would be
 helpful if you wouldn't need to write all the architecture-specific code
 immediately to get it to compile.

I think most recent compilers have intrinsics for operations like
that. I quite like the API provided by gcc for this kind of thing, and
I think we should model an internal wrapper API similarly. I don't see
any new platform coming that has a compiler without intrinsics.

But yes, you're right, I think we need a spinlock-based fallback for
now, even if it's just because none of us can verify the
implementations on the more obsolete platforms we claim to support. I
just didn't see it as a priority in the PoC.
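
A very rough illustration of that wrapper-plus-fallback idea: one
fetch-and-add-style function that maps to the compiler intrinsic where
available and to a spinlock-protected read-modify-write otherwise. The names,
the HAVE_GCC_ATOMICS test, and the spin_lock()/spin_unlock() prototypes are
placeholders, not a proposal for the actual API.

#include <stdint.h>

#ifdef HAVE_GCC_ATOMICS

static inline uint32_t
atomic_fetch_add_u32(volatile uint32_t *ptr, uint32_t add)
{
    return __sync_fetch_and_add(ptr, add);
}

#else   /* fallback: emulate the atomic op under a spinlock */

extern void spin_lock(volatile int *lock);      /* placeholder s_lock-style primitives */
extern void spin_unlock(volatile int *lock);

static volatile int atomic_emulation_lock;

static inline uint32_t
atomic_fetch_add_u32(volatile uint32_t *ptr, uint32_t add)
{
    uint32_t old;

    spin_lock(&atomic_emulation_lock);
    old = *ptr;
    *ptr = old + add;
    spin_unlock(&atomic_emulation_lock);
    return old;
}

#endif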

 Did you benchmark your patch against the compare-and-swap patch I posted
 earlier? (http://www.postgresql.org/message-id/519a3587.80...@vmware.com).
 Just on a theoretical level, I would assume your patch to scale better since
 it uses fetch-and-add instead of compare-and-swap for acquiring a shared
 lock. But in practice it might be a wash.

I've tried compare-and-swap for shared acquisition and it performed
worse, didn't try your patch though as you seemed to have concluded it's
a wash after doing the unlocked test.

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Patch for fast gin cache performance improvement

2013-09-27 Thread Etsuro Fujita
I wrote:
 I had a look over this patch.  I think this patch is interesting and very
useful.
 Here are my review points:

 8. I think there are no issues in this patch.  However, I have one question:
 how does this patch work in the case where gin_fast_limit/fast_cache_size = 0?
 In this case, in my understanding, this patch inserts new entries into the
 pending list temporarily and immediately moves them to the main GIN data
 structure using ginInsertCleanup().  Am I right?  If so, that is obviously
 inefficient.

Sorry, those expressions were incorrect.  I meant gin_fast_limit > 0 and
fast_cache_size = 0.

Although I asked this question, I have reconsidered these parameters, and it
seems that they not only make the code rather complex but are also a little
confusing to users.  So I'd like to propose introducing only one parameter:
fast_cache_size.  Users that give weight to update performance (the fast
update technique) would set this parameter to a large value, while users that
give weight not only to update performance but also to search performance
would set it to a small value.  What do you think about this?

Thanks,

Best regards,
Etsuro Fujita






Re: [HACKERS] insert throw error when year field len > 4 for timestamptz datatype

2013-09-27 Thread Rushabh Lathia
Sorry for delay in reply.



On Tue, Sep 17, 2013 at 6:23 PM, Haribabu kommi
haribabu.ko...@huawei.comwrote:

  On Tue, 17 September 2013 14:33 Rushabh Lathia wrote:

 On Mon, Sep 16, 2013 at 7:22 PM, Haribabu kommi 
 haribabu.ko...@huawei.com wrote:

  On 14 August 2013 Rushabh Lathia wrote:

  postgres=# create table test ( a timestamptz);

 CREATE TABLE

  -- Date with year 1000

 postgres=#  insert into test values ( 'Sat Mar 11 23:58:48 1000 IST');

 INSERT 0 1

  -- Now try with year 1 it will return error

 postgres=#  insert into test values ( 'Sat Mar 11 23:58:48 1 IST');
 

 ERROR:  invalid input syntax for type timestamp with time zone: Sat
 Mar 11 23:58:48 1 IST 

 LINE 1: insert into test values ( 'Sat Mar 11 23:58:48 1 IST');

 Patch applies cleanly to HEAD. This patch tries to improve the insertion of
 dates whose year value is more than 4 digits in length.

 But it didn't handle all the ways to insert a year field of more than 4
 digits. Please check the following test.


  postgres=# insert into test values ('10001010 10:10:10 IST');

 INSERT 0 1

 postgres=# insert into test values ('100011010 10:10:10 IST');

 ERROR:  invalid input syntax for type timestamp with time zone:
 100011010 10:10:10 IST at character 26

 STATEMENT:  insert into test values ('100011010 10:10:10 IST');

 ERROR:  invalid input syntax for type timestamp with time zone:
 100011010 10:10:10 IST

 LINE 1: insert into test values ('100011010 10:10:10 IST');

 ^

  I feel it is better to provide the functionality of inserting a year
 field of more than 4 digits in all flows.


 +1. Nice catch.


 Here is the latest version of the patch, which handles the functionality in
 all flows.

 Could you test it and share your comments?


 I am getting some other failures with the updated patch also, please check
 the following tests.


 select date 'January 8, 19990';

 select timestamptz 'January 8, 199910 01:01:01 IST';

 INSERT INTO TIMESTAMPTZ_TST VALUES(4, '10001 SAT 8 MAR 10:10:10 IST');


 you can get the test scripts from regress test files of date.sql,
 timetz.sql, timestamp.sql and timestamptz.sql

 and modify according to the patch for verification.


 I feel changing the year value to accept a length (> 4) is not simple.

 In many places, a year field of length more than 4 is not considered.

 Search in the code with “” and correct all related paths.


Right, changing the year value to accept a length (> 4) is not simple, because
in many places a year length of more than 4 is not handled, and most of the
places assume that it will always be 4.

I tried to fix the issue in a couple more places, but I don't feel it is always
going to be safe to assume that we have covered all paths.

Still looking, and wondering whether we can make the change in one simple place
or find some other, smarter way to fix the issue.
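
For illustration only (this is not datetime.c code, and the helper name is
invented): one way a concatenated date field can be split from the right, so
that the day and month stay fixed-width while the year is allowed to be longer
than four digits, which is exactly the assumption the existing paths bake in.

#include <stdlib.h>
#include <string.h>

/* parse "YYYY...MMDD" into its parts; digit validation omitted for brevity */
static int
parse_concatenated_date(const char *field, int *year, int *month, int *day)
{
    size_t len = strlen(field);
    char   yearbuf[16];

    if (len < 6 || len - 4 >= sizeof(yearbuf))
        return -1;

    *day   = (field[len - 2] - '0') * 10 + (field[len - 1] - '0');
    *month = (field[len - 4] - '0') * 10 + (field[len - 3] - '0');

    memcpy(yearbuf, field, len - 4);
    yearbuf[len - 4] = '\0';
    *year = atoi(yearbuf);              /* year may be more than four digits */
    return 0;
}

With this kind of right-anchored split, '10001010' parses as year 1000 and
'100011010' as year 10001, matching the two test inserts shown earlier.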



 


 Regards,

 Hari babu.





-- 
Rushabh Lathia


Re: [HACKERS] Patch for fail-back without fresh backup

2013-09-27 Thread Sawada Masahiko
On Fri, Sep 27, 2013 at 5:18 PM, Pavan Deolasee
pavan.deola...@gmail.com wrote:
 On Fri, Sep 27, 2013 at 1:28 PM, Sawada Masahiko sawada.m...@gmail.com
 wrote:


 

 Thank you for the comment. I think it is a good, simple idea.
 In your opinion, if synchronous_transfer is set to 'all' and
 synchronous_commit is set to 'on', the master waits for the data flush
 even if the user sets synchronous_commit to 'local' or 'off'.
 For example, when a user wants a transaction to complete quickly, the
 user can't do that. Do we leave such a situation as a constraint?


 No, user can still override the transaction commit point wait. So if

 synchronous_transfer is set to all:
  - If synchronous_commit is ON - wait at all points
  - If synchronous_commit is OFF - wait only at buffer flush (and other
 related to failback safety) points

 synchronous_transfer is set to data_flush:
  - If synchronous_commit is either ON or OFF - do not wait at commit points,
 but wait at all other points

 synchronous_transfer is set to commit:
  - If synchronous_commit is ON - wait at commit point
  - If synchronous_commit is OFF - do not wait at any point


Thank you for the explanation. Understood.
If synchronous_transfer is set to 'all' and the user changes
synchronous_commit to 'off' (or 'local') in a transaction, the master
server waits at buffer flush but doesn't wait at commit points. Right?

In the current patch, synchronous_transfer works in cooperation with
synchronous_commit.
But if the user changes synchronous_commit in a transaction, they are no
longer in cooperation.
So your idea might be better than the current behaviour of
synchronous_transfer.

Regards,

---
Sawada Masahiko




Re: [HACKERS] insert throw error when year field len > 4 for timestamptz datatype

2013-09-27 Thread Haribabu kommi
On 27 September 2013 15:04 Rushabh Lathia wrote:
On Tue, Sep 17, 2013 at 6:23 PM, Haribabu kommi 
haribabu.ko...@huawei.com wrote:
I feel changing the year value to accept a length (> 4) is not simple.
In many places, a year field of length more than 4 is not considered.
Search in the code with  and correct all related paths.

Right, changing the year value to accept a length (> 4) is not simple, because
in many places a year length of more than 4 is not handled, and most of the
places assume that it will always be 4.

I tried to fix the issue in a couple more places, but I don't feel it is always
going to be safe to assume that we have covered all paths.

Still looking, and wondering whether we can make the change in one simple place
or find some other, smarter way to fix the issue.

If the changes needed to deal with all scenarios are very extensive,
I feel it is better to do it only in the scenarios where the use cases need it,
as long as it does not confuse users.
The rest can be documented.
Any other opinions/suggestions welcome.

Regards,
Hari babu.


Re: [HACKERS] backup.sgml-cmd-v003.patch

2013-09-27 Thread Robert Haas
On Thu, Sep 26, 2013 at 1:15 PM, Ivan Lezhnjov IV
i...@commandprompt.com wrote:
 Thanks for a detailed response. I attached a patch file that builds on your 
 corrections and introduces some of the edits discussed above.

This patch makes at least five unrelated sets of changes:

1. Attempting to encourage people to consider custom format dumps.
2. Suggesting that superuser access isn't necessary to dump the database.
3. Adding a note that OIDs on user objects are deprecated.
4. Describing the manner in which pg_dump can be used with pg_dumpall
to back up all databases in a cluster.
5. Promoting pigz.

It's not a good idea to bundle so many unrelated changes together into
a single patch, because if there is disagreement about any item on the
list, then the patch doesn't go in.  Moreover, we typically try to
commit one change at a time, so this leaves the eventual committer
with the unenviable task of splitting up the patch for commit.

I think that #3 and #5 are things we shouldn't do at
all.  There are many compression utilities in the world other than
gzip, and everyone has their favorites.  I see no reason why we should
promote one over the other.  Certainly, the fact that you yourself
have found pigz useful is not sufficient evidence that everyone should
prefer it.  And, while it's true that OIDs on user objects are
deprecated, I don't think the pg_dump documentation necessarily needs
to recapitulate this point particularly; hopefully, it's documented
elsewhere; if not, this isn't the right place to mention it.  If,
contrary to my feelings on the matter, we do decide to make a change
in either of these areas, it's unrelated to the rest of this patch and
needs to be submitted and discussed separately.

I think that #2 is probably a worthwhile change in some form.
However, I don't particularly agree with the way you've worded it
here.  This section is trying to give a general overview of how to
back up your whole database; it's not the pg_dump reference page,
which exists elsewhere.  I would suggest that the relevant caveat here
is not so much that you might have a non-superuser account that just
happens to have enough privileges to back up the entire database, but
rather that we shouldn't give people the impression that the only
possible use of pg_dump is as for a full backup.  Because in my
experience, a full-database backup almost always DOES require
superuser privileges; it's the selective-dump case where you can often
get by without them.  I've attached a proposed patch along these lines
for your consideration.

The remaining changes (#1 and #4) can probably be blended together in
a single patch.  However, I think that you need to make more of an
effort to use neutral wording.  The purpose of the documentation is
not to pass judgement on the tools.  Examples:

+   The above example creates a plain text file.  This type of dump can be used
+   to restore a database in full. However there are more sophisticated
+   productnamePostgreSQL/ backup formats which allow for far greater
+   control when working with backups.  One of these is
+   the quotecustom/quote format, which the following more elaborate
+   example creates:

I don't find it helpful to classify the other formats as more
sophisticated or to say that they allow far greater control, or to
call the example elaborate.  What's important is what you can do, and
there's basically one thing: selective restores.  So I think you could
replace all that, the following example, and the paragraph after that
with: The above example creates a plain text file.  pg_dump can also
create backups in other formats, such as the custom format.  This may
be advantageous, since all formats other than the plain text file
format make it possible to selectively restore objects from the dump.
See xref linkend=app-pgrestore for more details.

+   para
+Unfortunately, applicationpg_dumpall/ can only create plaintext
+backups. However, it is currently the only way to backup the globals in you

Similarly, let's not label pg_dumpall's behavior as unfortunate.  I
think it's appropriate to explain the pg_dumpall -g / pg_dump -Fc
combination somewhere in the documentation, as it is widely used and
very useful.  But I don't think it's useful to imply to users that the
tools are inadequate; generally, they're not, improvements that we'd
like to make nonwithstanding.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index ccb76d8..54681a0 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -47,8 +47,13 @@ pg_dump replaceable class=parameterdbname/replaceable gt; replaceable cl
that you can perform this backup procedure from any remote host that has
access to the database. But remember that applicationpg_dump/
does not operate with special permissions. In particular, it must
-   have read access to all tables that 

Re: [HACKERS] Extra functionality to createuser

2013-09-27 Thread Robert Haas
On Thu, Sep 26, 2013 at 1:04 PM, Christopher Browne cbbro...@gmail.com wrote:
 Sitting on my todo list for a while has been to consider the idea of
 adding a bit of additional functionality to createuser.

 One of the functions of CREATE ROLE is to associate the role with
 other roles, thus...

create role my_new_user nosuperuser nocreatedb login
 IN ROLE app_readonly_role, app2_writer_role;

 That isn't something that I can do using createuser; to do that, I
 would need to submit two requests separately:

PGUSER=postgres createuser -D -S -l my_new_user
PGUSER=postgres psql -c "grant app_readonly_role, app2_writer_role
 to my_new_user;"

 I could certainly change over to using psql to do all the work, but it
 would be rather nice if createuser had (say) a -g option which
 allowed specifying the set of roles that should be assigned.

 Thus, the above commands might be replaced by:
PGUSER=postgres createuser -D -S -l -g
 app_readonly_role,app2_writer_role my_new_user

 Would this be worth adding to the ToDo list?

I'd be inclined to favor a patch implementing this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: Request for Patch Feedback: Lag Lead Window Functions Can Ignore Nulls

2013-09-27 Thread Robert Haas
On Thu, Sep 26, 2013 at 4:20 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 But how do we ensure that the BMS is allocated in a context?  You'd have
 to switch contexts each time you call bms_add_member.  I don't have a
 good answer to this.

The coding of bms_add_member is pretty funky.  Why doesn't it just
repalloc() the input argument if it's not big enough?  If it did that,
the new allocation would be in the same memory context as the original
one, and we'd not need to care about the memory context when adding an
element to an already non-empty set.
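
To spell out the repalloc() approach being suggested, here is a sketch; it
assumes it would live in bitmapset.c, where Bitmapset, bitmapword,
BITS_PER_BITMAPWORD, BITMAPSET_SIZE(), bms_make_singleton() and repalloc() are
visible, and it leaves out the negative-argument check the real function does.
Because repalloc() keeps the allocation in its original memory context, the
caller no longer needs to worry about which context is current when adding a
member to an existing set.

Bitmapset *
bms_add_member_sketch(Bitmapset *a, int x)
{
    int wordnum = x / BITS_PER_BITMAPWORD;

    if (a == NULL)
        return bms_make_singleton(x);

    if (wordnum >= a->nwords)
    {
        int oldnwords = a->nwords;

        /* grow in place: the enlarged set stays in its original context */
        a = (Bitmapset *) repalloc(a, BITMAPSET_SIZE(wordnum + 1));
        a->nwords = wordnum + 1;
        memset(&a->words[oldnwords], 0,
               (a->nwords - oldnwords) * sizeof(bitmapword));
    }

    a->words[wordnum] |= ((bitmapword) 1 << (x % BITS_PER_BITMAPWORD));
    return a;
}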

But even if we did decide to switch memory contexts on every call, it
would still be much cleaner than this.  It's only three lines of code
(and about as many instructions) to save and restore
CurrentMemoryContext.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2013-09-27 Thread Robert Haas
On Thu, Sep 26, 2013 at 11:58 PM, Peter Geoghegan p...@heroku.com wrote:
 On Thu, Sep 26, 2013 at 12:15 PM, Robert Haas robertmh...@gmail.com wrote:
 Well, I think we can rule out value locks that are held for the
 duration of a transaction right away. That's just not going to fly.

 I think I agree with that.  I don't think I remember hearing that proposed.

 I think I might have been unclear - I mean locks that are held for the
 duration of *another* transaction, not our own, as we wait for that
 other transaction to commit/abort. I think that earlier remarks from
 yourself and Andres implied that this would be necessary. Perhaps I'm
 mistaken. Your most recent design proposal doesn't do this, but I
 think that that's only because it restricts the user to a single
 unique index - it would otherwise be necessary to sit on the earlier
 value locks (index tuples belonging to an unfinished transaction)
 pending the completion of some other conflicting transaction, which
 has numerous disadvantages (as described in my it suits my purposes
 to have the value locks be held for only an instant mail to you [1]).

OK, now I understand what you are saying.  I don't think I agree with it.

 I don't have another idea either. In fact, I'd go so far as to say
 that doing any third thing that's better than those two to any
 reasonable person is obviously impossible. But I'd add that we simple
 cannot rollback at read committed, so we're just going to have to hold
 our collective noses and do strange things with visibility.

I don't accept that as a general principle.  We're writing the code;
we can make it behave any way we think best.

 This is something that I haven't given remotely enough thought yet, so
 please take it with a big grain of salt.

I doubt that any change to HeapTupleSatisfiesMVCC() will be
acceptable.  This feature needs to restrain itself to behavior changes
that only affect users of this feature, I think.

 There is certainly value in considering that, and you're right to take
 that tact - it is generally valuable to have a patch be minimally
 invasive. However, ultimately that's just one aspect of any given
 design, an aspect that needs to be weighed against others where there
 is a tension. Obviously in this instance I believe, rightly or
 wrongly, that doing more - adding more infrastructure than might be
 considered strictly necessary - is the least worst thing. Also,
 sometimes the apparent similarity of a design to what we have today is
 illusory - certainly, I think you'd at least agree that the problems
 that bloating during the interim between value locking and row locking
 present are qualitatively different to other problems that bloat
 presents in all existing scenarios.

TBH, no, I don't think I agree with that.  See further below.

 Let me try and explain myself better, with reference to a concrete
 example. Suppose we have a table with a primary key column, A, and a
 unique constraint column, B, and we lock the pk value first and the
 unique constraint value second. I'm assuming your design, but allowing
 for multiple unique indexes because I don't think doing anything less
 will be accepted - promise tuples have some of the same problems, as
 well as some other idiosyncratic ones (see my earlier remarks on
 recovery/freezing [2] for examples of those).

OK, so far I'm right with you.

 So there is a fairly high probability that the pk value on A will be
 unique, and a fairly low probability that the unique constraint value
 on B will be unique, at least in this usage pattern of interest, where
 the user is mostly going to end up updating. Mostly, we insert a
 speculative regular index tuple (that points to a speculative heap
 tuple that we might decide to kill) into the pk column, A, right away,
 and then maybe block pending the resolution of a conflicting
 transaction on the unique constraint column B. I don't think we have
 any reasonable way of not blocking on A - if we go clean it up for the
 wait, that's going to bloat quite dramatically, *and* we have to WAL
 log. In any case you seemed to accept that cleaning up bloat
 synchronously like that was just going to be too expensive. So I
 suppose that rules that out. That just leaves sitting on the value
 lock (that the pk index tuple already inserted effectively is)
 indefinitely, pending the outcome of the first transaction.

Agreed.

 What are the consequences of sitting on that value lock indefinitely?
 Well, xacts are going to block on the pk value much more frequently,
 by simple virtue of the fact that the value locks there are held for a
 long time - they just needed to hear a no answer, which the unique
 constraint was in most cases happy to immediately give, so this is
 totally unnecessary. Contention is now in a certain sense almost as
 bad for every unique index as it is for the weakest link. That's only
 where the problems begin, though, and it isn't necessary for there to
 be bad contention on more than one unique 

Re: [HACKERS] Wait free LW_SHARED acquisition

2013-09-27 Thread Florian Pflug
On Sep27, 2013, at 00:55 , Andres Freund and...@2ndquadrant.com wrote:
 So the goal is to have LWLockAcquire(LW_SHARED) never block unless
 somebody else holds an exclusive lock. To produce enough appetite for
 reading the rest of the long mail, here's some numbers on a
 pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620
 
 master + padding: tps = 146904.451764
 master + padding + lwlock: tps = 590445.927065
 
 That's rougly 400%.

Interesting. I played with pretty much the same idea two years or so ago.
At the time, I compared a few different LWLock implementations. Those
were AFAIR

  A) Vanilla LWLocks
  B) A + an atomic-increment fast path, very similar to your proposal
  C) B but with a partitioned atomic-increment counter to further
 reduce cache-line contention
  D) A with the spinlock-based queue replaced by a lockless queue

At the time, the improvements seemed to be negligible - they looked great
in synthetic benchmarks of just the locking code, but didn't translate
to improved TPS numbers. Though I think the only version that ever got
tested on more than a handful of cores was C…

My (rather hacked together) benchmarking code can be found here: 
https://github.com/fgp/lockbench.
The different LWLock implementations live in the various pg_lwlock_* subfolders.

Here's a pointer into the relevant thread: 
http://www.postgresql.org/message-id/651002c1-2ec1-4731-9b29-99217cb36...@phlo.org

best regards,
Florian Pflug





Re: [HACKERS] backup.sgml-cmd-v003.patch

2013-09-27 Thread Karl O. Pinc
Hi Robert,

On 09/27/2013 05:56:52 AM, Robert Haas wrote:

 1. Attempting to encourage people to consider custom format dumps.
  
 What's important is what you can do...

Your critique seems obvious in retrospect.  Sorry you had
to step in here and do my job.  The above point is particularly
salient.  I will try to keep it in mind in the future.

Karl k...@meme.com
Free Software:  You don't pay back, you pay forward.
 -- Robert A. Heinlein




Re: [HACKERS] logical changeset generation v6.1

2013-09-27 Thread Thom Brown
On 27 September 2013 16:14, Andres Freund and...@2ndquadrant.com wrote:
 Hi,

 Attached you can find an updated version of the series taking in some of
 the review comments (the others are queued, not ignored), including:
 * split off things from the big Introduce wal decoding via ... patch
 * fix the bug Steve noticed where CreateLogicalDecodingContext was passed
   the wrong is_init = false where it should have been true
 * A number of smaller bugs I noticed while reviewing
 * Renaming of some variables, including guaranteedlyLogged ;)
 * Comment improvements in decode.c
 * rename pg_receivellog to pg_recvlogical

 I'll work more on the other points in the next days, so far they are
 clear of other big stuff.


 0001 wal_decoding: Allow walsender's to connect to a specific database
 - as before

 0002 wal_decoding: Log xl_running_xact's at a higher frequency than 
 checkpoints are done
 - as before

 0003 wal_decoding: Add information about a tables primary key to struct 
 RelationData
 - as before

 0004 wal_decoding: Add wal_level = logical and log data required for logical 
 decoding
 - splitof patch that contains the wal format changes including the
   addition of a new wal_level option

 0005 wal_decoding: Add option to treat additional tables as catalog tables
 - Option to treat user defined table as a catalog table which means it
   can be accessed during logical decoding from an output plugin

 0006 wal_decoding: Introduce wal decoding via catalog timetravel
 - The guts of changeset extraction, without a user interface

 0007 wal_decoding: logical changeset extraction walsender interface
 - splitof patch containing the walsender changes, which allow to receive
   the changeset data in a streaming fashion, supporting sync rep and
   such fancy things

 0008 wal_decoding: Only peg the xmin horizon for catalog tables during 
 logical decoding
 - split-off optimization which reduces the pain 06 introduces by pegging
   the xmin horizon to the smallest of the logical decoding slots. Now
    it's pegged differently for data tables than for catalog tables

 0009 wal_decoding: test_decoding: Add a simple decoding module in contrib
 - Example output plugin which is also used in tests

 0010 wal_decoding: pg_recvlogical: Introduce pg_receivexlog equivalent for 
 logical changes
 - renamed client for the walsender interface

 0011 wal_decoding: test_logical_decoding: Add extension for easier testing of 
 logical decoding
 - SQL SRF to get data from a decoding slot, also used as a vehicle for
   tests

 0012 wal_decoding: design document v2.4 and snapshot building design doc v0.5

I'm encountering a make error:

install  pg_basebackup '/home/thom/Development/psql/bin/pg_basebackup'
install  pg_receivexlog '/home/thom/Development/psql/bin/pg_receivexlog'
install  pg_recvlogical(X) '/home/thom/Development/psql/bin/pg_receivellog'
/bin/dash: 1: Syntax error: ( unexpected
make[3]: *** [install] Error 2
make[3]: Leaving directory
`/home/thom/Development/postgresql/src/bin/pg_basebackup'
make[2]: *** [install-pg_basebackup-recurse] Error 2

Thom




Re: [HACKERS] logical changeset generation v6.1

2013-09-27 Thread Andres Freund
On 2013-09-27 16:35:53 +0100, Thom Brown wrote:
 On 27 September 2013 16:14, Andres Freund and...@2ndquadrant.com wrote:
  Hi,
 
  Attached you can find an updated version of the series taking in some of
  the review comments (the others are queued, not ignored), including:
  * split off things from the big Introduce wal decoding via ... patch
  * fix the bug Steve noticed where CreateLogicalDecodingContext was passed
the wrong is_init = false where it should have been true
  * A number of smaller bugs I noticed while reviewing
  * Renaming of some variables, including guaranteedlyLogged ;)
  * Comment improvements in decode.c
  * rename pg_receivellog to pg_recvlogical
 
  I'll work more on the other points in the next days, so far they are
  clear of other big stuff.
 
 
  0001 wal_decoding: Allow walsenders to connect to a specific database
  - as before
 
  0002 wal_decoding: Log xl_running_xact's at a higher frequency than 
  checkpoints are done
  - as before
 
  0003 wal_decoding: Add information about a table's primary key to struct 
  RelationData
  - as before
 
  0004 wal_decoding: Add wal_level = logical and log data required for 
  logical decoding
  - split-off patch that contains the wal format changes, including the
addition of a new wal_level option
 
  0005 wal_decoding: Add option to treat additional tables as catalog tables
  - Option to treat user defined table as a catalog table which means it
can be accessed during logical decoding from an output plugin
 
  0006 wal_decoding: Introduce wal decoding via catalog timetravel
  - The guts of changeset extraction, without a user interface
 
  0007 wal_decoding: logical changeset extraction walsender interface
  - split-off patch containing the walsender changes, which allow receiving
the changeset data in a streaming fashion, supporting sync rep and
such fancy things
 
  0008 wal_decoding: Only peg the xmin horizon for catalog tables during 
  logical decoding
  - split-off optimization which reduces the pain 06 introduces by pegging
the xmin horizon to the smallest of the logical decoding slots. Now
    it's pegged differently for data tables than for catalog tables
 
  0009 wal_decoding: test_decoding: Add a simple decoding module in contrib
  - Example output plugin which is also used in tests
 
  0010 wal_decoding: pg_recvlogical: Introduce pg_receivexlog equivalent for 
  logical changes
  - renamed client for the walsender interface
 
  0011 wal_decoding: test_logical_decoding: Add extension for easier testing 
  of logical decoding
  - SQL SRF to get data from a decoding slot, also used as a vehicle for
tests
 
  0012 wal_decoding: design document v2.4 and snapshot building design doc 
  v0.5
 
 I'm encountering a make error:

Gah. Last-minute changes. Always the same... Updated patch attached.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


0010-wal_decoding-pg_recvlogical-Introduce-pg_receivexlog.patch.gz
Description: application/patch-gzip



[HACKERS] hstore extension version screwup

2013-09-27 Thread Andrew Dunstan


When adding json support to hstore, I made a major blunder and added the 
new functionality to the existing sql script instead of bumping the 
version, renaming the script and adding an update script.


This was lazy and there's no real excuse, although I will note that it 
was a mistake far too easy to make. Perhaps as a warning indicator we 
should remove write permissions from these files.


Anyway, I have had some discussions with Dimitri, and the best idea 
seems to be that we should do all the above, but in the update script 
use conditional logic that only adds the functions if they aren't 
already there and dependent on the extension. In the release notes we 
should advise anyone who has loaded hstore to run ALTER EXTENSION 
hstore UPDATE TO '1.2';


The minor downside of this is that the upgrade script will depend on 
plpgsql being available. We'll need to note that too, although I don't 
recall the last time I came across a site that didn't have it loaded.


See attached for details of what's proposed.

cheers

andrew




diff --git a/contrib/hstore/hstore--1.1--1.2.sql 
b/contrib/hstore/hstore--1.1--1.2.sql
new file mode 100644
index 000..99b8a16
--- /dev/null
+++ b/contrib/hstore/hstore--1.1--1.2.sql
@@ -0,0 +1,47 @@
+/* contrib/hstore/hstore--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION hstore UPDATE TO '1.2'" to load this file. \quit
+
+
+-- A version of 1.1 was shipped with these objects mistakenly in 9.3.0.
+-- Therefore we only add them if we detect that they aren't already there and 
+-- dependent on the extension.
+
+DO LANGUAGE plpgsql
+
+$$
+
+BEGIN
+
+   PERFORM 1
+   FROM pg_proc p
+   JOIN  pg_depend d
+  ON p.proname = 'hstore_to_json_loose'
+AND d.objid = p.oid
+AND d.refclassid = 'pg_extension'::regclass
+   JOIN pg_extension x
+  ON d.refobjid = x.oid
+AND  x.extname = 'hstore';
+
+   IF NOT FOUND
+   THEN 
+
+CREATE FUNCTION hstore_to_json(hstore)
+RETURNS json
+AS 'MODULE_PATHNAME', 'hstore_to_json'
+LANGUAGE C IMMUTABLE STRICT;
+
+CREATE CAST (hstore AS json)
+  WITH FUNCTION hstore_to_json(hstore);
+
+CREATE FUNCTION hstore_to_json_loose(hstore)
+RETURNS json
+AS 'MODULE_PATHNAME', 'hstore_to_json_loose'
+LANGUAGE C IMMUTABLE STRICT;
+
+   END IF;
+
+END;
+
+$$;
diff --git a/contrib/hstore/hstore--1.1.sql b/contrib/hstore/hstore--1.2.sql
similarity index 100%
rename from contrib/hstore/hstore--1.1.sql
rename to contrib/hstore/hstore--1.2.sql
diff --git a/contrib/hstore/hstore.control b/contrib/hstore/hstore.control
index 4104e17..9daf5e2 100644
--- a/contrib/hstore/hstore.control
+++ b/contrib/hstore/hstore.control
@@ -1,5 +1,5 @@
 # hstore extension
 comment = 'data type for storing sets of (key, value) pairs'
-default_version = '1.1'
+default_version = '1.2'
 module_pathname = '$libdir/hstore'
 relocatable = true



Re: [HACKERS] backup.sgml-cmd-v003.patch

2013-09-27 Thread Robert Haas
On Fri, Sep 27, 2013 at 9:49 AM, Karl O. Pinc k...@meme.com wrote:
 Hi Robert,
 On 09/27/2013 05:56:52 AM, Robert Haas wrote:

 1. Attempting to encourage people to consider custom format dumps.

 What's important is what you can do...

 Your critique seems obvious in retrospect.  Sorry you had
 to step in here and do my job.  The above point is particularly
 salient.  I will try to keep it in mind in the future.

Hey, your reviewing is much appreciated.  I actually didn't notice
your reply before I sent mine; I started writing it yesterday
afternoon and finished this morning, so I wasn't intending to preempt
you in any way.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Minmax indexes

2013-09-27 Thread Jim Nasby

On 9/26/13 2:46 PM, Robert Haas wrote:

On Thu, Sep 26, 2013 at 2:58 PM, Jim Nasby j...@nasby.net wrote:

Why would we add additional code complexity when forks do the trick? That
seems like a step backwards, not forward.


Well, they sorta do the trick, but see e.g. commit
ece01aae479227d9836294b287d872c5a6146a11.  I doubt that's the only
code that's poorly-optimized for multiple forks; IOW, every time
someone adds a new fork, there's a system-wide cost to that, even if
that fork is only used in a tiny percentage of the relations that
exist in the system.


Yeah, we obviously kept things simpler when adding forks in order to get the 
feature out the door. There are improvements that need to be made. But IMHO 
that's not a reason to automatically avoid forks; we need to consider the cost of 
improving them vs. what we gain by using them.

Of course there's always some added cost so we shouldn't just blindly use them 
all over the place without considering the fork cost either...


It's tempting to think that we can use the fork mechanism every time
we have multiple logical streams of blocks within a relation and
don't want to figure out a way to multiplex them onto the same
physical file.  However, the reality is that the fork mechanism isn't
up to the job.  I certainly don't want to imply that we shouldn't have
gone in that direction - both the fsm and the vm are huge steps
forward, and we wouldn't have gotten them in 8.4 without that
mechanism.  But they haven't been entirely without their own pain,
too, and that pain is going to grow the more we push in the direction
of relying on forks.


Agreed.

Honestly, I think we actually need more obfuscation between what happens on the 
filesystem and the rest of postgres... we're starting to look at areas where 
that would help. For example, the recent idea of being able to truncate 
individual relation files and not being limited to only truncating the end of 
the relation. My concern in that case is that 1GB is a pretty arbitrary size 
that we happened to pick, so if we're going to go for more efficiency in 
storage we probably shouldn't just blindly stick with 1G (though of course 
initial implementation might do that to reduce complexity, but we better still 
consider where we're headed).


If the only complaint about forks is directory traversal why wouldn't we go
with the well established practice of using multiple directories instead of
glomming everything into one place?


That's not the only complaint about forks - but I support what you're
proposing there anyhow, because it will be helpful to users with lots
of relations regardless of what we do or do not decide to do about
forks.




--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Extra functionality to createuser

2013-09-27 Thread Jim Nasby

On 9/27/13 6:01 AM, Robert Haas wrote:

Thus, the above commands might be replaced by:
PGUSER=postgres createuser -D -S -l -g
app_readonly_role,app2_writer_role my_new_user

Would this be worth adding to the ToDo list?

I'd be inclined to favor a patch implementing this.


+1
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Minmax indexes

2013-09-27 Thread Greg Stark
On Fri, Sep 27, 2013 at 7:22 PM, Jim Nasby j...@nasby.net wrote:

 Yeah, we obviously kept things simpler when adding forks in order to get the 
 feature out the door. There are improvements that need to be made. But IMHO 
 that's not a reason to automatically avoid forks; we need to consider the cost 
 of improving them vs. what we gain by using them.


I think this gives short shrift to the decision to introduce forks.
If you go back to the discussion at the time it was a topic of debate
and the argument which won the day is that interleaving different
streams of data in one storage system is exactly what the file system
is designed to do and we would just be reinventing the wheel if we
tried to do it ourselves. I think that makes a lot of sense for things
like the fsm or vm which grow indefinitely and are maintained by a
different piece of code from the main heap.

The tradeoff might be somewhat different for the pieces of a data
structure like a bitmap index or gin index where the code responsible
for maintaining it is all the same.

 Honestly, I think we actually need more obfuscation between what happens on 
 the filesystem and the rest of postgres... we're starting to look at areas 
 where that would help. For example, the recent idea of being able to truncate 
 individual relation files and not being limited to only truncating the end of 
 the relation. My concern in that case is that 1GB is a pretty arbitrary size 
 that we happened to pick, so if we're going to go for more efficiency in 
 storage we probably shouldn't just blindly stick with 1G (though of course 
 initial implementation might do that to reduce complexity, but we better 
 still consider where we're headed).

The ultimate goal here would be to get the filesystem to issue a TRIM
call so an SSD storage system can reuse the underlying blocks.
Truncating 1GB files might be a convenient way to do it, especially if
we have some new kind of vacuum full that can pack tuples within each
1GB file.

But there may be easier ways to achieve the same thing. If we can
notify the filesystem that we're not using some of the blocks in the
middle of the file we might be able to just leave things where they
are and have holes in the files. Or we might be better off not
depending on truncate and just look for ways to mark entire 1GB files
as deprecated and move tuples out of them until we can just remove
that whole file.
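
To make the "holes in the files" idea concrete: on Linux this would presumably
be done with fallocate() hole punching, roughly as sketched below. The file
name and block offsets are hypothetical, and PostgreSQL does nothing like this
today; the sketch only illustrates the filesystem facility being referred to.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    /* hypothetical relation segment and a block range known to be dead */
    int     fd = open("base/16384/16385.1", O_WRONLY);
    off_t   offset = 128 * 8192;    /* first dead block */
    off_t   length = 256 * 8192;    /* length of the dead range */

    if (fd < 0)
        return 1;

    /* Deallocate the blocks but keep the apparent file size unchanged. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset, length) != 0)
        perror("fallocate");

    close(fd);
    return 0;
}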

-- 
greg




Re: [HACKERS] [PERFORM] Cpu usage 100% on slave. s_lock problem.

2013-09-27 Thread Merlin Moncure
On Fri, Sep 27, 2013 at 7:56 AM, Merlin Moncure mmonc...@gmail.com wrote:
 On Thu, Sep 26, 2013 at 10:14 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On Thu, Sep 26, 2013 at 6:08 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
 On 2013-08-27 12:17:55 -0500, Merlin Moncure wrote:
 On Tue, Aug 27, 2013 at 10:55 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  On 2013-08-27 09:57:38 -0500, Merlin Moncure wrote:
  + bool
  + RecoveryMightBeInProgress(void)
  + {
  + /*
  +  * We check shared state each time only until we leave recovery 
  mode. We
  +  * can't re-enter recovery, so there's no need to keep checking 
  after the
  +  * shared variable has once been seen false.
  +  */
  + if (!LocalRecoveryInProgress)
  + return false;
  + else
  + {
  + /* use volatile pointer to prevent code rearrangement */
  + volatile XLogCtlData *xlogctl = XLogCtl;
  +
  + /* Intentionally query xlogctl without spinlocking! */
  + LocalRecoveryInProgress = 
   xlogctl->SharedRecoveryInProgress;
  +
  + return LocalRecoveryInProgress;
  + }
  + }
 
  I don't think it's acceptable to *set* LocalRecoveryInProgress
  here. That should only be done in the normal routine.

 quite right -- that was a major error -- you could bypass the
 initialization call to the xlog with some bad luck.

 I've seen this in profiles since, so I'd appreciate pushing this
 forward.

 roger that -- will push ahead when I get into the office...

 attached is new version fixing some comment typos.

Attached is a simplified patch that replaces the spinlock with a read
barrier, based on a suggestion made by Andres offlist.  The approach
has different performance characteristics -- a barrier call is being
issued instead of a non-blocking read.   I don't have a performance
test case in hand to prove that's better, so I'm going with Andres's
approach because it's simpler.  Aside: can this truly be the only
caller of pg_read_barrier()?
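
For readers following along without the attachment, the shape of the change
being discussed is roughly the following. This is only a sketch, reusing the
identifiers from the function quoted above; whether a read barrier is the
right fence here, or whether a full memory barrier is needed, is exactly what
gets debated downthread.

bool
RecoveryMightBeInProgress(void)
{
    /*
     * Once we have seen recovery finish we can skip the shared read;
     * recovery cannot be re-entered.
     */
    if (!LocalRecoveryInProgress)
        return false;
    else
    {
        /* use volatile pointer to prevent code rearrangement */
        volatile XLogCtlData *xlogctl = XLogCtl;
        bool        result = xlogctl->SharedRecoveryInProgress;

        /* keep later loads in the caller ordered after this one */
        pg_read_barrier();

        /*
         * Deliberately do not cache the result here; only the normal
         * RecoveryInProgress() path, which performs initialization,
         * may update LocalRecoveryInProgress.
         */
        return result;
    }
}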

Also, moving to -hackers from -performance.

merlin


recovery5.patch
Description: Binary data



Re: [HACKERS] Minmax indexes

2013-09-27 Thread Jim Nasby

On 9/27/13 1:43 PM, Greg Stark wrote:

Honestly, I think we actually need more obfuscation between what happens on the 
filesystem and the rest of postgres... we're starting to look at areas where 
that would help. For example, the recent idea of being able to truncate 
individual relation files and not being limited to only truncating the end of 
the relation. My concern in that case is that 1GB is a pretty arbitrary size 
that we happened to pick, so if we're going to go for more efficiency in 
storage we probably shouldn't just blindly stick with 1G (though of course 
initial implementation might do that to reduce complexity, but we better still 
consider where we're headed).

The ultimate goal here would be to get the filesystem to issue a TRIM
call so an SSD storage system can reuse the underlying blocks.
Truncating 1GB files might be a convenient way to do it, especially if
we have some new kind of vacuum full that can pack tuples within each
1GB file.

But there may be easier ways to achieve the same thing. If we can
notify the filesystem that we're not using some of the blocks in the
middle of the file we might be able to just leave things where they
are and have holes in the files. Or we might be better off not
depending on truncate and just look for ways to mark entire 1GB files
as deprecated and move tuples out of them until we can just remove
that whole file.


Yeah, there's a ton of different things we might do. And dealing with free space is just 
one example... things like the VM give us the ability to detect areas of the heap that 
have gone dormant; imagine if we could seamlessly move that data to its own 
storage, possibly compressing it at the same time. (Yes, I realize there's partitioning 
and tablespaces and compressing filesystems, but those are a lot more work and will never 
be as efficient as what the database itself can do).

Anyway, I think we're all on the same page. We should stop hijacking Alvaro's 
thread... ;)
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net




Re: [HACKERS] Re: Request for Patch Feedback: Lag Lead Window Functions Can Ignore Nulls

2013-09-27 Thread Heikki Linnakangas

On 27.09.2013 14:12, Robert Haas wrote:

On Thu, Sep 26, 2013 at 4:20 PM, Alvaro Herrera
alvhe...@2ndquadrant.com  wrote:

But how do we ensure that the BMS is allocated in a context?  You'd have
to switch contexts each time you call bms_add_member.  I don't have a
good answer to this.


The coding of bms_add_member is pretty funky.  Why doesn't it just
repalloc() the input argument if it's not big enough?  If it did that,
the new allocation would be in the same memory context as the original
one, and we'd not need to care about the memory context when adding an
element to an already non-empty set.

But even if we did decide to switch memory contexts on every call, it
would still be much cleaner than this.  It's only three lines of code
(and about as many instructions) to save and restore
CurrentMemoryContext.


[looks] Huh, yeah, as it stands, bms_add_member() is an accident waiting 
to happen. It's totally non-obvious that it might cause the BMS to jump 
from another memory context to CurrentMemoryContext. Same with 
bms_add_members(). And it would be good to Assert in bms_join that both 
arguments come from the same MemoryContext.
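
A minimal sketch of the repalloc()-based approach Robert describes might look
like this, assuming the existing WORDNUM/BITNUM/BITMAPSET_SIZE macros and
bms_make_singleton() from nodes/bitmapset.c; it is shown only to illustrate
the idea, not as the actual fix:

Bitmapset *
bms_add_member(Bitmapset *a, int x)
{
    int     wordnum,
            bitnum;

    if (x < 0)
        elog(ERROR, "negative bitmapset member not allowed");
    if (a == NULL)
        return bms_make_singleton(x);

    wordnum = WORDNUM(x);
    bitnum = BITNUM(x);

    /* enlarge the set if necessary, staying in a's memory context */
    if (wordnum >= a->nwords)
    {
        int     oldnwords = a->nwords;
        int     i;

        a = (Bitmapset *) repalloc(a, BITMAPSET_SIZE(wordnum + 1));
        a->nwords = wordnum + 1;
        /* zero out the newly added words */
        for (i = oldnwords; i < a->nwords; i++)
            a->words[i] = 0;
    }

    a->words[wordnum] |= ((bitmapword) 1 << bitnum);
    return a;
}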


Let's fix that...

- Heikki




Re: [HACKERS] [PERFORM] Cpu usage 100% on slave. s_lock problem.

2013-09-27 Thread Heikki Linnakangas

On 27.09.2013 22:00, Merlin Moncure wrote:

On Fri, Sep 27, 2013 at 7:56 AM, Merlin Moncure mmonc...@gmail.com wrote:

On Thu, Sep 26, 2013 at 10:14 PM, Merlin Moncure mmonc...@gmail.com wrote:

On Thu, Sep 26, 2013 at 6:08 PM, Andres Freund and...@2ndquadrant.com wrote:

On 2013-08-27 12:17:55 -0500, Merlin Moncure wrote:

On Tue, Aug 27, 2013 at 10:55 AM, Andres Freund and...@2ndquadrant.com wrote:

On 2013-08-27 09:57:38 -0500, Merlin Moncure wrote:

+ bool
+ RecoveryMightBeInProgress(void)
+ {
+ /*
+  * We check shared state each time only until we leave recovery mode. We
+  * can't re-enter recovery, so there's no need to keep checking after the
+  * shared variable has once been seen false.
+  */
+ if (!LocalRecoveryInProgress)
+ return false;
+ else
+ {
+ /* use volatile pointer to prevent code rearrangement */
+ volatile XLogCtlData *xlogctl = XLogCtl;
+
+ /* Intentionally query xlogctl without spinlocking! */
+ LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+
+ return LocalRecoveryInProgress;
+ }
+ }


I don't think it's acceptable to *set* LocalRecoveryInProgress
here. That should only be done in the normal routine.


quite right -- that was a major error -- you could bypass the
initialization call to the xlog with some bad luck.


I've seen this in profiles since, so I'd appreciate pushing this
forward.


roger that -- will push ahead when I get into the office...


attached is new version fixing some comment typos.


Attached is a simplified patch that replaces the spinlock with a read
barrier, based on a suggestion made by Andres offlist.  The approach
has different performance characteristics -- a barrier call is being
issued instead of a non-blocking read.   I don't have a performance
test case in hand to prove that's better, so I'm going with Andres's
approach because it's simpler.


Can you think of a concrete example of what might go wrong if we simply 
removed the spinlock, and did an unprotected read through the volatile 
pointer? Is there some particular call sequence that might get optimized 
in an unfortunate way? I'm not suggesting that we do that - better safe 
than sorry even if we can't think of any particular scenario - but it 
would help me understand this.


I don't think a read-barrier is enough here. See README.barrier:


When a memory barrier is being used to separate two
loads, use pg_read_barrier(); when it is separating two stores, use
pg_write_barrier(); when it is separating a load and a store (in either
order), use pg_memory_barrier().


The RecoveryInProgress() function does just one load, to read 
the variable, and wouldn't even need a barrier by itself. The other load 
or store that needs to be protected by the barrier happens in the 
caller, before or after the function, and we can't say for sure if it's 
a load or a store. So, let's use pg_memory_barrier().



 Aside: can this truly be the only caller
of pg_read_barrier()?


Yep. I only added the first caller of the barriers altogether in the 
xlog-insertion scaling patch. Robert wrote the infrastructure in 9.3, 
but it wasn't used until now, in 9.4.


- Heikki




Re: [HACKERS] [PERFORM] Cpu usage 100% on slave. s_lock problem.

2013-09-27 Thread Peter Geoghegan
On Fri, Sep 27, 2013 at 1:28 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 Yep. I only added the first caller of the barriers altogether in the
 xlog-insertion scaling patch. Robert wrote the infrastructure in 9.3, but it
 wasn't used until now, in 9.4.

FWIW, it was actually during 9.2 development that Robert first added
the barriers.

-- 
Peter Geoghegan




Re: [HACKERS] logical changeset generation v6

2013-09-27 Thread Steve Singer

On 09/26/2013 02:47 PM, Steve Singer wrote:



I've determined that in this test the walsender seems to be 
hitting this when it is decoding the transactions that are behind the 
slonik commands to add tables to replication (set add table, set add 
sequence).  This is before the SUBSCRIBE SET is submitted.


I've also noticed something else that is strange (but might be 
unrelated).  If I stop my slon process and restart it I get messages 
like:


WARNING:  Starting logical replication from 0/a9321360
ERROR:  cannot stream from 0/A9321360, minimum is 0/A9320B00

Where 0/A9321360 was sent in the last packet my slon received from the 
walsender before the restart.


If I force it to restart replication from 0/A9320B00 I see data rows that 
I appear to have already seen before the restart.
I think this is happening when my slon is killed after I process the data 
for 0/A9320B00 but before the feedback message is sent. Is this expected?





I've further narrowed this down to something (or some combination of things) 
that the _disorder_replica.altertableaddTriggers(1) stored function does 
(i.e. @SLONYNAMESPACE@.altertableaddTriggers(int)), which is essentially:
* Get an exclusive lock on sl_config_lock
* Get an exclusive lock on the user table in question
* create a trigger (the deny access trigger)
* create a truncate trigger
* create a deny truncate trigger

I am not yet able to replicate the error by issuing the same SQL 
commands from psql, but I must be missing something.


I can replicate this when just using the test_decoding plugin.


Re: [HACKERS] [PERFORM] Cpu usage 100% on slave. s_lock problem.

2013-09-27 Thread Andres Freund
On 2013-09-27 23:28:37 +0300, Heikki Linnakangas wrote:
 On 27.09.2013 22:00, Merlin Moncure wrote:
 Attached is a simplified patch that replaces the spinlock with a read
 barrier, based on a suggestion made by Andres offlist.  The approach
 has different performance characteristics -- a barrier call is being
 issued instead of a non-blocking read.   I don't have a performance
 test case in hand to prove that's better, so I'm going with Andres's
 approach because it's simpler.
 
 Can you think of a concrete example of what might go wrong if we simply
 removed the spinlock, and did an unprotected read through the volatile
 pointer? Is there some particular call sequence that might get optimized in
 an unfortunate way? I'm not suggesting that we do that - better safe than
 sorry even if we can't think of any particular scenario - but it would help
 me understand this.

I think in some architectures the write from the other side might just
not immediately be visible, so RecoveryInProgress() might return a
spurious true, even though we already finished.
I haven't checked whether there's any callers that would have a problem
with that.

 I don't think a read-barrier is enough here. See README.barrier:
 
 When a memory barrier is being used to separate two
 loads, use pg_read_barrier(); when it is separating two stores, use
 pg_write_barrier(); when it is separating a load and a store (in either
 order), use pg_memory_barrier().

I think the documentation is just worded in an easy-to-misunderstand
way. If you look at the following example, we're fine.
The two stores and two loads it is talking about are what's happening
in one thread of execution. The important part of a read barrier is that
it forces a fresh LOAD, separated from earlier LOADs of the same
address.

 The function RecoveryInProgress() function does just one load, to read the
 variable, and wouldn't even need a barrier by itself. The other load or
 store that needs to be protected by the barrier happens in the caller,
 before or after the function, and we can't say for sure if it's a load or a
 store. So, let's use pg_memory_barrier().

The caller uses a spinlock, so it's guaranteed to write out before the
spinlock is released. A write barrier (the spinlock in the startup
process) should always be paired with a read barrier.
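
To spell that pairing out, here is a generic illustration with hypothetical
struct and function names; pg_write_barrier()/pg_read_barrier() are the real
primitives from storage/barrier.h. In the case at hand the writer is the
startup process storing SharedRecoveryInProgress under its spinlock, and the
reader is a backend calling RecoveryInProgress().

#include "postgres.h"
#include "storage/barrier.h"

typedef struct DemoShared
{
    int     payload;        /* data the writer wants to publish */
    bool    done;           /* flag saying the payload is ready */
} DemoShared;

static void
demo_writer(volatile DemoShared *s)
{
    s->payload = 42;        /* store the data first */
    pg_write_barrier();     /* make the payload store visible before the flag */
    s->done = true;         /* then publish the flag */
}

static int
demo_reader(volatile DemoShared *s)
{
    if (s->done)            /* load the flag */
    {
        pg_read_barrier();  /* keep the payload load after the flag load */
        return s->payload;  /* sees 42, not a stale value */
    }
    return -1;              /* not published yet */
}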

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical changeset generation v6

2013-09-27 Thread Andres Freund
Hi Steve,

On 2013-09-27 17:06:59 -0400, Steve Singer wrote:
 I've determined that in this test the walsender seems to be hitting
 this when it is decoding the transactions that are behind the slonik
 commands to add tables to replication (set add table, set add sequence).
 This is before the SUBSCRIBE SET is submitted.
 
 I've also noticed something else that is strange (but might be unrelated).
 If I stop my slon process and restart it I get messages like:
 
 WARNING:  Starting logical replication from 0/a9321360
 ERROR:  cannot stream from 0/A9321360, minimum is 0/A9320B00
 
 Where 0/A9321360 was sent in the last packet my slon received from the
 walsender before the restart.

Uh, that looks like I fumbled some comparison. Let me check.

 I've further narrowed this down to something (or some combination of things)
 that the _disorder_replica.altertableaddTriggers(1) stored function does
 (i.e. @SLONYNAMESPACE@.altertableaddTriggers(int)), which is essentially:
 * Get an exclusive lock on sl_config_lock
 * Get an exclusive lock on the user table in question
 * create a trigger (the deny access trigger)
 * create a truncate trigger
 * create a deny truncate trigger
 
 I am not yet able to replicate the error by issuing the same SQL commands
 from psql, but I must be missing something.
 
 I can replicate this when just using the test_decoding plugin.

Thanks. That should get me started with debugging. Unless it's possibly
fixed in the latest version already -- one bug fixed there might cause
something like this if the moon stands exactly right.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Wait free LW_SHARED acquisition

2013-09-27 Thread Andres Freund
On 2013-09-27 14:46:50 +0200, Florian Pflug wrote:
 On Sep27, 2013, at 00:55 , Andres Freund and...@2ndquadrant.com wrote:
  So the goal is to have LWLockAcquire(LW_SHARED) never block unless
  somebody else holds an exclusive lock. To produce enough appetite for
  reading the rest of the long mail, here's some numbers on a
  pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620
 
  master + padding: tps = 146904.451764
  master + padding + lwlock: tps = 590445.927065
 
  That's rougly 400%.

 Interesting. I played with pretty much the same idea two years or so ago.
 At the time, I compared a few different LWLock implementations. Those
 were AFAIR

   A) Vanilla LWLocks
   B) A + an atomic-increment fast path, very similar to your proposal
   C) B but with a partitioned atomic-increment counter to further
  reduce cache-line contention
   D) A with the spinlock-based queue replaced by a lockless queue

 At the time, the improvements seemed to be negligible - they looked great
 in synthetic benchmarks of just the locking code, but didn't translate
 to improved TPS numbers. Though I think the only version that ever got
 tested on more than a handful of cores was C…

I think you really need multi-socket systems to see the big benefits
from this. My laptop barely shows any improvements, while my older 2
socket workstation already shows some in workloads that have more
contention than pgbench -S.

From a quick look, you didn't have any sleep-based queueing in at least one
of the variants there? In my tests, that was tremendously important
to improve scaling if there was any contention. Which is not surprising
in the end, because otherwise you essentially have rw-spinlocks which
really aren't suitable for many of the lwlocks we use.

Getting the queueing semantics, including releaseOK, right was what took
me a good amount of time, the atomic ops part was pretty quick...

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PERFORM] Cpu usage 100% on slave. s_lock problem.

2013-09-27 Thread Andres Freund
Hi,

What confuses me is that pg_read_barrier() is just a compiler barrier on
x86[-64] in barrier.h. According to my knowledge it needs to be an
lfence or the full barrier?
The linked papers from Paul McKenney - which are a great read - seem to
agree?

On 2013-09-27 23:12:17 +0200, Andres Freund wrote:
 On 2013-09-27 23:28:37 +0300, Heikki Linnakangas wrote:
  The function RecoveryInProgress() function does just one load, to read the
  variable, and wouldn't even need a barrier by itself. The other load or
  store that needs to be protected by the barrier happens in the caller,
  before or after the function, and we can't say for sure if it's a load or a
  store. So, let's use pg_memory_barrier().
 
 The caller uses a spinlock, so it's guaranteed to write out before the
 spinlock is released. A write barrier (the spinlock in the startup
 process) should always be paired by a read barrier.

s/caller/startup process/

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Extra functionality to createuser

2013-09-27 Thread Christopher Browne
Attached is a patch implementing the -g / --roles option for createuser.

I'll be attaching it to the open CommitFest shortly.


createuser.diff
Description: Binary data



Re: [HACKERS] record identical operator - Review

2013-09-27 Thread Kevin Grittner
Bruce Momjian br...@momjian.us wrote:
 On Thu, Sep 19, 2013 at 06:31:38PM -0400, Steve Singer wrote:
 On 09/12/2013 06:27 PM, Kevin Grittner wrote:
 Attached is a patch for a bit of infrastructure I believe to be
 necessary for correct behavior of REFRESH MATERIALIZED VIEW
 CONCURRENTLY as well as incremental maintenance of matviews.

 Here is attempt at a review of the v1 patch.

 There has been a heated discussion on list about if we even want
 this type of operator. I will try to summarize it here

 The arguments against it are
 * that a record_image_identical call on two records that returns
 false can still return true under the equality operator, and the
 records might (or might not) appear to be the same to users
 * Once we expose this as an operator via SQL someone will find
 it and use it and then complain when we change things such as
 the internal representation of a datatype which might break
 their queries

I don't see where that is possible unless they count on the order
of records sorted with the new operators, which would not be
something I recommend.  Changing the internal representation cannot
break simple use of the alternative equality definition as long as
all you cared about was whether the binary storage format of the
values was identical.

 * The differences between = and record_image_identical might not
 be understood by everyone who finds the operator, leading to
 confusion

Either they find it and read the code, or we document it and they
read the docs.

 * Using a criteria other than equality (the = operator provided
 by the datatype) might cause problems if we later provide 'on
 change' triggers for materialized views

I would fight user-defined triggers for matviews tooth and nail.
We need to get incremental maintenance working instead.

 The arguments for this patch are
 * We want the materialized view to return the same value as
 would be returned if the query were executed directly.  This not
 only means that the values should be the same according to a
 datatypes = operator but that they should appear the same 'to
 the eyeball'.

And to functions the user can run against the values.  The example
with the null bitmap for arrays being included when not necessary
is something that isn't directly apparent to the eye, but queries
which use pg_column_size would not get equal results.

 * Supporting the materialized view refresh check via SQL makes a
 lot of sense but doing that requires exposing something via SQL
 * If we adequately document what we mean by
 record_image_identical and the operator we pick for this then
 users shouldn't be surprised at what they get if they use this

We first need to document the existing record comparison operators.
If they read the docs for comparing row_constructors and expect
that to be the behavior they get when they compare records, they
will be surprised.

 This is a good summary.  I think there are a few things that make
 this issue difficult to decide:

 1.  We have to use an operator to give the RMVC (REFRESH
 MATERIALIZED VIEW CONCURRENTLY) SPI query the ability to optimize
 this query.  If we could do this with an SQL or C function, there
 would be less concern about the confusion caused by adding this
 capability.

 (a)  We can't.

 (b)  Why would that be less confusing?

 Question:  If we are comparing based on some primary key, why do
 we need this to optimize?

Because comparing primary keys doesn't tell us whether the old and
new values in the row all match.

 Certainly the primary key index will be used, no?

It currently uses all columns which are part of unique indexes on
the matview which are not partial (i.e., they do not use a WHERE
clause) and which index only columns (not expressions).  It is not
possible to define a primary key on a matview.  There are two
reasons for using these columns:

(a)  It provides a correctness guarantee against having duplicated
rows which have no nulls.  This guarantee is what allows us to use
the differential technique which is faster than retail DELETE and
re-INSERT of everything.

(b)  Since we don't support hash joins of records, and it would
tend to be pretty slow if we did, it allows faster hash joins on
one or more columns which stand a good chance of doing this
efficiently.

 2.  Aggregates are already non-deterministic,

Only some of them, and only with certain source data.

 so some feel that adding this feature doesn't improve much.

It allows a materialized view which contains a column of a type
without a default btree opclass to *use* concurrent refresh; it
makes queries with deterministic results always generate matview
results which exactly match the results of a VIEW using the query if
it were run at the same moment; and for nondeterministic queries,
it provides results consistent with *some* result set which could
have been returned by a view at the same time.  Unless we do
something, none of these things are true.  That seems like a lot to
me.

 The counter-argument is that 

Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE

2013-09-27 Thread Peter Geoghegan
On Fri, Sep 27, 2013 at 5:36 AM, Robert Haas robertmh...@gmail.com wrote:
 I don't have another idea either. In fact, I'd go so far as to say
 that doing any third thing that's better than those two to any
 reasonable person is obviously impossible. But I'd add that we simply
 cannot roll back at read committed, so we're just going to have to hold
 our collective noses and do strange things with visibility.

 I don't accept that as a general principle.  We're writing the code;
 we can make it behave any way we think best.

I presume you're referring to the principle that we cannot throw
serialization failures at read committed. I'd suggest that letting
that happen would upset a lot of people, because it's so totally
unprecedented. A large segment of our user base would just consider
that to be Postgres randomly throwing errors, and would be totally
dismissive of the need to do so, and not without some justification -
no one else does the same. The reality is that the majority of our
users don't even know what an isolation level is. I'm not just talking
about people that use Postgres more casually, such as Heroku
customers. I've personally talked to people who didn't even know what
a transaction isolation level was, that were in a position where they
really, really should have known.

 I doubt that any change to HeapTupleSatisfiesMVCC() will be
 acceptable.  This feature needs to restrain itself to behavior changes
 that only affect users of this feature, I think.

I agree with the principle of what you're saying, but I'm not aware
that those changes to HeapTupleSatisfiesMVCC() imply any behavioral
changes for those not using the feature. Certainly, the standard
regression tests and isolation tests still pass, for what it's worth.
Having said that, I have not thought about it enough to be willing to
actually defend that bit of code. Though I must admit that I am a
little encouraged by the fact that it passes casual inspection.

I am starting to wonder if it's really necessary to have a blessed
update that can see the locked, not-otherwise-visible tuple. Doing
that certainly has its disadvantages, both in terms of code complexity
and in terms of being arbitrarily restrictive. We're going to have to
allow the user to see the locked row after it's updated (the new row
version that we create will naturally be visible to its creating xact)
- is it really any worse that the user can see it before an update (or
a delete)? The user could decide to effectively make the update change
nothing, and see the same thing anyway.

I get why you're averse to doing odd things to visibility - I was too.
I just don't see that we have a choice if we want this feature to work
acceptably with read committed. In addition, as it happens I just
don't see that the general situation is made any worse by the fact
that the user might be able to see the row before an update/delete.
Isn't is also weird to update or delete something you cannot see?

Couldn't EvalPlanQual() be said to be an MVCC violation on similar
grounds? It also reaches into the future. Locking a row isn't really
that distinct from updating it, both in terms of the code footprint
and from a logical perspective.

 It's probably important to avoid having
 them keep recreating speculative tuples and then killing them as long
 as a candidate tuple is available, so that they don't create a dead
 tuple per iteration.  But that seems doable.

I'm not so sure.

 Realize you can generally only kill the heap tuple *before* you have
 the row lock, because otherwise a totally innocent non-HOT update (may
 not update any unique indexed columns at all) will deadlock with your
 session, which I don't think is defensible, and will probably happen
 often if allowed to (after all, this is upsert - users are going to
 want to update their locked rows!).

 I must be obtuse; I don't see why that would deadlock.

If you don't see it, then you aren't being obtuse in asking for
clarification. It's really easy to be wrong about this kind of thing.

If the non-HOT update updates some random row, changing the key
columns, it will lock that random row version. It will then proceed
with value locking (i.e. inserting index tuples in the usual way, in
this case with entirely new values). It might then block on one of the
index tuples we, the upserter, have already inserted (these are our
value locks under your scheme). Meanwhile, we (the upserter) might
have *already* concluded that the *old* heap row that the regular
updater is in the process of rendering invisible is to blame in
respect of some other value in some later unique index, and that *it*
must be locked. Deadlock. This seems very possible if the key values
are somewhat correlated, which is probably generally quite common.

The important observation here is that an updater, in effect, locks
both the old and new sets of values (for non-HOT updates). And as I've
already noted, any practical value locking implementation isn't
going to be able 

Re: [HACKERS] INSERT...ON DUPLICATE KEY LOCK FOR UPDATE - visibility semantics

2013-09-27 Thread Peter Geoghegan
On Tue, Sep 24, 2013 at 2:14 AM, Andres Freund and...@2ndquadrant.com wrote:
 Various messages are discussing semantics around visibility. I by now
 have a hard time keeping track. So let's keep the discussion of the
 desired semantics to this thread.

Yes, it's pretty complicated.

I meant to comment on this here, but ended up saying some stuff to
Robert about this in the main thread, so I should probably direct you
to that. You were probably right to start a new thread, because I
think we can usefully discuss this topic in parallel, but that's just
what ended up happening.

-- 
Peter Geoghegan

