Re: [HACKERS] add modulo (%) operator to pgbench

2014-09-25 Thread Fabien COELHO


Hello Robert,


I think you're making a mountain out of a molehill.


Probably. I tend to try the minimum effort first.


I implemented this today in about three hours.  The patch is attached.


Great!

Your patch is 544 lines, my size evaluation was quite good:-)

Note that I probably spent 10 minutes on the 10-line patch, but I
probably spent hours writing mails about it.


It needs more testing, documentation, and possibly some makefile 
adjustments, but it seems to basically work.


I'll try to do that, but not in the short term.

Note that the modulo is not the one I need, but probably I can make do 
with an abs() function and/or a conditional.
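
To make the distinction concrete (this example is mine, not Fabien's): C's %
truncates toward zero, so the result takes the sign of the dividend, while a
"mathematical" modulo is never negative; as he suggests, one conditional
fix-up bridges the two. floor_mod here is a hypothetical helper:

	#include <stdio.h>

	static int
	floor_mod(int a, int b)
	{
		int		r = a % b;			/* C semantics: (-7) % 5 == -2 */

		return (r < 0) ? r + b : r;	/* fix-up for b > 0: gives 3 */
	}

	int
	main(void)
	{
		printf("%d %d\n", -7 % 5, floor_mod(-7, 5));	/* prints: -2 3 */
		return 0;
	}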


--
Fabien.




Re: [HACKERS] add modulo (%) operator to pgbench

2014-09-25 Thread Fabien COELHO


I think you're making a mountain out of a molehill. I implemented this 
today in about three hours.


I think you're greatly understating your own efficiency at shift/reducing 
parser mountains down to molehills.  Fabien even guessed the LOC size of the 
resulting patch with less than a 9% error.


As I do research and also teach compilers, it was not that hard for me to 
provide a reasonable size estimation.


I also noticed Robert's outstanding productivity of 3 lines per minute. Once 
you include arguing, testing, debugging and documenting, that may go 
down a bit, but it is very good anyway.


You can also notice that in the patch the handling of '%' involves 11 
lines (1 lexer, 1 parser, 9 executor), basically the very same size as the 
patch I submitted, for very similar code.


[...].  Yes, the PostgreSQL community is hostile to 
short, targeted feature improvements unless they come already fit into a 
large, standards-compliant strategy for improving the database.  That doesn't 
work well when approached by scratch-an-itch style development.


Indeed I tend to avoid spending hours on something when I can spend 
minutes, and if I spend hours I'm really cross if a patch is rejected, 
whereas I'm less cross if I just lose minutes.


I had bad experiences with patch submissions in the past when things were 
changed and someone did not want them, and it seems that this patch falls 
within this risk, hence my reluctance to spend the time.

Now, as the patch is submitted by a core dev, I think the likelihood that 
some version will be committed is high, so all is well, and I'll gladly 
review and test it when I have time, alas not in the short term.


--
Fabien.




Re: [HACKERS] Selectivity estimation for inet operators

2014-09-25 Thread Heikki Linnakangas

On 09/07/2014 07:09 PM, Emre Hasegeli wrote:

I updated the patch to cover semi and anti joins with eqjoinsel_semi().
I think it is better than returning a constant.


What you did there is utterly unacceptable from a modularity standpoint;
and considering that the values will be nowhere near right, the argument
that it's better than returning a constant seems pretty weak.  I think
you should just take that out again.


I will try to come up with a better, data type specific implementation
in a week.


New version with semi join estimation function attached.


Thanks. Overall, my impression of this patch is that it works very well. 
But damned if I understood *how* it works :-). There's a lot of 
statistics involved, and it's not easy to see why something is 
multiplied by something else. I'm adding comments as I read through it.


I've gotten to the inet_semi_join_selec function:


/*
 * Inet semi join selectivity estimation.
 */
static Selectivity
inet_semi_join_selec(bool mcv2_exists, Datum *mcv2_values, int mcv2_nvalues,
                     bool his2_exists, Datum *his2_values, int his2_nvalues,
                     double his2_weight, Datum *constvalue,
                     FmgrInfo *proc, short opr_order)
{
    if (mcv2_exists)
    {
        int         i;

        for (i = 0; i < mcv2_nvalues; i++)
        {
            if (DatumGetBool(FunctionCall2Coll(proc, DEFAULT_COLLATION_OID,
                                               *constvalue, mcv2_values[i])))
                return 1.0;
        }
    }

    /* Do not bother if histogram weight is smaller than 0.1. */
    if (his2_exists && his2_weight > 0.1)
    {
        Selectivity his_selec;

        his_selec = inet_his_inclusion_selec(his2_values, his2_nvalues,
                                             constvalue, opr_order);

        if (his_selec > 0)
            return Min(1.0, his2_weight * his_selec);
    }

    return 0.0;
}


This desperately needs a comment at the top of the function explaining 
what it does. Let me try to explain what I think it does:


This function calculates the probability that there is at least one row 
in table B which satisfies the "constant op column" qual. The constant 
is passed as an argument, and for table B, the MCV list and histogram are 
provided. his2_weight is the total number of rows in B that are covered 
by the histogram. For example, if the table has 1000 rows, and 10% of 
the rows in the table are in the MCV, and another 10% are NULLs, 
his2_weight would be 800.


First, we check if the constant matches any of the most common values. 
If it does, return 1.0, because then there is surely a match.


Next, we use the histogram to estimate the number of rows in the table 
that match the qual. If it amounts to more than 1 row, we return 1.0. 
If it's between 0.0 and 1.0 rows, we return that number as the probability.



Now, I think that last step is wrong. Firstly, the "Do not bother if 
histogram weight is smaller than 0.1" rule seems bogus. The his2_weight 
is the total number of rows represented by the histogram, so surely it 
can't be less than 1. It can't really be less than the statistics 
target. Unless maybe if the histogram was collected when the table was 
large, but it has since shrunk to contain only a few rows, but that 
seems like a very bizarre corner case. At least it needs more comments 
explaining what the test is all about, but I think we should just always 
use the histogram (if it's available).


Secondly, if we estimate that there is on average 1.0 matching row in 
the table, it does not follow that the probability that at least one row 
matches is 1.0. Assuming a gaussian distribution with mean 1.0, the 
probability that at least one row matches is 0.5. Assuming a gaussian 
distribution here isn't quite right - I guess a Poisson distribution 
would be more accurate - but it sure doesn't seem right as it is.


The error isn't very big, and perhaps you don't run into that very 
often, so I'm not sure what the best way to fix that would be. My 
statistics skills are a bit rusty, but I think the appropriate way would 
be to apply the Poisson distribution, with the estimated number of 
matched rows as the mean. The probability of at least one match would be 
the cumulative distribution function at k=1. It sounds like overkill if 
this case occurs only rarely. But then again, perhaps it's not all 
that rare.
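
(A worked version of that calculation, not in the original mail: if the
number of matching rows is Poisson-distributed with mean \(\lambda\) equal
to the estimated match count, then

\[
P(\text{at least one match}) = 1 - P(X = 0) = 1 - e^{-\lambda},
\]

so for an estimate of \(\lambda = 1.0\) the probability is
\(1 - e^{-1} \approx 0.63\), not 1.0.)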


That said, I can't immediately find a test case where that error would 
matter. I tried this:


create table inettbl1 (a inet);
insert into inettbl1 select '10.0.0.' || (g % 255) from 
generate_series(1, 10) g;

analyze inettbl1;
explain analyze select count(*) from inettbl1 where a = ANY 

Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2014-09-25 Thread Andres Freund
On 2014-09-25 10:24:39 +0530, Abhijit Menon-Sen wrote:
 At 2014-09-24 11:09:24 +0200, and...@2ndquadrant.com wrote:
  I think it's completely unacceptable to copy a visibility routine.
 
 OK. Which visibility routine should I use, and should I try to create a
 variant that doesn't set hint bits?

I've not yet followed your premise that you actually need one that
doesn't set hint bits...

 I don't have any reasoning for why it's safe to not hold a content lock.
 If there is one, I need help to find it. If not, I'll acquire a content
 lock. (If anyone can explain why it isn't safe, I would appreciate it.)

I don't see why it'd be safe. Without the content lock you can come
across half written tuple headers and similar things. You can try to
argue that all the possible faults of that are harmless, but I think
that'd require a tremendous amount of code review and it'd be hard to
continue to guarantee. There's a reason we acquire locks, you know :)

I think it's saner to first get this working & committed properly *with*
the lock. If you afterwards have energy to improve it further we can take
another look.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2014-09-25 Thread Abhijit Menon-Sen
At 2014-09-25 11:41:29 +0200, and...@2ndquadrant.com wrote:

 I've not yet followed your premise that you actually need one that
 doesn't set hint bits...

Oh.

All right, then I'll post a version that addresses Amit's other points,
adds a new file/function to pgstattuple, acquires content locks, and
uses HeapTupleSatisfiesVacuum, hint-bit setting and all. Maybe I'll
call it not_too_slow_bloat().

Thank you for the feedback.

-- Abhijit




Re: [HACKERS] make pg_controldata accept -D dirname

2014-09-25 Thread Heikki Linnakangas

On 09/24/2014 05:48 PM, Abhijit Menon-Sen wrote:

Updated patches attached.


Thanks, applied some version of these.

- Heikki





Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-25 Thread Gregory Smith

On 9/25/14, 1:41 AM, David Johnston wrote:
If the error message is written correctly most people upon seeing the 
error will simply fix their configuration and move on - regardless of 
whether they were proactive in doing so having read the release notes.


The important part to realize here is that most people will never see 
such an error message.  There is a person/process who breaks the 
postgresql.conf, a process that asks for a configuration restart/reload 
(probably via pg_ctl), and then the postmaster process creating a 
server log entry that shows the error (maybe in pgstartup.log, maybe in 
pg_log, maybe in syslog, maybe in the Windows Event Log).


In practice, the top of that food chain never knows what's happening at 
the bottom unless something goes so seriously wrong the system is down, 
and they are forced to drill down into all of these log destinations.  
That's why a subset of us consider any error message based approaches to 
GUC guidance a complete waste of time.  I won't even address the rest of 
your comments; you're focusing on trivia around something that just 
fundamentally isn't useful at all.


My challenge to anyone who thinks error checking has value for this 
issue is to demonstrate how that would play out usefully on a mainstream 
Postgres system like RHEL/CentOS, Debian, or even Windows.  Put your 
bogus setting in the config file, activate that config file in a 
Postgres that looks for the rounding errors people dislike, and show me how 
that mistake is made apparent to the user who made it.  I've done 
similar exercises myself, and my guess is that if the system is up at 
all, those error messages went by completely unnoticed.


--
Greg Smith greg.sm...@crunchydatasolutions.com
Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/




Re: [HACKERS] Patch to support SEMI and ANTI join removal

2014-09-25 Thread Heikki Linnakangas

On 09/16/2014 01:20 PM, David Rowley wrote:

+   /*
+    * We mustn't allow any joins to be removed if there are any pending
+    * foreign key triggers in the queue. This could happen if we are planning
+    * a query that has been executed from within a volatile function and the
+    * query which called this volatile function has made some changes to a
+    * table referenced by a foreign key. The reason for this is that any
+    * updates to a table which is referenced by a foreign key constraint will
+    * only have the referencing tables updated after the command is complete,
+    * so there is a window of time where records may violate the foreign key
+    * constraint.
+    *
+    * Currently this code is quite naive, as we won't even attempt to remove
+    * the join if there are *any* pending foreign key triggers, on any
+    * relation. It may be worthwhile to improve this to check if there's any
+    * pending triggers for the referencing relation in the join.
+    */
+   if (!AfterTriggerQueueIsEmpty())
+       return false;


Hmm. This code runs when the query is planned. There is no guarantee 
that there won't be after triggers pending when the query is later 
*executed*.


- Heikki





Re: [HACKERS] add modulo (%) operator to pgbench

2014-09-25 Thread Robert Haas
On Thu, Sep 25, 2014 at 1:27 AM, Gregory Smith gregsmithpg...@gmail.com wrote:
 On 9/24/14, 10:10 PM, Robert Haas wrote:
 I think you're making a mountain out of a molehill. I implemented this
 today in about three hours.

 I think you're greatly understating your own efficiency at shift/reducing
 parser mountains down to molehills.  Fabien even guessed the LOC size of the
 resulting patch with less than a 9% error.  That's some top notch software
 metrics and development work there boys; kudos all around.

Well, I blame Heikki.  I used to think that this kind of thing was
really hard, and a few years ago I might have had Fabien's reaction,
but then I saw Heikki bust out a shift/reduce parser for the isolation
tester in no time, so I decided it must not be that hard.  So it
proved.  I copied all the hard parts from other parts of the
PostgreSQL code base - my references were the isolation tester lexer
and parser, the contrib/seg parser, and the main parser.  I couldn't
do it that fast from scratch, not even close, but adapting something
that's already known to work is much easier.

 Let's get this operator support whipped into shape, then we can add the 2 to
 3 versions of the modulo operator needed to make the major use cases work.
 (There was never going to be just one hacked in with a quick patch that
 satisfied the multiple ways you can do this)

I don't think adding more versions of the modulo operator is the right
way forward: I think we should add ? : for conditionals and some kind
of function thing like abs(x).  Or maybe come up with a more
sophisticated rehashing algorithm and expose that directly as hash(x).
That's my whole reason for not wanting to adopt Fabien's approach in
the first place: I was cool with exposing C's modulo operator, but any
other modulo semantics seem like they should be built up from
general-purpose primitives.

Anyway, I think the first thing is that somebody needs to spend some
time testing, polishing, and documenting this patch, before we start
adding to it.  I'm hoping someone else will volunteer - other tasks
beckon.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Sloppy thinking about leakproof properties of opclass co-members

2014-09-25 Thread Robert Haas
On Wed, Sep 24, 2014 at 5:34 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 It strikes me that there's a significant gap in the whole leakproof
 function business, namely that no consideration has been given to
 planner-driven transformations of queries.  As an example, if we
 have a = b and b = c, the planner may generate and apply a = c
 instead of one or both of those clauses.  If a, b, c are not all the
 same type, a = c might involve an operator that's not either of the
 original ones.  And it's possible that that operator is not leakproof
 where the original ones are.  This could easily result in introducing
 non-leakproof operations into a secure subquery after pushdown of a
 clause that was marked secure.

 Another example is that in attempting to make implication or refutation
 proofs involving operator clauses, the planner feels free to apply other
 members of the operator's btree opclass (if it's in one).  I've not
 bothered to try to create a working exploit, but I'm pretty sure that
 this could result in a non-leakproof function being applied during
 planning of a supposedly secure subquery.  It might be that that couldn't
 leak anything worse than constant values within the query tree, but
 perhaps it could leak data values from a protected table's pg_statistic
 entries.

 ISTM that the most appropriate solution here is to insist that all or none
 of the members of an operator class be marked leakproof.  (Possibly we
 could restrict that to btree opclasses, but I'm not sure any exception is
 needed for other index types.)  I looked for existing violations of this
 precept, and unfortunately found a *lot*.  For example, texteq is marked
 leakproof but its fellow text comparison operators aren't.  Is that really
 sane?

Not really.  Fortunately, AFAICT, most if not all of these are in the
good direction: there are some things not marked leakproof that can be
so marked.  The reverse direction would be a hard-to-fix security
hole.  I think at some point somebody went through and tried to mark
all of the same-type equality operators as leakproof, and it seems
like that got expanded somewhat without fully rationalizing what we
had in pg_proc... mostly because I think nobody had a clear idea how
to do that.

I think your proposal here is a good one.  Heikki proposed treating
opclass functions as leakproof *in lieu of* adding a flag, but that
didn't seem good because we wanted to allow for the possibility of
cases where that wasn't true, and ensure individual scrutiny of each
case.  Your idea of making sure the flag is set consistently
throughout the opclass (opfamily?) is similar in spirit, but better in
detail.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] add modulo (%) operator to pgbench

2014-09-25 Thread Gregory Smith

On 9/25/14, 8:38 AM, Robert Haas wrote:
That's my whole reason for not wanting to adopt Fabien's approach in 
the first place: I was cool with exposing C's modulo operator, but any 
other modulo semantics seem like they should be built up from 
general-purpose primitives. 


Maybe; I don't quite understand his requirements well enough yet to know 
if that's possible, or if it's easier to give him a full special 
operator of his own.  But since what you did makes that easier, too, 
it's forward progress regardless.


Anyway, I think the first thing is that somebody needs to spend some 
time testing, polishing, and documenting this patch, before we start 
adding to it. I'm hoping someone else will volunteer - other tasks 
beckon. 


I've bounced it to here for you, and I expect to help with those parts, 
presumably in addition to Fabien's help: 
https://commitfest.postgresql.org/action/patch_view?id=1581





Re: [HACKERS] Spinlocks and compiler/memory barriers

2014-09-25 Thread Robert Haas
On Wed, Sep 24, 2014 at 2:45 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-09-09 17:54:03 -0400, Robert Haas wrote:
 So, that's committed, then. I think we should pick something that uses
 spinlocks and is likely to fail spectacularly if we haven't got this
 totally right yet, and de-volatilize it.  And then watch to see what
 turns red in the buildfarm and/or which users start screaming.  I'm
 inclined to propose lwlock.c as a candidate, since that's very widely
 used and a place where we know there's significant contention.

 Did you consider removing the volatiles from bufmgr.c? There's lots of
 volatiles in there and most of them don't seem to have been added in a
 principled way. I'm looking at my old patch for lockless pin/unpin of
 buffers and it'd look a lot cleaner without.

I hadn't thought of it, but it sounds like a good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-25 Thread David Johnston
On Thursday, September 25, 2014, Gregory Smith gregsmithpg...@gmail.com
wrote:

 On 9/25/14, 1:41 AM, David Johnston wrote:

 If the error message is written correctly most people upon seeing the
 error will simply fix their configuration and move on - regardless of
 whether they were proactive in doing so having read the release notes.


 The important part to realize here is that most people will never see such
 an error message.  There is a person/process who breaks the
 postgresql.conf, a process that asks for a configuration restart/reload
 (probably via pg_ctl), and then the postmaster process creating a
 server log entry that shows the error (maybe in pgstartup.log, maybe in
 pg_log, maybe in syslog, maybe in the Windows Event Log).

 In practice, the top of that food chain never knows what's happening at
 the bottom unless something goes so seriously wrong the system is down,
 [...]


And if the GUC setting here is wrong the system will be down, right?
Otherwise the attempt at changing the setting will fail, and so even if the
message itself is not seen the desired behavior of the system will remain
as it was - which is just as valid a decision as rounding to zero or 1.

Just like we don't take responsibility for people not reading release notes
or checking their configuration: if the DBA is not aware of where the GUCs
are being set and of the logging destinations, that's not our problem.

David J.


[HACKERS] Inefficient barriers on solaris with sun cc

2014-09-25 Thread Andres Freund
Hi,

Binaries compiled on solaris using sun studio cc currently don't have
compiler and memory barriers implemented. That means we fall back to
relatively slow generic implementations for those. Especially compiler,
read, write barriers will be much slower than necessary (since they all
just need to prevent compiler reordering as both sparc and x86 are run
in TSO mode under solaris).
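
For the record, a minimal sketch of what such an implementation could look
like, assuming the <mbarrier.h> intrinsics that Solaris Studio 12.2+ ships;
the pg_* names mirror our barrier macros, but this is untested and not a
proposed patch:

	#if defined(__SUNPRO_C) && !defined(__GNUC__)
	#include <mbarrier.h>
	/* Under TSO, read/write barriers only need to stop compiler reordering... */
	#define pg_compiler_barrier()   __compiler_barrier()
	#define pg_read_barrier()       __compiler_barrier()
	#define pg_write_barrier()      __compiler_barrier()
	/* ...while a full memory barrier still needs a real fence. */
	#define pg_memory_barrier()     __machine_rw_barrier()
	#endif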

Since my estimate is that we'll use more and more barriers, that's going
to hurt more and more.

I do *not* plan to do anything about it atm, I just thought it might be
helpful to have this stated somewhere searchable.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] RLS Design

2014-09-25 Thread Thom Brown
On 19 September 2014 17:54, Stephen Frost sfr...@snowman.net wrote:

 Thom,

 * Thom Brown (t...@linux.com) wrote:
  On 19 September 2014 17:32, Stephen Frost sfr...@snowman.net wrote:
   * Thom Brown (t...@linux.com) wrote:
On 14 September 2014 16:38, Stephen Frost sfr...@snowman.net wrote:
# create policy visible_colours on colours for all to joe using (visible = true);
CREATE POLICY
   [...]
 insert into colours (name, visible) values ('transparent',false);
ERROR:  new row violates WITH CHECK OPTION for colours
DETAIL:  Failing row contains (7, transparent, f).
   
 select * from pg_policies ;
    policyname    | tablename | roles | cmd |       qual       | with_check
 -----------------+-----------+-------+-----+------------------+------------
  visible_colours | colours   | {joe} | ALL | (visible = true) |
 (1 row)
   
There was no WITH CHECK OPTION.
  
   As I hope is clear if you look at the documentation- if the WITH CHECK
   clause is omitted, then the USING clause is used for both filtering and
   checking new records, otherwise you'd be able to add records which
   aren't visible to you.
 
  I can see that now, although I do find the error message somewhat
  confusing.  Firstly, it looks like OPTION is part of the parameter name,
  which it isn't.

 Hmm, the notion of 'with check option' is from the SQL standard, which
 is why I felt the error message was appropriate as-is..

  Also, I seem to get an error message with the following:
 
  # create policy nice_colours ON colours for all to joe using (visible =
  true) with check (name in ('blue','green','yellow'));
  CREATE POLICY
 
  \c - joe
 
   insert into colours (name, visible) values ('blue',false);
  ERROR:  function with OID 0 does not exist

 Now *that* one is interesting and I'll definitely go take a look at it.
 We added quite a few regression tests to try and make sure these things
 work.

  And if this did work, but I only violated the USING clause, would this
  still say the WITH CHECK clause was the cause?

 WITH CHECK applies for INSERT and UPDATE for the new records going into
 the table.  You can't actually violate the USING clause for an INSERT
 as USING is for filtering records, not checking that records being added
 to the table are valid.

 To try and clarify- by explicitly setting both USING and WITH CHECK, you
 *are* able to INSERT records which are not visible to you.  We felt that
 was an important capability to support.

I find it a bit of a limitation that I can't specify both INSERT and
UPDATE for a policy.  I'd want to be able to specify something like
this:

CREATE POLICY no_greys_allowed
  ON colours
  FOR INSERT, UPDATE
  WITH CHECK (name NOT IN ('grey','gray'));

I would expect this to be rather common to prevent certain values
making their way into a table.  Instead I'd have to create 2 policies
as it stands.



In order to debug issues with accessing table data, perhaps it would
be useful to output the name of the policy that was violated.  If a
table had 20 policies on, it could become time-consuming to debug.



I keep getting tripped up by overlapping policies.  On the one hand, I
created a policy to ensure rows being added or selected have a
visible column set to true.  On the other hand, I have a policy that
ensures that the name of a colour doesn't appear in a list.  Policy 1
is violated until policy 2 is added:

(using the table I created in a previous post on this thread...)

# create policy must_be_visible ON colours for all to joe using
(visible = true) with check (visible = true);
CREATE POLICY

\c - joe

 insert into colours (name, visible) values ('pink',false);
ERROR:  new row violates WITH CHECK OPTION for colours
DETAIL:  Failing row contains (28, pink, f).

\c - thom

# create policy no_greys_allowed on colours for insert with check
(name not in ('grey','gray'));
CREATE POLICY

\c - joe

# insert into colours (name, visible) values ('pink',false);
INSERT 0 1

I expected this to still trigger an error due to the first policy.  Am
I to infer from this that the policy model is permissive rather than
restrictive?


I've also attached a few corrections for the docs.

Thom
diff --git a/doc/src/sgml/ref/alter_policy.sgml b/doc/src/sgml/ref/alter_policy.sgml
index 37615fc..ab717f3 100644
--- a/doc/src/sgml/ref/alter_policy.sgml
+++ b/doc/src/sgml/ref/alter_policy.sgml
@@ -94,7 +94,7 @@ ALTER POLICY <replaceable class="parameter">name</replaceable> ON <replaceable c
   security-barrier qualification to queries which use the table
   automatically.  If multiple policies are being applied for a given
   table then they are all combined and added using OR.  The USING
-  expression applies to records which are being retrived from the table.
+  expression applies to records which are being retrieved from the table.
  </para>
 </listitem>
</varlistentry>
diff --git a/doc/src/sgml/ref/create_policy.sgml 

Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2014-09-25 Thread Simon Riggs
On 25 September 2014 10:41, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-09-25 10:24:39 +0530, Abhijit Menon-Sen wrote:
 At 2014-09-24 11:09:24 +0200, and...@2ndquadrant.com wrote:
  I think it's completely unacceptable to copy a visibility routine.

 OK. Which visibility routine should I use, and should I try to create a
 variant that doesn't set hint bits?

 I've not yet followed your premise that you actually need one that
 doesn't set hint bits...

Not least because I'm trying to solve a similar problem on another
thread, so no need to make a special case here.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] missing isinf declaration on solaris

2014-09-25 Thread Andres Freund
On 2014-09-24 17:39:19 -0300, Alvaro Herrera wrote:
 Tom Lane wrote:
  Peter Eisentraut pete...@gmx.net writes:
   On 9/24/14 9:21 AM, Tom Lane wrote:
   Agreed, but what about non-GCC compilers?
  
   Stick AC_PROG_CC_C99 into configure.in.
  
  I think that's a bad idea, unless you mean to do it only on Solaris.
  If we do that unconditionally, we will pretty much stop getting any
  warnings about C99-isms on modern platforms.  I am not aware that
  there has been any agreement to move our portability goalposts up
  to C99.
 
 AFAIK we cannot move all the way to C99, because MSVC doesn't support
 it.

FWIW, msvc has supported a good part of C99 for a long while. There are bits
and pieces it doesn't, but they're not things I think we're likely to
adopt. The most commonly complained-about one is C99 variable
declarations. I can't see PG adopting that tomorrow.

From VS 2013 onwards they're trying hard to be C99 and C11 compatible.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pgcrypto: PGP armor headers

2014-09-25 Thread Heikki Linnakangas

On 09/10/2014 04:35 PM, Marko Tiikkaja wrote:

On 9/10/14 1:38 PM, Heikki Linnakangas wrote:

On 09/10/2014 02:26 PM, Marko Tiikkaja wrote:

So I wonder if I shouldn't try and instead keep the
code closer to what it is in HEAD right now; I could call
enlargeStringInfo() first, then hand out a pointer to b64_encode (or
b64_decode()) and finally increment StringInfoData.len by how much was
actually written.  That would keep the code changes a lot smaller, too.


Yeah, that sounds reasonable.


OK, I've attempted to do that in the attached.  I'm pretty sure I didn't
get all of the overflows right, so someone should probably take a really
good look at it.  (I'm not too confident the original code got them
right either, but whatever).


Looks good, committed. It might've been a tad more efficient to return 
the StringInfo buffer directly from pgp_armor/dearmor, and avoid the 
extra palloc and memcpy, but this isn't performance critical enough for 
it to really matter.


Are you planning to post the main patch rebased on top of this soon? As 
in the next day or two? Otherwise I'll mark this as "Returned with 
feedback" for this commitfest.


- Heikki





Re: [HACKERS] Scaling shared buffer eviction

2014-09-25 Thread Robert Haas
On Tue, Sep 23, 2014 at 5:50 PM, Robert Haas robertmh...@gmail.com wrote:
 The patch I attached the first time was just the last commit in the
 git repository where I wrote the patch, rather than the changes that I
 made on top of that commit.  So, yes, the results from the previous
 message are with the patch attached to the follow-up.  I just typed
 the wrong git command when attempting to extract that patch to attach
 it to the email.

Here are some more results.  TL;DR: The patch still looks good, but we
should raise the number of buffer mapping partitions as well.

On the IBM POWER7 machine, I ran read-only pgbench tests with 1
client, 8 clients, and all multiples of 16 up to 96.  I ran these
tests at scale factor 1000 and scale factor 3000. I tested four
builds: master as of commit df4077cda2eae3eb4a5cf387da0c1e7616e73204,
that same commit with the number of buffer mapping partitions raised
to 128, that commit with reduce-replacement-locking.patch applied, and
that commit with reduce-replacement-locking.patch applied AND the
number of buffer mapping partitions raised to 128.  The results from
each configuration are reported in that order on each of the lines
below; each is the median of three results.  shared_buffers=8GB for
all tests.

scale factor 1000
1 8119.907618 8230.853237 8153.515217 8145.045004
8 65457.006762 65826.439701 65851.010116 65703.168020
16 125263.858855 125723.441853 125020.598728 129506.037997
32 176696.288187 182376.232631 178278.917581 186440.340283
48 193251.602743 214243.417591 197958.562641 226782.327868
64 182264.276909 218655.105894 190364.759052 256863.652885
80 171719.210488 203104.673512 179861.241080 274065.020956
96 162525.883898 190960.622943 169759.271356 277820.128782

scale factor 3000
1 7690.357053 7723.925932 7772.207513 7684.079850
8 60789.325087 61688.547446 61485.398967 62546.166411
16 112509.423777 115138.385501 115606.858594 120015.350112
32 147881.211900 161359.994902 153302.501020 173063.463752
48 129748.929652 153986.160920 136164.387103 204935.207578
64 114364.542340 132174.970721 116705.371890 224636.957891
80 101375.265389 117279.931095 102374.794412 232966.076908
96 93144.724830 106676.309224 92787.650325 233862.872939

Analysis:

1. To see the effect of reduce-replacement-locking.patch, compare the
first TPS number in each line to the third, or the second to the
fourth.  At scale factor 1000, the patch wins in all of the cases with
32 or more clients and exactly half of the cases with 1, 8, or 16
clients.  The variations at low client counts are quite small, and the
patch isn't expected to do much at low concurrency levels, so that's
probably just random variation.  At scale factor 3000, the situation
is more complicated.  With only 16 bufmappinglocks, the patch gets its
biggest win at 48 clients, and by 96 clients it's actually losing to
unpatched master.  But with 128 bufmappinglocks, it wins - often
massively - on everything but the single-client test, which is a small
loss, hopefully within experimental variation.

2. To see the effect of increasing the number of buffer mapping locks
to 128, compare the first TPS number in each line to the second, or
the third to the fourth.  Without reduce-replacement-locking.patch,
that's a win at every concurrency level at both scale factors.  With
that patch, the 1 and 8 client tests are small losses at scale factor
1000, and the 1 client test is a small loss at scale factor 3000.

The single-client results, which are often a concern for scalability
patches, bear a bit of further comment.  In this round of testing,
either patch alone improved things slightly, and both patches together
made them slightly worse.  Even if that is reproducible, I don't think
it should be cause for concern, because it tends to indicate (at least
to me) that the shifting around is just the result of slightly
different placement of code across cache lines, or other minor factors
we can't really pin down.  So I'm inclined to (a) push
reduce-replacement-locking.patch and then also (b) bump up the number
of buffer mapping locks to 128 (and increase MAX_SIMUL_LWLOCKS
accordingly so that pg_buffercache doesn't get unhappy).

Comments?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pgcrypto: PGP armor headers

2014-09-25 Thread Marko Tiikkaja

On 9/25/14 3:50 PM, Heikki Linnakangas wrote:

On 09/10/2014 04:35 PM, Marko Tiikkaja wrote:

OK, I've attempted to do that in the attached.  I'm pretty sure I didn't
get all of the overflows right, so someone should probably take a really
good look at it.  (I'm not too confident the original code got them
right either, but whatever).


Looks good, committed.


Thanks!


It might've been a tad more efficient to return
the StringInfo buffer directly from pgp_armor/dearmor, and avoid the
extra palloc and memcpy, but this isn't performance critical enough for
it to really matter.


I couldn't see any way of doing that without breaking the VARDATA 
abstraction.  I even went looking for similar cases in the source code, 
but couldn't find any.  If you can come up with a way, feel free to 
change that -- I'd like to learn myself.


But like you, I didn't consider it too important.


Are you planning to post the main patch rebased on top of this soon? As
in the next day or two? Otherwise I'll mark this as "Returned with
feedback" for this commitfest.


Yes.  With good luck I'll get you a rebased one today, otherwise it'll 
have to wait until tomorrow.



.marko




Re: [HACKERS] missing isinf declaration on solaris

2014-09-25 Thread Oskari Saarenmaa

On 24.09.2014 23:26, Tom Lane wrote:

Peter Eisentraut pete...@gmx.net writes:

On 9/24/14 9:21 AM, Tom Lane wrote:

Agreed, but what about non-GCC compilers?



Stick AC_PROG_CC_C99 into configure.in.


I think that's a bad idea, unless you mean to do it only on Solaris.
If we do that unconditionally, we will pretty much stop getting any
warnings about C99-isms on modern platforms.  I am not aware that
there has been any agreement to move our portability goalposts up
to C99.


We don't currently try to select a C89 mode in configure.in, we just use 
the default mode which may be C89 or C99 or something in between.


GCC docs used to say that once C99 support is complete it'll switch 
defaults from gnu90 to gnu99, now the latest docs say that the default 
will change to gnu11 at some point 
(https://gcc.gnu.org/onlinedocs/gcc/Standards.html).  Solaris Studio 
already defaults to C99 and it looks like the latest versions of MSVC 
also support it.


I think we should just enable C99 mode when possible to use the 
backwards compatible features of it (like isinf).  If C89 support is 
still needed we should set up a new buildfarm animal that really uses a 
C89 mode compiler and makes sure it compiles without warnings.


/ Oskari




Re: [HACKERS] missing isinf declaration on solaris

2014-09-25 Thread Alvaro Herrera
Andres Freund wrote:
 On 2014-09-24 17:39:19 -0300, Alvaro Herrera wrote:

  AFAIK we cannot move all the way to C99, because MSVC doesn't support
  it.
 
 FWIW, msvc has supported a good part of C99 for long while. There's bits
 and pieces it doesn't, but it's not things I think we're likely to
 adopt. The most commonly complained about one is C99 variable
 declarations. I can't see PG adopting that tomorrow.

I got unlucky that I got bitten by precisely that missing feature twice,
I guess.

 From VS 2013 onwards they're trying hard to be C99 and C11 compatible.

Sounds great.  Is VS2013 released already?  If so, maybe we can think
about moving to C99 in 2016 or so; at least assuming you can build for
older Windows versions with the new compiler.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] a fast bloat measurement tool (was Re: Measuring relation free space)

2014-09-25 Thread Andres Freund
On 2014-09-25 14:43:14 +0100, Simon Riggs wrote:
 On 25 September 2014 10:41, Andres Freund and...@2ndquadrant.com wrote:
  On 2014-09-25 10:24:39 +0530, Abhijit Menon-Sen wrote:
  At 2014-09-24 11:09:24 +0200, and...@2ndquadrant.com wrote:
   I think it's completely unacceptable to copy a visibility routine.
 
  OK. Which visibility routine should I use, and should I try to create a
  variant that doesn't set hint bits?
 
  I've not yet followed your premise that you actually need one that
  doesn't set hint bits...
 
 Not least because I'm trying to solve a similar problem on another
 thread, so no need to make a special case here.

That's mostly unrelated though - Abhijit wants to avoid them because he
tried to avoid having *any* form of lock on the buffer. That's the
reason he tried to avoid hint bit setting. Since I don't believe that's
safe (at least there's by far not enough evidence about it), there's
simply no reason to avoid it.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] missing isinf declaration on solaris

2014-09-25 Thread Andres Freund
On 2014-09-25 10:56:56 -0300, Alvaro Herrera wrote:
 Andres Freund wrote:
  From VS 2013 onwards they're trying hard to be C99 and C11 compatible.
 
 Sounds great.  Is VS2013 released already?

Yes.

 If so, maybe we can think about moving to C99 in 2016 or so; at least
 assuming you can build for

I think we should simply pick and choose the features we want. And go
for it now. Besides the fact that one of them wasn't applied, I'd sent
patches a couple months back that'd allow setting up a buildfarm animal
failing for any patch going above C89 + chosen features.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Scaling shared buffer eviction

2014-09-25 Thread Merlin Moncure
On Thu, Sep 25, 2014 at 8:51 AM, Robert Haas robertmh...@gmail.com wrote:
 1. To see the effect of reduce-replacement-locking.patch, compare the
 first TPS number in each line to the third, or the second to the
 fourth.  At scale factor 1000, the patch wins in all of the cases with
 32 or more clients and exactly half of the cases with 1, 8, or 16
 clients.  The variations at low client counts are quite small, and the
 patch isn't expected to do much at low concurrency levels, so that's
 probably just random variation.  At scale factor 3000, the situation
 is more complicated.  With only 16 bufmappinglocks, the patch gets its
 biggest win at 48 clients, and by 96 clients it's actually losing to
 unpatched master.  But with 128 bufmappinglocks, it wins - often
 massively - on everything but the single-client test, which is a small
 loss, hopefully within experimental variation.

 Comments?

Why stop at 128 mapping locks?   Theoretical downsides to having more
mapping locks have been mentioned a few times but has this ever been
measured?  I'm starting to wonder if the # mapping locks should be
dependent on some other value, perhaps the # of shared buffers...

merlin




Re: [HACKERS] pgcrypto: PGP armor headers

2014-09-25 Thread Heikki Linnakangas

On 09/25/2014 04:56 PM, Marko Tiikkaja wrote:

On 9/25/14 3:50 PM, Heikki Linnakangas wrote:

On 09/10/2014 04:35 PM, Marko Tiikkaja wrote:
It might've been a tad more efficient to return
the StringInfo buffer directly from pgp_armor/dearmor, and avoid the
extra palloc and memcpy, but this isn't performance critical enough for
it to really matter.


I couldn't see any way of doing that without breaking the VARDATA
abstraction.  I even went looking for similar cases in the source code,
but couldn't find any.  If you can come up with a way, feel free to
change that -- I'd like to learn myself.


You could first append VARHDRSZ zeros to the StringInfo, then append the 
base64-encoded data, and last replace the zeros with the real length, 
using SET_VARSIZE.
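
A minimal sketch of that suggestion (mine, not the committed code; note it
bakes in the assumption that VARDATA() starts exactly VARHDRSZ bytes into
the datum, which is exactly the point questioned below):

	StringInfoData buf;
	text	   *result;

	initStringInfo(&buf);
	appendStringInfoSpaces(&buf, VARHDRSZ);	/* placeholder header bytes */
	/* ... append the base64-encoded data to buf here ... */
	result = (text *) buf.data;
	SET_VARSIZE(result, buf.len);		/* stamp the real length */
	PG_RETURN_TEXT_P(result);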



Are you planning to post the main patch rebased on top of this soon? As
in the next day or two? Otherwise I'll mark this as Returned with
feedback for this commitfest.


Yes.  With good luck I'll get you a rebased one today, otherwise it'll
have to wait until tomorrow.


Ok, I'll leave this in "Waiting on Author" state then.

- Heikki





Re: [HACKERS] Scaling shared buffer eviction

2014-09-25 Thread Robert Haas
On Thu, Sep 25, 2014 at 10:02 AM, Merlin Moncure mmonc...@gmail.com wrote:
 On Thu, Sep 25, 2014 at 8:51 AM, Robert Haas robertmh...@gmail.com wrote:
 1. To see the effect of reduce-replacement-locking.patch, compare the
 first TPS number in each line to the third, or the second to the
 fourth.  At scale factor 1000, the patch wins in all of the cases with
 32 or more clients and exactly half of the cases with 1, 8, or 16
 clients.  The variations at low client counts are quite small, and the
 patch isn't expected to do much at low concurrency levels, so that's
 probably just random variation.  At scale factor 3000, the situation
 is more complicated.  With only 16 bufmappinglocks, the patch gets its
 biggest win at 48 clients, and by 96 clients it's actually losing to
 unpatched master.  But with 128 bufmappinglocks, it wins - often
 massively - on everything but the single-client test, which is a small
 loss, hopefully within experimental variation.

 Comments?

 Why stop at 128 mapping locks?   Theoretical downsides to having more
 mapping locks have been mentioned a few times but has this ever been
 measured?  I'm starting to wonder if the # mapping locks should be
  dependent on some other value, perhaps the # of shared buffers...

Good question.  My belief is that the number of buffer mapping locks
required to avoid serious contention will be roughly proportional to
the number of hardware threads.  At the time the value 16 was chosen,
there were probably not more than 8-core CPUs in common use; but now
we've got a machine with 64 hardware threads and, what do you know, it
wants 128 locks.
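
For readers following along, the buffer mapping table is partitioned and
each lookup takes only the lock its buffer tag hashes to; a standalone
sketch of the idea, with hypothetical names rather than bufmgr's actual API:

	#include <stdint.h>
	#include <pthread.h>

	#define NUM_PARTITIONS 128		/* the value under discussion */

	static pthread_mutex_t partition_lock[NUM_PARTITIONS];

	/*
	 * Pick a partition the way the buffer mapping table does: hash the
	 * buffer tag, then take hashcode % number-of-partitions, so unrelated
	 * lookups rarely contend on the same lock.
	 */
	static pthread_mutex_t *
	partition_for(uint32_t hashcode)
	{
		return &partition_lock[hashcode % NUM_PARTITIONS];
	}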

I think the long-term solution here is that we need a lock-free hash
table implementation for our buffer mapping tables, because I'm pretty
sure that just cranking the number of locks up and up is going to
start to have unpleasant side effects at some point.  We may be able
to buy a few more years by just cranking it up, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pgcrypto: PGP armor headers

2014-09-25 Thread Marko Tiikkaja

On 9/25/14 4:08 PM, Heikki Linnakangas wrote:

On 09/25/2014 04:56 PM, Marko Tiikkaja wrote:

On 9/25/14 3:50 PM, Heikki Linnakangas wrote:

On 09/10/2014 04:35 PM, Marko Tiikkaja wrote:
It might've been a tad more efficient to return
the StringInfo buffer directly from pgp_armor/dearmor, and avoid the
extra palloc and memcpy, but this isn't performance critical enough for
it to really matter.


I couldn't see any way of doing that without breaking the VARDATA
abstraction.  I even went looking for similar cases in the source code,
but couldn't find any.  If you can come up with a way, feel free to
change that -- I'd like to learn myself.


You could first append VARHDRSZ zeros to the StringInfo, then append the
base64-encoded data, and last replace the zeros with the real length,
using SET_VARSIZE.


That's assuming that VARDATA() is at exactly VARHDRSZ bytes.  I couldn't 
find any callers making that assumption.



.marko




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-25 Thread Simon Riggs
On 28 August 2014 03:43, Peter Geoghegan p...@heroku.com wrote:

 The patch currently lacks a way of referencing datums rejected for
 insertion when updating. The way MySQL handles the issue seems
 questionable. They allow you to do something like this:

 INSERT INTO upsert (key, val) VALUES (1, 'val') ON DUPLICATE KEY UPDATE
 val = VALUES(val);

 The implication is that the updated value comes from the INSERT's
 VALUES() list, but emulating that seems like a bad idea. In general,
 at least with Postgres it's entirely possible that values rejected
 differ from the values appearing in the VALUES() list, due to the
 effects of before triggers. I'm not sure whether or not we should
 assume equivalent transformations during any UPDATE before triggers.

 This is an open item. I think it makes sense to deal with it a bit later.

IMHO it is impossible to know if any of the other code is correct
until we have a clear and stable vision of what the command is
supposed to perform.

The inner workings are less important than what the feature does.

FWIW, the row available at the end of all BEFORE triggers is clearly
the object we should be manipulating, not the original VALUES()
clause. Otherwise this type of INSERT would behave differently from
normal INSERTs. Which would likely violate RLS, if nothing else.


 As I mentioned, I have incorporated feedback from Kevin Grittner. You
 may specify a unique index to merge on from within the INSERT
 statement, thus avoiding the risk of inadvertently having the update
 affect the wrong tuple due to the user failing to consider that there
 was a would-be unique violation within some other unique index
 constraining some other attribute. You may write the DML statement
 like this:

 INSERT INTO upsert(key, val) VALUES(1, 'insert') ON CONFLICT WITHIN
 upsert_pkey UPDATE SET val = 'update';

 I think that there is a good chance that at least some people will
 want to make this mandatory. I guess that's fair enough, but I
 *really* don't want to *mandate* that users specify the name of their
 unique index in DML for obvious reasons. Perhaps we can come up with a
 more tasteful syntax that covers all interesting cases (consider the
 issues with partial unique indexes and before triggers for example,
 where a conclusion reached about which index to use during parse
 analysis may subsequently be invalidated by user-defined code, or
 ambiguous specifications in the face of overlapping attributes between
 two unique composite indexes, etc). The Right Thing is far from
 obvious, and there is very little to garner from other systems, since
 SQL MERGE promises essentially nothing about concurrency, both as
 specified by the standard and in practice. You don't need a unique
 index at all, and as I showed in my pgCon talk, there are race
 conditions even for a trivial UPSERT operations in all major SQL MERGE
 implementations.

Surely if there are multiple unique indexes then the result row must
be validated against all unique indexes before it is allowed at all?

The only problem I see is if the newly inserted row matches one row on
one unique value and a different row on a different unique index.
Turning the INSERT into an UPDATE will still fail on one or the other, no
matter which index you pick. If there is one row for ALL unique
indexes then it is irrelevant which index you pick. So either way, I
cannot see a reason to specify an index.

If we do need such a construct, we have already the concept of an
IDENTITY for a table, added in 9.4, currently targeted at replication.
Listing indexes or columns in the DML statement is more pushups for
developers and ORMs, so let's KISS.


The way forwards, in my view, is to define precisely the behaviour we
wish to have. That definition will include the best current mechanism
for running an UPSERT using INSERT/UPDATE/loops and comparing that
against what is being provided here. We will then have a functional
test of equivalence of the approaches, and a basis for making a
performance test that shows that performance is increased without any
loss of concurrency.

Once we have that, we can then be certain our time spent on internals
is not wasted by overlooking a simple userland gotcha.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Inefficient barriers on solaris with sun cc

2014-09-25 Thread Robert Haas
On Thu, Sep 25, 2014 at 9:34 AM, Andres Freund and...@2ndquadrant.com wrote:
 Binaries compiled on solaris using sun studio cc currently don't have
 compiler and memory barriers implemented. That means we fall back to
 relatively slow generic implementations for those. Especially compiler,
 read, write barriers will be much slower than necessary (since they all
 just need to prevent compiler reordering as both sparc and x86 are run
 in TSO mode under solaris).

 Since my estimate is that we'll use more and more barriers, that's going
 to hurt more and more.

 I do *not* plan to do anything about it atm, I just thought it might be
 helpful to have this stated somewhere searchable.

To put that another way:

If there are any Sun Studio users out there who care about performance
on big iron, please send a patch to fix this...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Scaling shared buffer eviction

2014-09-25 Thread Andres Freund
On 2014-09-25 09:51:17 -0400, Robert Haas wrote:
 On Tue, Sep 23, 2014 at 5:50 PM, Robert Haas robertmh...@gmail.com wrote:
  The patch I attached the first time was just the last commit in the
  git repository where I wrote the patch, rather than the changes that I
  made on top of that commit.  So, yes, the results from the previous
  message are with the patch attached to the follow-up.  I just typed
  the wrong git command when attempting to extract that patch to attach
  it to the email.
 
 Here are some more results.  TL;DR: The patch still looks good, but we
 should raise the number of buffer mapping partitions as well.

  So I'm inclined to (a) push
 reduce-replacement-locking.patch and then also (b) bump up the number
 of buffer mapping locks to 128 (and increase MAX_SIMUL_LWLOCKS
 accordingly so that pg_buffercache doesn't get unhappy).

I'm happy with that. I don't think it's likely that a moderate increase
in the number of mapping lwlocks will be noticeably bad for any
workload.

One difference is that the total number of lwlock acquirations will be a
bit higher because currently it's more likely for the old and new to
fall into different partitions. But that's not really significant.

The other difference is the number of cachelines touched. Currently, in
concurrent workloads, there's already lots of L1 cache misses around the
buffer mapping locks because they're exclusively owned by a different
core/socket. So, to be effectively worse, the increase would need to
lead to lower overall cache hit rates by them not being in *any* cache
or displacing other content.

That leads me to wonder: Have you measured different, lower numbers of
buffer mapping locks? 128 locks is, if we align them properly as we should,
8KB of memory. Common L1 cache sizes are around 32k...
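
To make the arithmetic concrete, here is a back-of-the-envelope sketch
loosely modeled on PostgreSQL's LWLockPadded idea; the struct contents
are placeholders, only the sizing matters:

    #define CACHE_LINE_SIZE        64
    #define NUM_BUFFER_PARTITIONS  128

    typedef union LWLockPadded
    {
        struct
        {
            volatile int mutex;        /* stand-in for the real LWLock fields */
            int          shared_count;
            int          exclusive_count;
        } lock;
        char pad[CACHE_LINE_SIZE];     /* one lock per line, no false sharing */
    } LWLockPadded;

    static LWLockPadded BufMappingPartitionLocks[NUM_BUFFER_PARTITIONS];
    /* sizeof(BufMappingPartitionLocks) == 128 * 64 == 8192 bytes */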

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Scaling shared buffer eviction

2014-09-25 Thread Andres Freund
On 2014-09-25 09:02:25 -0500, Merlin Moncure wrote:
 On Thu, Sep 25, 2014 at 8:51 AM, Robert Haas robertmh...@gmail.com wrote:
  1. To see the effect of reduce-replacement-locking.patch, compare the
  first TPS number in each line to the third, or the second to the
  fourth.  At scale factor 1000, the patch wins in all of the cases with
  32 or more clients and exactly half of the cases with 1, 8, or 16
  clients.  The variations at low client counts are quite small, and the
  patch isn't expected to do much at low concurrency levels, so that's
  probably just random variation.  At scale factor 3000, the situation
  is more complicated.  With only 16 bufmappinglocks, the patch gets its
  biggest win at 48 clients, and by 96 clients it's actually losing to
  unpatched master.  But with 128 bufmappinglocks, it wins - often
  massively - on everything but the single-client test, which is a small
  loss, hopefully within experimental variation.
 
  Comments?
 
 Why stop at 128 mapping locks?   Theoretical downsides to having more
 mapping locks have been mentioned a few times but has this ever been
 measured?  I'm starting to wonder if the # mapping locks should be
 dependent on some other value, perhaps the # of shared buffers...

Wrong way round. You need to prove the upside of increasing it further,
not the contrary. The primary downside is cache hit ratio and displacing
other cache entries...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Scaling shared buffer eviction

2014-09-25 Thread Robert Haas
On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund and...@2ndquadrant.com wrote:
 That leads me to wonder: Have you measured different, lower numbers of
 buffer mapping locks? 128 locks is, if we align them properly as we should,
 8KB of memory. Common L1 cache sizes are around 32k...

Amit has some results upthread showing 64 being good, but not as good
as 128.  I haven't verified that myself, but have no reason to doubt
it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Scaling shared buffer eviction

2014-09-25 Thread Andres Freund
On 2014-09-25 10:22:47 -0400, Robert Haas wrote:
 On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  That leads me to wonder: Have you measured different, lower numbers of
  buffer mapping locks? 128 locks is, if we align them properly as we should,
  8KB of memory. Common L1 cache sizes are around 32k...
 
 Amit has some results upthread showing 64 being good, but not as good
 as 128.  I haven't verified that myself, but have no reason to doubt
 it.

How about you push the spinlock change and I crosscheck the partition
number on a multi-socket x86 machine? Seems worthwhile to make sure that
it doesn't cause problems on x86. I seriously doubt it will, but ...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] RLS Design

2014-09-25 Thread Stephen Frost
Thom,

* Thom Brown (t...@linux.com) wrote:
 I find it a bit of a limitation that I can't specify both INSERT and
 UPDATE for a policy.  I'd want to be able to specify something like
 this:
 
 CREATE POLICY no_greys_allowed
   ON colours
   FOR INSERT, UPDATE
   WITH CHECK (name NOT IN ('grey','gray'));
 
 I would expect this to be rather common to prevent certain values
 making their way into a table.  Instead I'd have to create 2 policies
 as it stands.

That's not actually the case...

CREATE POLICY no_greys_allowed
  ON colours
  FOR ALL
  USING (true) -- assuming this is what you intended
  WITH CHECK (name NOT IN ('grey','gray'));

Right?  That said, I'm not against the idea of supporting multiple
commands with one policy (similar to how ALL is done).  It wouldn't be
difficult or much of a change- make the 'cmd' a bitfield instead.  If
others feel the same then I'll look at doing that.
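
For illustration, a sketch of what that could look like (names are
hypothetical, not from the patch): each command becomes one bit, and a
policy stores the OR of the commands it covers:

    #include <stdbool.h>

    #define POLICY_CMD_SELECT  (1 << 0)
    #define POLICY_CMD_INSERT  (1 << 1)
    #define POLICY_CMD_UPDATE  (1 << 2)
    #define POLICY_CMD_DELETE  (1 << 3)

    /* CREATE POLICY ... FOR INSERT, UPDATE would then store: */
    /*     cmd = POLICY_CMD_INSERT | POLICY_CMD_UPDATE;       */

    /* the applicability check becomes a simple mask test */
    static bool
    policy_applies(int policy_cmds, int current_cmd)
    {
        return (policy_cmds & current_cmd) != 0;
    }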

 In order to debug issues with accessing table data, perhaps it would
 be useful to output the name of the policy that was violated.  If a
 table had 20 policies on, it could become time-consuming to debug.

Good point.  That'll involve a bit more as I'll need to look at the
existing with check options structure, but I believe it's just adding
the field to the structure, populating it when adding the WCO entries,
and then checking for it in the ereport() call.  The policy name is
already stashed in the relcache entry, so it's already pretty easily
available.

 I keep getting tripped up by overlapping policies.  On the one hand, I
 created a policy to ensure rows being added or selected have a
 visible column set to true.  On the other hand, I have a policy that
 ensures that the name of a colour doesn't appear in a list.  Policy 1
 is violated until policy 2 is added:
 
 (using the table I created in a previous post on this thread...)
 
 # create policy must_be_visible ON colours for all to joe using
 (visible = true) with check (visible = true);
 CREATE POLICY
 
 \c - joe
 
  insert into colours (name, visible) values ('pink',false);
 ERROR:  new row violates WITH CHECK OPTION for colours
 DETAIL:  Failing row contains (28, pink, f).
 
 \c - thom
 
 # create policy no_greys_allowed on colours for insert with check
 (name not in ('grey','gray'));
 CREATE POLICY
 
 \c - joe
 
 # insert into colours (name, visible) values ('pink',false);
 INSERT 0 1
 
 I expected this to still trigger an error due to the first policy.  Am
 I to infer from this that the policy model is permissive rather than
 restrictive?

That's correct and I believe pretty clear in the documentation- policies
are OR'd together, just the same as how roles are handled.  As a
logged-in user, you have the rights of all of the roles you are a member
of (subject to inheritance rules, of course), and similarly, you are
able to view and add all rows which match any policy which applies to
you (either through role membership or through different policies).
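
A minimal sketch of that permissive (OR) combination, with hypothetical
types and names; a row passes if any one applicable policy's USING
expression passes:

    #include <stdbool.h>

    typedef struct Policy
    {
        bool (*using_expr)(const void *row);  /* stand-in for the real qual */
    } Policy;

    static bool
    row_visible(const Policy *policies, int npolicies, const void *row)
    {
        for (int i = 0; i < npolicies; i++)
        {
            if (policies[i].using_expr(row))
                return true;    /* one matching policy is enough */
        }
        return false;           /* no applicable policy admits the row */
    }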

 I've also attached a few corrections for the docs.

Thanks!  I'll plan to include these with a few other typos and the fix
for the bug that Andres pointed out, once I finish testing (and doing
another CLOBBER_CACHE_ALWAYS run..).

Thanks again,

Stephen




Re: [HACKERS] Index scan optimization

2014-09-25 Thread Simon Riggs
On 22 September 2014 14:46, Heikki Linnakangas hlinnakan...@vmware.com wrote:
 On 09/22/2014 04:45 PM, Tom Lane wrote:

 Heikki Linnakangas hlinnakan...@vmware.com writes:

 On 09/22/2014 07:47 AM, Rajeev rastogi wrote:

 So my proposal is to skip the condition check on the first scan key
 condition for every tuple.


 The same happens in a single-column case. If you have a query like
 SELECT * FROM tbl2 where id2  'a', once you've found the start
 position of the scan, you know that all the rows that follow match too.


 ... unless you're doing a backwards scan.


 Sure. And you have to still check for NULLs. Have to get the details right..

And also that the tests don't use volatile functions.
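
Pulling those caveats together, a hedged sketch of the skip condition
(names are illustrative, not from the proposed patch):

    #include <stdbool.h>

    /* After positioning a forward scan with the leading boundary key,
     * the key holds for all later tuples -- unless one of these applies. */
    static bool
    first_key_check_needed(bool forward_scan,
                           bool key_is_volatile,
                           bool value_is_null)
    {
        if (value_is_null)
            return true;    /* NULLs must still be filtered explicitly */
        if (!forward_scan)
            return true;    /* a backward scan leaves the matched range */
        if (key_is_volatile)
            return true;    /* the result could change between calls */
        return false;       /* safe to skip re-evaluating the first key */
    }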

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-25 Thread Robert Haas
On Thu, Sep 25, 2014 at 10:12 AM, Simon Riggs si...@2ndquadrant.com wrote:
 IMHO it is impossible to know if any of the other code is correct
 until we have a clear and stable vision of what the command is
 supposed to perform.

+1.

 The inner workings are less important than what the feature does.

+1.

 FWIW, the row available at the end of all BEFORE triggers is clearly
 the object we should be manipulating, not the original VALUES()
 clause. Otherwise this type of INSERT would behave differently from
 normal INSERTs. Which would likely violate RLS, if nothing else.

+1.

 Surely if there are multiple unique indexes then the result row must
 be validated against all unique indexes before it is allowed at all?

 The only problem I see is if the newly inserted row matches one row on
 one unique value and a different row on a different unique index.
 Turning the INSERT into an UPDATE will still fail on one or other, no
 matter which index you pick. If there is one row for ALL unique
 indexes then it is irrelevant which index you pick. So either way, I
 cannot see a reason to specify an index.

Failure could be the right thing in some cases.  For example, imagine
that a user has a table containing names, email addresses, and (with
apologies for the American-ism, but I don't know what would be
comparable elsewhere) social security numbers.  The user has unique
indexes on both email addresses and SSNs.  If a new record arrives for
the same email address, they want to replace the existing record; but if
a new record arrives with the same SSN, they want the transaction to
fail.  Otherwise, a newly-arrived record might overwrite the email
address of an existing record, which they never want to do, because
they view email address as the primary key.

I think this kind of scenario will actually be pretty common.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Scaling shared buffer eviction

2014-09-25 Thread Andres Freund
On 2014-09-25 10:09:30 -0400, Robert Haas wrote:
 I think the long-term solution here is that we need a lock-free hash
 table implementation for our buffer mapping tables, because I'm pretty
 sure that just cranking the number of locks up and up is going to
 start to have unpleasant side effects at some point.  We may be able
 to buy a few more years by just cranking it up, though.

I think mid to long term we actually need something other than a
hashtable: something capable of efficiently checking for the existence of
'neighboring' buffers so we can intelligently prefetch far enough that
the read actually completes when we get there. Also I'm pretty sure that
we'll need a way to efficiently remove all buffers for a relfilenode
from shared buffers - linearly scanning for that isn't a good
solution. So I think we need a different data structure.

I've played a bit around with just replacing buf_table.c with a custom
handrolled hashtable because I've seen more than one production workload
where hash_search_with_hash_value() is the top profile entry, both in
CPU time and in cache misses, with most calls coming from the buffer
mapping and then from the lock manager.

There's two reasons for that: a) dynahash just isn't very good and it
does a lot of things that will never be necessary for these hashes. b)
the key into the hash table is *far* too wide. A significant portion of
the time is spent comparing buffer/lock tags.
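
For concreteness, the 9.4-era buffer tag, essentially as in
buf_internals.h (typedefs inlined here to keep the sketch
self-contained): every lookup hashes and memcmp's these 20 bytes:

    typedef unsigned int Oid;
    typedef unsigned int BlockNumber;
    typedef int ForkNumber;        /* really an enum: main, fsm, vm, init */

    typedef struct RelFileNode
    {
        Oid spcNode;               /* tablespace */
        Oid dbNode;                /* database */
        Oid relNode;               /* relation */
    } RelFileNode;

    typedef struct BufferTag
    {
        RelFileNode rnode;         /* physical relation identifier */
        ForkNumber  forkNum;
        BlockNumber blockNum;      /* block number within the fork */
    } BufferTag;                   /* 5 x 4 bytes = a 20-byte hash key */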

The aforementioned replacement hash table was a good bit faster for
fully cached workloads - but at the time I wrote it I could still make it
crash in very high cache pressure workloads, so that should be taken
with a fair bit of salt.

I think we can comparatively easily get rid of the tablespace in buffer
tags. Getting rid of the database already would be a fair bit harder. I
haven't really managed to get an idea how to remove the fork number
without making the catalog much more complicated.  I don't think we can
go too long without at least some of these steps :(.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Scaling shared buffer eviction

2014-09-25 Thread Merlin Moncure
On Thu, Sep 25, 2014 at 9:14 AM, Andres Freund and...@2ndquadrant.com wrote:
 Why stop at 128 mapping locks?   Theoretical downsides to having more
 mapping locks have been mentioned a few times but has this ever been
 measured?  I'm starting to wonder if the # mapping locks should be
  dependent on some other value, perhaps the # of shared buffers...

 Wrong way round. You need to prove the upside of increasing it further,
 not the contrary. The primary downside is cache hit ratio and displacing
 other cache entries...

I can't do that because I don't have the hardware.  I wasn't
suggesting to just set it but to measure the effects of setting it.
But the benefits from going from 16 to 128 are pretty significant at
least on this hardware; I'm curious how much further it can be
pushed...what's wrong with trying it out?

merlin




Re: [HACKERS] Scaling shared buffer eviction

2014-09-25 Thread Robert Haas
On Thu, Sep 25, 2014 at 10:24 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-09-25 10:22:47 -0400, Robert Haas wrote:
 On Thu, Sep 25, 2014 at 10:14 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  That leads me to wonder: Have you measured different, lower numbers of
  buffer mapping locks? 128 locks is, if we align them properly as we should,
  8KB of memory. Common L1 cache sizes are around 32k...

 Amit has some results upthread showing 64 being good, but not as good
 as 128.  I haven't verified that myself, but have no reason to doubt
 it.

 How about you push the spinlock change and I crosscheck the partition
 number on a multi-socket x86 machine? Seems worthwhile to make sure that
 it doesn't cause problems on x86. I seriously doubt it will, but ...

OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Scaling shared buffer eviction

2014-09-25 Thread Andres Freund
On 2014-09-25 09:34:57 -0500, Merlin Moncure wrote:
 On Thu, Sep 25, 2014 at 9:14 AM, Andres Freund and...@2ndquadrant.com wrote:
  Why stop at 128 mapping locks?   Theoretical downsides to having more
  mapping locks have been mentioned a few times but has this ever been
  measured?  I'm starting to wonder if the # mapping locks should be
  dependent on some other value, perhaps the # of shared buffers...
 
  Wrong way round. You need to prove the upside of increasing it further,
  not the contrary. The primary downside is cache hit ratio and displacing
  other cache entries...
 
 I can't do that because I don't have the hardware.

One interesting part of this is making sure it doesn't regress
older/smaller machines. So at least that side you could check...

 what's wrong with trying it out?

If somebody is willing to do it: nothing. I'd just much rather do the,
by now proven, simple change before starting with more complex
solutions.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-25 Thread Robert Haas
On Thu, Sep 25, 2014 at 12:46 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 TBH I've also been wondering whether any of these proposed cures are
 better than the disease.

I couldn't agree more.  There's something to be said for just leaving
this alone.

 The changes that can be argued to make the
 behavior more sane are also ones that introduce backwards compatibility
 issues of one magnitude or another.

But on this point I think David Johnston said it best:

# Any change has the potential to draw complaints.  For you it seems that "hey,
# I upgraded to 9.5 and my logs are being rotated out every minute now.  I
# thought I had that turned off" is the desired complaint.  Greg wants: "hey, my
# 1 hour log rotation is now happening every minute."  If the error message is
# written correctly most people upon seeing the error will simply fix their
# configuration and move on - regardless of whether they were proactive in
# doing so having read the release notes.

I particularly agree with his first sentence - any change can
potentially draw complaints.  But I also agree with his last one - of
those three possible complaints, I certainly prefer "I had to fix my
configuration file for the new, stricter validation" over any variant
of "my configuration file still worked but it did something
surprisingly different from what it used to do."

YMMV.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Immediate standby promotion

2014-09-25 Thread Robert Haas
On Wed, Sep 24, 2014 at 4:36 PM, Simon Riggs si...@2ndquadrant.com wrote:
 To me, being able to say pg_ctl promote_right_now -m yes_i_mean_it
 seems like a friendlier interface than making somebody shut down the
 server, run pg_resetxlog, and start it up again.

 It makes sense to go from paused --> promoted.

Well, we could certainly require that people call
pg_xlog_replay_pause() before they call pg_promote_now().  I don't
think I like that better, but it beats not having pg_promote_now().

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-25 Thread Simon Riggs
On 25 September 2014 15:35, Robert Haas robertmh...@gmail.com wrote:

 The only problem I see is if the newly inserted row matches one row on
 one unique value and a different row on a different unique index.
 Turning the INSERT into an UPDATE will still fail on one or other, no
 matter which index you pick. If there is one row for ALL unique
 indexes then it is irrelevant which index you pick. So either way, I
 cannot see a reason to specify an index.

 Failure could be the right thing in some cases.  For example, imagine
 that a user has a table containing names, email addresses, and (with
 apologies for the American-ism, but I don't know what would be
 comparable elsewhere) social security numbers.  The user has unique
 indexes on both email addresses and SSNs.  If a new record arrives for
 the same email address, they want to replace the existing record; but if
 a new record arrives with the same SSN, they want the transaction to
 fail.  Otherwise, a newly-arrived record might overwrite the email
 address of an existing record, which they never want to do, because
 they view email address as the primary key.

I agree with your example, but not your conclusion.

If a new record arrives with a new email address but an SSN that matches
an existing record, it will fail. There is a case that would be allowed,
which would be a record that creates an entirely new email address. So
you do have a point to argue from.

However, IMV enforcing such a restriction should be done with an After
trigger, which is already possible, not by complicating a DML
statement with information it shouldn't need to know, or that might
change in the future.

Let's keep this new feature as simple as possible. ORMs everywhere
need to be encouraged to implement this and they won't do it unless it
is bone simple to use.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-25 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
 On Thu, Sep 25, 2014 at 12:46 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  TBH I've also been wondering whether any of these proposed cures are
  better than the disease.
 
 I couldn't agree more.  There's something to be said for just leaving
 this alone.

I've been coming around to this also.  I had thought earlier that there
was consensus happening, but clearly that's not the case.

  The changes that can be argued to make the
  behavior more sane are also ones that introduce backwards compatibility
  issues of one magnitude or another.
 
 But on this point I think David Johnston said it best:
 
 # Any change has the potential to draw complaints.  For you it seems that "hey,
 # I upgraded to 9.5 and my logs are being rotated out every minute now.  I
 # thought I had that turned off" is the desired complaint.  Greg wants: "hey, my
 # 1 hour log rotation is now happening every minute."  If the error message is
 # written correctly most people upon seeing the error will simply fix their
 # configuration and move on - regardless of whether they were proactive in
 # doing so having read the release notes.
 
 I particularly agree with his first sentence - any change can
 potentially draw complaints.  But I also agree with his last one - of
 those three possible complaints, I certainly prefer "I had to fix my
 configuration file for the new, stricter validation" over any variant
 of "my configuration file still worked but it did something
 surprisingly different from what it used to do."

I'll agree with this also (which is why I had suggested moving forward
with the idea that I thought had consensus- keep things the way
they are, but toss an error if we round down a non-zero value to zero).
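
As a sketch of that idea (hedged, not the actual guc.c code): keep the
current truncating conversion, but reject a non-zero setting that would
silently become zero:

    #include <stdio.h>
    #include <stdlib.h>

    static int
    convert_time_setting(double value_ms, int unit_ms, const char *name)
    {
        int result = (int) (value_ms / unit_ms);   /* current behavior: truncate */

        if (result == 0 && value_ms > 0)
        {
            fprintf(stderr, "%g ms is below the minimum unit of \"%s\"; "
                    "use 0 to disable or at least %d ms\n",
                    value_ms, name, unit_ms);
            exit(1);                               /* guc.c would ereport(ERROR) */
        }
        return result;
    }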

As with Tom, I'm not against being argued to a different position, such
as rounding up instead of down, but I still don't like the "near-zero
goes to zero" situation we currently have.  I'd be much happier if we'd
pick one or the other and move forward with it, or agree that we can't
reach a consensus and leave well enough alone.  Not entirely sure what
the best way to get to one of the above is, but I don't feel like we're
really making much more progress at this point.

Thanks,

Stephen




Re: [HACKERS] Immediate standby promotion

2014-09-25 Thread Andres Freund
On 2014-09-24 21:36:50 +0100, Simon Riggs wrote:
 On 18 September 2014 01:22, Robert Haas robertmh...@gmail.com wrote:
 
  fast promotion was actually a supported option in r8 of Postgres but
  this option was removed when we implemented streaming replication in
  r9.0
 
  The *rough* requirement is sane, but that's not the same thing as
  saying this exact patch makes sense.
 
  Granted.  Fair point.
 
  If you are paused and you can see that WAL up ahead is damaged, then
  YES, you do want to avoid applying it. That is possible by setting a
  PITR target so that recovery stops at a precise location specified by
   you. As an existing option, it is better than the blunt force trauma
  suggested here.
 
  You can pause at a recovery target, but then what if you want to go
  read/write at that point?  Or what if you've got a time-delayed
  standby and you want to break replication so that it doesn't replay
  the DROP TABLE students that somebody ran on the master?  It doesn't
  have to be that WAL is unreadable or corrupt; it's enough for it to
  contain changes you wish to avoid replaying.
 
   If you really don't care, just shut down the server, run pg_resetxlog and
   start her up - again, no need for a new option.

I think that should pretty much never be something an admin has to
run. It's just about impossible to get this right. In all likelihood
just running pg_resetxlog on a database in recovery will have corrupted
your database.
Which is why pg_resetxlog won't even let you proceed without using -f
because it checks for DB_SHUTDOWNED. Rightly so.

pg_resetxlog *removes* *all* existing WAL and sets the current control
file state to DB_SHUTDOWNED. Thus there will be no recovery when
starting afterwards.

  To me, being able to say pg_ctl promote_right_now -m yes_i_mean_it
  seems like a friendlier interface than making somebody shut down the
  server, run pg_resetxlog, and start it up again.
 
 It makes sense to go from paused --> promoted.
 
 It doesn't make sense to go from normal running --> promoted, since
 that is just random data loss.

Why? I don't see what's random in promoting a node in the current state
*iff* it's currently consistent.

Just imagine something like promoting a current standby to a full node
because you want to run some tests on it that require writes. There's
absolutely no need to investigate the current state for that.

 I very much understand the case where
 somebody is shouting "get the web site up, we are losing business".
 Implementing a feature that allows people to do exactly what they
 asked (go live now), but loses business transactions that we thought
 had been safely recorded is not good. It implements only the exact
 request, not its actual intention.

That seems to be a problem of massive misunderstanding on the part of the
user. And I don't see how this is going to be safer by requiring the
user to first issue a pause request.

I think we should attempt to solve this by naming the command
appropriately. Something like 'abort_replay_and_promote'. Long,
nontrivial to type, and descriptive.

 Any feature that lumps both cases together is wrongly designed and
 will cause data loss.
 
 We go to a lot of trouble to ensure data is successfully on disk and
 in WAL. I won't give that up, nor do I want to make it easier to lose
 data than it already is.

I think that's not really related. Such a promotion doesn't cause data
loss in the sense of losing data a *clueful* operator wanted to
keep. Yes, it can be used wrongly, but it's far from alone in that.

Greetings,

Andres Freund


-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Immediate standby promotion

2014-09-25 Thread Andres Freund
On 2014-09-25 11:13:50 -0400, Robert Haas wrote:
 On Wed, Sep 24, 2014 at 4:36 PM, Simon Riggs si...@2ndquadrant.com wrote:
  To me, being able to say pg_ctl promote_right_now -m yes_i_mean_it
  seems like a friendlier interface than making somebody shut down the
  server, run pg_resetxlog, and start it up again.
 
  It makes sense to go from paused --> promoted.
 
 Well, we could certainly require that people call
 pg_xlog_replay_pause() before they call pg_promote_now().  I don't
 think I like that better, but it beats not having pg_promote_now().

FWIW I think it's a mistake to only allow this on a hot standby. This
should be doable using pg_ctl alone. a) works with wal_level=archive, b)
sometimes hot_standby=off is a good bit more efficient c) sometimes you
don't want to allow connections.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] delta relations in AFTER triggers

2014-09-25 Thread Kevin Grittner
Heikki Linnakangas hlinnakan...@vmware.com wrote:

 You cast the TuplestoreRelation to Plan, and pass it to CopyPlanFields.
 That will crash, because TuplestoreRelation is nothing like a Plan:

Oops.  That's a copy/paste error I should have noticed.  Fixed,
even though the node type might be going away.  Since all of this
seems to be working very well from a user point of view, I'm going
to try to generate a lot more regression tests against the existing
code before taking another run at the API, to make sure that things
don't break in the refactoring.

I didn't hit the copy/out bugs in testing so far -- any suggestions
on a test that would exercise this code?  (I'm probably missing
something obvious.)



Craig Ringer cr...@2ndquadrant.com wrote:

 On 09/15/2014 10:25 PM, Kevin Grittner wrote:

  I broke out the changes from the previous patch in multiple commits
  in my repository on github:

 *Thankyou*

 A nice patch series published in a git repo is so much easier to work
 with than a giant squashed patch as an attachment.

I have fixed the bug reported by Heikki; be sure to grab that.

I have been merging in changes to master as I go, so that bit rot
doesn't accumulate, but I don't squash or rebase; hopefully that
style works for you.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-25 Thread Andres Freund
On 2014-09-19 15:40:14 +0300, Heikki Linnakangas wrote:
 On 09/18/2014 09:27 PM, Heikki Linnakangas wrote:
 I'll try to write a more polished patch tomorrow. We'll then see what it
 looks like, and can decide if we want it.
 
 Ok, here are two patches. One is a refined version of my earlier patch, and
 the other implements the separate offsets array approach. They are both
 based on Tom's jsonb-lengths-merged.patch, so they include all the
 whitespace fixes etc. he mentioned.
 
 There is no big difference in terms of code complexity between the patches.
 IMHO the separate offsets array is easier to understand, but it makes for
 more complicated accessor macros to find the beginning of the
 variable-length data.

I personally am pretty clearly in favor of Heikki's version. I think it
could stand to slightly expand the reasoning behind the mixed
length/offset format; it's not immediately obvious why the offsets are
problematic for compression. Otherwise, based on a cursory look, it
looks good.
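
A small illustration of the compression point (not the jsonb code
itself): with many same-size elements, stored lengths are pure
repetition, which pglz's history matching handles well, while cumulative
offsets never repeat:

    unsigned int lengths[8] = { 5, 5, 5, 5, 5, 5, 5, 5 };       /* repetitive  */
    unsigned int offsets[8] = { 0, 5, 10, 15, 20, 25, 30, 35 }; /* all distinct */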

But independent of which version is chosen, we *REALLY* need to make the
decision soon. This issue has held up the next beta (like jsonb has
blocked previous beta) for *weeks*.

Personally it doesn't make me very happy that Heikki and Tom had to be
the people stepping up to fix this.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] RLS Design

2014-09-25 Thread Thom Brown
On 25 September 2014 15:26, Stephen Frost sfr...@snowman.net wrote:
 I expected this to still trigger an error due to the first policy.  Am
 I to infer from this that the policy model is permissive rather than
 restrictive?

 That's correct and I believe pretty clear in the documentation- policies
 are OR'd together, just the same as how roles are handled.  As a
 logged-in user, you have the rights of all of the roles you are a member
  of (subject to inheritance rules, of course), and similarly, you are
 able to view and add all rows which match any policy which applies to
 you (either through role membership or through different policies).

Okay, I see now.  This is a mindset issue for me as I'm looking at
them like constraints rather than permissions.  Thanks for the
explanation.

Thom




Re: [HACKERS] Immediate standby promotion

2014-09-25 Thread Robert Haas
On Thu, Sep 25, 2014 at 11:34 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-09-25 11:13:50 -0400, Robert Haas wrote:
 On Wed, Sep 24, 2014 at 4:36 PM, Simon Riggs si...@2ndquadrant.com wrote:
  To me, being able to say pg_ctl promote_right_now -m yes_i_mean_it
  seems like a friendlier interface than making somebody shut down the
  server, run pg_resetxlog, and start it up again.
 
  It makes sense to go from paused --> promoted.

 Well, we could certainly require that people call
 pg_xlog_replay_pause() before they call pg_promote_now().  I don't
 think I like that better, but it beats not having pg_promote_now().

 FWIW I think it's a mistake to only allow this on a hot standby. This
 should be doable using pg_ctl alone. a) works with wal_level=archive, b)
 sometimes hot_standby=off is a good bit more efficient c) sometimes you
 don't want to allow connections.

Good point.  Also, a pg_ctl command is more friendly to cluster-ware,
of which there is a lot these days.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-25 Thread Robert Haas
On Thu, Sep 25, 2014 at 11:21 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On 25 September 2014 15:35, Robert Haas robertmh...@gmail.com wrote:
 The only problem I see is if the newly inserted row matches one row on
 one unique value and a different row on a different unique index.
 Turning the INSERT into an UPDATE will still fail on one or other, no
 matter which index you pick. If there is one row for ALL unique
 indexes then it is irrelevant which index you pick. So either way, I
 cannot see a reason to specify an index.

 Failure could be the right thing in some cases.  For example, imagine
 that a user has a table containing names, email addresses, and (with
 apologies for the American-ism, but I don't know what would be
 comparable elsewhere) social security numbers.  The user has unique
 indexes on both email addresses and SSNs.  If a new record arrives for
  the same email address, they want to replace the existing record; but if
  a new record arrives with the same SSN, they want the transaction to
 fail.  Otherwise, a newly-arrived record might overwrite the email
 address of an existing record, which they never want to do, because
 they view email address as the primary key.

 I agree with your example, but not your conclusion.

 If a new record arrives with a new email address but an SSN that matches
 an existing record, it will fail. There is a case that would be allowed,
 which would be a record that creates an entirely new email address. So
 you do have a point to argue from.

 However, IMV enforcing such a restriction should be done with an After
 trigger, which is already possible, not by complicating a DML
 statement with information it shouldn't need to know, or that might
 change in the future.

I've never been a fan of putting the index name in there.  I agree
that's stuff that a DML statement shouldn't need to know about.  What
I've advocated for in the past is specifying the list of columns that
should be used to determine whether to insert or update.  If you have
a match on those columns, update the row; else insert.  Any other
unique indexes stand or fall as may be.

I still think that idea has merit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-25 Thread Robert Haas
On Wed, Sep 24, 2014 at 7:04 PM, Peter Geoghegan p...@heroku.com wrote:
 On Fri, Sep 19, 2014 at 2:54 PM, Peter Geoghegan p...@heroku.com wrote:
 Probably not - it appears to make very little difference to
 unoptimized pass-by-reference types whether or not datum1 can be used
 (see my simulation of Kevin's worst case, for example [1]). Streaming
 through a not inconsiderable proportion of memtuples again is probably
 a lot worse. The datum1 optimization (which is not all that old) made
 a lot of sense when initially introduced, because it avoided chasing
 through a pointer for pass-by-value types. I think that's its sole
 justification, though.

 Just to be clear -- I am blocked on this. Do you really prefer to
 restart copying heap tuples from scratch in the event of aborting,
 just to make sure that the datum1 representation is consistently
 either a pointer to text, or an abbreviated key? I don't think that
 the costs involved make that worth it, as I've said, but I'm not sure
 how to resolve that controversy.

 I suggest that we focus on making sure the abort logic itself is
 sound. There probably hasn't been enough discussion of that. Once that
 is resolved, we can revisit the question of whether or not copying
 should restart to keep the datum1 representation consistent. I suspect
 that leaving that until later will be easier all around.

The top issue on my agenda is figuring out a way to get rid of the
extra SortSupport object.  I'm not going to commit any version of this
patch that uses a second SortSupport for the tiebreak.  I doubt anyone
else will like that either, but you can try.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-25 Thread Josh Berkus
On 09/25/2014 09:01 AM, Andres Freund wrote:
 But independent of which version is chosen, we *REALLY* need to make the
 decision soon. This issue has held up the next beta (like jsonb has
 blocked previous beta) for *weeks*.

Yes, please!

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] KNN-GiST with recheck

2014-09-25 Thread Emre Hasegeli
 Fixed, thanks.

Here are my questions and comments about the code.

doc/src/sgml/gist.sgml:812:
be rechecked from heap tuple before tuple is returned.  If
<literal>recheck</> flag isn't set then it's true by default for
compatibility reasons.  The <literal>recheck</> flag can be used only

Recheck flag is set to false on gistget.c so I think it should say
"false by default".  On the other hand, it is true by default on
the consistent function.  It is written as "the safest assumption"
in the code comments.  I don't know why "the safest" is chosen over
"the backwards compatible" for the consistent function.

src/backend/access/gist/gistget.c:505:
    /* Recheck distance from heap tuple if needed */
    if (GISTSearchItemIsHeap(*item) &&
        searchTreeItemNeedDistanceRecheck(scan, so->curTreeItem))
    {
        searchTreeItemDistanceRecheck(scan, so->curTreeItem, item);
        continue;
    }

Why is so->curTreeItem passed to these functions?  They can use
scan->opaque->curTreeItem.

src/backend/access/gist/gistscan.c:49:
    /*
     * When all distance values are the same, items without recheck
     * can be immediately returned.  So they are placed first.
     */
    if (recheckCmp == 0 && distance_a.recheck != distance_b.recheck)
        recheckCmp = distance_a.recheck ? 1 : -1;

I don't understand why items without recheck can be immediately
returned.  Do you think it will work correctly when there is
an operator class which will return recheck true and false for
the items under the same page?

src/backend/access/index/indexam.c:258:
    /* Prepare data structures for getting original indexed values from heap */
    scan->indexInfo = BuildIndexInfo(scan->indexRelation);
    scan->estate = CreateExecutorState();
    scan->slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation));

With the changes in indexam.c, heap access becomes legal for all index
access methods.  I think it is better than the previous version but
I am leaving the judgement to someone experienced.  I will try to
summarize the pros and cons of sorting the rows in the GiST access
method, as far as I understand.

Pros:

* It does not require another queue.  It should be effective to sort
  the rows inside the queue the GiST access method already has.
* It does not complicate index access method infrastructure.

Cons:

* It could be done without additional heap access.
* Other access methods could make use of the sorting infrastructure
  one day.
* It could be more transparent to the users.  Sorting information
  could be shown on the explain output.
* A more suitable data structure like binary heap could be used
  for the queue to sort the rows.




Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2014-09-25 Thread Jeff Janes
On Wed, Sep 24, 2014 at 2:48 AM, Dilip kumar dilip.ku...@huawei.com wrote:

  On 24 August 2014 11:33, Amit Kapila Wrote

 Thanks for your comments, I have worked on both the review comment lists
 sent on 19 August and 24 August.

 Latest patch is attached with the mail.


Hi Dilip,

I think you have an off-by-one error in the index into the array of file
handles.

vacuumdb runs at full CPU, and if I run:

strace -ttt -T -f ../parallel_vac/bin/vacuumdb -z -a -j 8

I get the select returning immediately with a bad file descriptor error:

1411663937.641177 select(55, [4 5 6 7 8 9 10 54], NULL, NULL, NULL) = -1
EBADF (Bad file descriptor) 0.12
1411663937.641232 recvfrom(3, 0x104e3f0, 16384, 0, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable) 0.12
1411663937.641279 recvfrom(4, 0x1034bc0, 16384, 0, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable) 0.11
1411663937.641326 recvfrom(5, 0x10017c0, 16384, 0, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable) 0.14
1411663937.641415 recvfrom(6, 0x10097e0, 16384, 0, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable) 0.20
1411663937.641487 recvfrom(7, 0x1012330, 16384, 0, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable) 0.13
1411663937.641538 recvfrom(8, 0x101af00, 16384, 0, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable) 0.11
1411663937.641584 recvfrom(9, 0x1023af0, 16384, 0, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable) 0.12
1411663937.641631 recvfrom(10, 0x1054600, 16384, 0, 0, 0) = -1 EAGAIN
(Resource temporarily unavailable) 0.12
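
That EBADF is the classic symptom of an fd_set built with mismatched
loop bounds.  Purely as an illustration (hypothetical names, not the
vacuumdb code itself):

    #include <sys/select.h>
    #include <libpq-fe.h>

    static int
    wait_for_any(PGconn **conns, int nconns)
    {
        fd_set  mask;
        int     maxfd = -1;

        FD_ZERO(&mask);
        for (int i = 1; i <= nconns; i++)  /* BUG: skips conns[0], reads conns[nconns] */
        {
            int fd = PQsocket(conns[i]);   /* a stale/garbage fd lands in the set */

            FD_SET(fd, &mask);
            if (fd > maxfd)
                maxfd = fd;
        }
        return select(maxfd + 1, &mask, NULL, NULL, NULL);
    }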

Cheers,

Jeff


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-25 Thread Bruce Momjian
On Thu, Sep 25, 2014 at 06:01:08PM +0200, Andres Freund wrote:
 But independent of which version is chosen, we *REALLY* need to make the
 decision soon. This issue has held up the next beta (like jsonb has
 blocked previous beta) for *weeks*.
 
 Personally it doesn't make me very happy that Heikki and Tom had to be
 the people stepping up to fix this.

I think there are a few reasons this has been delayed, aside from the
scheduling ones:

1.  compression issues were a surprise, and we are wondering if
there are any other surprises
2.  pg_upgrade makes future data format changes problematic
3.  9.3 multi-xact bugs spooked us into being more careful

I am not sure what we can do to increase our speed based on these items.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Immediate standby promotion

2014-09-25 Thread Simon Riggs
On 25 September 2014 16:29, Andres Freund and...@2ndquadrant.com wrote:

  To me, being able to say pg_ctl promote_right_now -m yes_i_mean_it
  seems like a friendlier interface than making somebody shut down the
  server, run pg_resetxlog, and start it up again.

  It makes sense to go from paused --> promoted.
 
  It doesn't make sense to go from normal running --> promoted, since
  that is just random data loss.

 Why? I don't see what's random in promoting a node in the current state
 *iff* it's currently consistent.

 Just imagine something like promoting a current standby to a full node
 because you want to run some tests on it that require writes. There's
 absolutely no need to investigate the current state for that.

 I very much understand the case where
  somebody is shouting "get the web site up, we are losing business".
 Implementing a feature that allows people to do exactly what they
 asked (go live now), but loses business transactions that we thought
 had been safely recorded is not good. It implements only the exact
 request, not its actual intention.

  That seems to be a problem of massive misunderstanding on the part of the
  user. And I don't see how this is going to be safer by requiring the
  user to first issue a pause request.

 I think we should attempt to solve this by naming the command
 appropriately. Something like 'abort_replay_and_promote'. Long,
 nontrivial to type, and descriptive.

 Any feature that lumps both cases together is wrongly designed and
 will cause data loss.

 We go to a lot of trouble to ensure data is successfully on disk and
 in WAL. I won't give that up, nor do I want to make it easier to lose
 data than it already is.

 I think that's not really related. Such a promotion doesn't cause data
  loss in the sense of losing data a *clueful* operator wanted to
 keep. Yes, it can be used wrongly, but it's far from alone in that.

Yes it does cause data loss. The clueful operator has no idea where
they are, so there is no "used rightly" in that case.

If I were to give this feature a name it would be --discard or
--random-data-loss, or --reset-hard

The point of pausing is misunderstood. That is close but not quite relevant.

If you are at a known location and request promotion, we can presume
you know what you are doing, so it is simply Promote.

If you are at an unknown location and therefore have clearly not
verified any state before promotion, you are clearly making an
uninformed decision that will likely result in data loss, with no way
of knowing the impact and no mechanism for recovering from it. Trying
to promote something while it is still recovering proves we don't know
the state; we're just picking a random LSN.

So if you have a time delayed standby and the master breaks, or there
is a bad transaction then the correct action would be to
* pause the delayed standby
* discover where the master broke, or the xid of the bad transaction
* restart recovery to go up to the correct time/xid/lsn
* promote standby

That is already possible in 9.4

The original patch for pausing contained code to reset the PITR target
with functions, which would make the above even easier.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-25 Thread Josh Berkus
On 09/25/2014 10:14 AM, Bruce Momjian wrote:
 On Thu, Sep 25, 2014 at 06:01:08PM +0200, Andres Freund wrote:
 But independent of which version is chosen, we *REALLY* need to make the
 decision soon. This issue has held up the next beta (like jsonb has
 blocked previous beta) for *weeks*.

 Personally it doesn't make me very happy that Heikki and Tom had to be
 the people stepping up to fix this.
 
 I think there are a few reasons this has been delayed, aside from the
 scheduling ones:
 
   1.  compression issues were a surprise, and we are wondering if
   there are any other surprises
   2.  pg_upgrade makes future data format changes problematic
   3.  9.3 multi-xact bugs spooked us into being more careful
 
 I am not sure what we can do to increase our speed based on these items.

Alternately, this is delayed because:

1. We have one tested patch to fix the issue.

2. However, people are convinced that there's a better patch possible.

3. But nobody is working on this better patch except in their spare time.

Given this, I once again vote for releasing based on Tom's lengths-only
patch, which is done, tested, and ready to go.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-25 Thread Andres Freund
On 2014-09-25 10:18:24 -0700, Josh Berkus wrote:
 On 09/25/2014 10:14 AM, Bruce Momjian wrote:
  On Thu, Sep 25, 2014 at 06:01:08PM +0200, Andres Freund wrote:
  But independent of which version is chosen, we *REALLY* need to make the
  decision soon. This issue has held up the next beta (like jsonb has
  blocked previous beta) for *weeks*.
 
  Personally it doesn't make me very happy that Heikki and Tom had to be
  the people stepping up to fix this.
  
  I think there are a few reasons this has been delayed, aside from the
  scheduling ones:
  
  1.  compression issues were a surprise, and we are wondering if
  there are any other surprises
  2.  pg_upgrade makes future data format changes problematic
  3.  9.3 multi-xact bugs spooked us into being more careful
  
  I am not sure what we can do to increase our speed based on these items.
 
 Alternately, this is delayed because:
 
 1. We have one tested patch to fix the issue.
 
 2. However, people are convinced that there's a better patch possible.
 
 3. But nobody is working on this better patch except in their spare time.
 
 Given this, I once again vote for releasing based on Tom's lengths-only
 patch, which is done, tested, and ready to go.

Heikki's patch is there and polished.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] KNN-GiST with recheck

2014-09-25 Thread Alexander Korotkov
On Thu, Sep 25, 2014 at 9:00 PM, Emre Hasegeli e...@hasegeli.com wrote:

  Fixed, thanks.

 Here are my questions and comments about the code.

 doc/src/sgml/gist.sgml:812:
 be rechecked from heap tuple before tuple is returned.  If
 <literal>recheck</> flag isn't set then it's true by default for
 compatibility reasons.  The <literal>recheck</> flag can be used
 only

 Recheck flag is set to false on gistget.c so I think it should say
 "false by default".  On the other hand, it is true by default on
 the consistent function.  It is written as "the safest assumption"
 in the code comments.  I don't know why "the safest" is chosen over
 "the backwards compatible" for the consistent function.


Agree. It should be clarified in docs.

 src/backend/access/gist/gistget.c:505:
     /* Recheck distance from heap tuple if needed */
     if (GISTSearchItemIsHeap(*item) &&
         searchTreeItemNeedDistanceRecheck(scan, so->curTreeItem))
     {
         searchTreeItemDistanceRecheck(scan, so->curTreeItem, item);
         continue;
     }

 Why is so->curTreeItem passed to these functions?  They can use
 scan->opaque->curTreeItem.


I didn't get the difference. A few lines before:

    GISTScanOpaque so = (GISTScanOpaque) scan->opaque;


 src/backend/access/gist/gistscan.c:49:
     /*
      * When all distance values are the same, items without recheck
      * can be immediately returned.  So they are placed first.
      */
     if (recheckCmp == 0 && distance_a.recheck != distance_b.recheck)
         recheckCmp = distance_a.recheck ? 1 : -1;

 I don't understand why items without recheck can be immediately
 returned.  Do you think it will work correctly when there is
 an operator class which will return recheck true and false for
 the items under the same page?


Yes, I believe so. An item with recheck can't decrease its distance, it
can only increase it. In the corner case an item can have the same
distance after recheck as it had before. Then anyway, items whose
distances are the same can be returned in any order.


 src/backend/access/index/indexam.c:258:
     /* Prepare data structures for getting original indexed values from heap */
     scan->indexInfo = BuildIndexInfo(scan->indexRelation);
     scan->estate = CreateExecutorState();
     scan->slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation));

 With the changes in indexam.c, heap access become legal for all index
 access methods.  I think it is better than the previous version but
 I am leaving the judgement to someone experienced.  I will try to
 summarize the pros and cons of sorting the rows in the GiST access
 method, as far as I understand.

 Pros:

 * It does not require another queue.  It should be effective to sort
   the rows inside the queue the GiST access method already has.
 * It does not complicate index access method infrastructure.

 Cons:

 * It could be done without additional heap access.
 * Other access methods could make use of the sorting infrastructure
   one day.
 * It could be more transparent to the users.  Sorting information
   could be shown on the explain output.


It would also be nice to show some information about KNN itself.


 * A more suitable data structure like binary heap could be used
   for the queue to sort the rows.


Binary heap seems to be a better data structure for the whole of
KNN-GiST. But that's a subject for a separate patch: replacing the
RB-tree with a heap in KNN-GiST. It's not related to the recheck stuff.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-25 Thread Josh Berkus
On 09/25/2014 10:20 AM, Andres Freund wrote:
 On 2014-09-25 10:18:24 -0700, Josh Berkus wrote:
 On 09/25/2014 10:14 AM, Bruce Momjian wrote:
 On Thu, Sep 25, 2014 at 06:01:08PM +0200, Andres Freund wrote:
 But independent of which version is chosen, we *REALLY* need to make the
 decision soon. This issue has held up the next beta (like jsonb has
 blocked previous beta) for *weeks*.

 Personally it doesn't make me very happy that Heikki and Tom had to be
 the people stepping up to fix this.

 I think there are a few reasons this has been delayed, aside from the
 scheduling ones:

 1.  compression issues were a surprise, and we are wondering if
 there are any other surprises
 2.  pg_upgrade makes future data format changes problematic
 3.  9.3 multi-xact bugs spooked us into being more careful

 I am not sure what we can do to increase our speed based on these items.

 Alternately, this is delayed because:

 1. We have one tested patch to fix the issue.

 2. However, people are convinced that there's a better patch possible.

 3. But nobody is working on this better patch except in their spare time.

 Given this, I once again vote for releasing based on Tom's lengths-only
 patch, which is done, tested, and ready to go.
 
 Heikki's patch is there and polished.

If Heikki says it's ready, I'll test.  So far he's said that it wasn't
done yet.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Immediate standby promotion

2014-09-25 Thread Andres Freund
On 2014-09-25 18:18:09 +0100, Simon Riggs wrote:
 On 25 September 2014 16:29, Andres Freund and...@2ndquadrant.com wrote:
  I think that's not really related. Such a promotion doesn't cause data
  loss in the sense of loosing data a *clueful* operator wanted to
  keep. Yes, it can be used wrongly, but it's far from alone in that.
 
 Yes it does cause data loss. The clueful operator has no idea where
 they are, so there is no "used rightly" in that case.

What? There definitely are cases where you don't need to know that to
the T. Just think of the - quite frequently happening - need to promote
a standby to run tests or reporting queries that can't be run on a
standby.

Sure, you shouldn't use it if you expect a very specific set of the data
being there, but that's not always necessary. And that's why it should
never, ever be the default.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-25 Thread Josh Berkus
On 09/25/2014 10:26 AM, Andres Freund wrote:
 On 2014-09-25 10:25:24 -0700, Josh Berkus wrote:
 If Heikki says it's ready, I'll test.  So far he's said that it wasn't
 done yet.
 
 http://www.postgresql.org/message-id/541c242e.3030...@vmware.com

Yeah, and that didn't include some of Tom's bug fixes apparently, per
the succeeding message.  Which is why I asked Heikki if he was done, to
which he has not replied.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-25 Thread Andres Freund
On 2014-09-25 10:29:51 -0700, Josh Berkus wrote:
 On 09/25/2014 10:26 AM, Andres Freund wrote:
  On 2014-09-25 10:25:24 -0700, Josh Berkus wrote:
  If Heikki says it's ready, I'll test.  So far he's said that it wasn't
  done yet.
  
  http://www.postgresql.org/message-id/541c242e.3030...@vmware.com
 
 Yeah, and that didn't include some of Tom's bug fixes apparently, per
 the succeeding message.  Which is why I asked Heikki if he was done, to
 which he has not replied.

Well, Heikki said he doesn't see any fixes in Tom's patch. But either
way, this isn't anything that should prevent you from testing.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-25 Thread Andres Freund
On 2014-09-25 10:25:24 -0700, Josh Berkus wrote:
 If Heikki says it's ready, I'll test.  So far he's said that it wasn't
 done yet.

http://www.postgresql.org/message-id/541c242e.3030...@vmware.com

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Immediate standby promotion

2014-09-25 Thread Robert Haas
On Thu, Sep 25, 2014 at 1:30 PM, Andres Freund and...@2ndquadrant.com wrote:
 Yes it does cause data loss. The clueful operator has no idea where
 they are so there is no used rightly in that case.

 What? There definitely are cases where you don't need to know that to
 the T. Just think of the - quite frequently happening - need to promote
 a standby to run tests or reporting queries that can't be run on a
 standby.

 Sure, you shouldn't use it if you expect a very specific set of the data
 being there, but that's not always necessary.

Very well put.  +1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-25 Thread Peter Eisentraut
On 9/25/14 11:03 AM, Robert Haas wrote:
 I couldn't agree more.  There's something to be said for just leaving
 this alone.

I agree.

 potentially draw complaints.  But I also agree with his last one - of
 those three possible complaints, I certainly prefer "I had to fix my
 configuration file for the new, stricter validation" over any variant
 of "my configuration file still worked but it did something
 surprisingly different from what it used to do".

Yes.  I don't mind that we rename parameters from time to time when
semantics change or are refined.  But having the same parameter setting
mean different things in different versions is the path to complete madness.





Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-25 Thread Peter Geoghegan
On Thu, Sep 25, 2014 at 9:21 AM, Robert Haas robertmh...@gmail.com wrote:
 The top issue on my agenda is figuring out a way to get rid of the
 extra SortSupport object.

Really? I'm surprised. Clearly the need to restart heap tuple copying
from scratch, in order to make the datum1 representation consistent,
rather than abandoning datum1 for storing abbreviated keys or pointers
entirely, is a very important aspect of whether or not we should change
that. In turn, that's something that's going to (probably
significantly) affect the worst case.

Do you have an opinion on that? If you want me to start from scratch,
and then have a consistent datum1 representation, and then be able to
change the structure of comparetup_heap() as you outline (so as to get
rid of the extra SortSupport object), I can do that. My concern is the
regression. The datum1 pointer optimization appears to matter very
little for pass by value types (traditionally, before abbreviated
keys), and so I have a hard time imagining this working out.

-- 
Peter Geoghegan




Re: [HACKERS] Review of GetUserId() Usage

2014-09-25 Thread Peter Eisentraut
On 9/24/14 4:58 PM, Stephen Frost wrote:
 Alvaro,
 
 * Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
 I think the case for pgstat_get_backend_current_activity() and
 pg_stat_get_activity and the other pgstatfuncs.c callers is easy to make
 and seems acceptable to me; but I would leave pg_signal_backend out of
 that discussion, because it has a potentially harmful side effect.  By
 requiring SET ROLE you add an extra layer of protection against
 mistakes.  (Hopefully, pg_signal_backend() is not a routine thing for
 well-run systems, which means human intervention, and therefore the room
 for error isn't insignificant.)
 
 While I certainly understand where you're coming from, I don't really
 buy into it.  Yes, cancelling a query (the only thing normal users can
 do anyway- they can't terminate backends) could mean the loss of any
 in-progress work, but it's not like 'rm' and I don't see that it needs
 to require extra hoops for individuals to go through.

It would be weird if it were inconsistent: some things require role
membership, some things require SET ROLE.  Try explaining that.





Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-25 Thread Simon Riggs
On 28 August 2014 03:43, Peter Geoghegan p...@heroku.com wrote:

 Value locking
 ===

 To date, on-list discussion around UPSERT has almost exclusively
 concerned what I've called value locking; the idea of locking values
 in unique indexes in the abstract (to establish the right to insert
 ahead of time). There was some useful discussion on this question
 between myself and Heikki back around December/January. Ultimately, we
 were unable to reach agreement on an approach and discussion tapered
 off. However, Heikki did understand the concerns that informed my
 design. He recognized the need to be able to easily *release* value
 locks, so as to avoid unprincipled deadlocks, where under high
 concurrency there are deadlocks between sessions that only UPSERT a
 single row at a time. I'm not sure how widely appreciated this point
 is, but I believe that Heikki appreciates it. It is a very important
 point in my opinion. I don't want an implementation that is in any way
 inferior to what the "UPSERT looping subxact" pattern does (i.e. the plpgsql
 thing that the docs suggest).

 When we left off, Heikki continued to favor an approach that involved
 speculatively inserting heap tuples, and then deleting them in the
 event of a conflict. This design was made more complicated when the
 need to *release* value locks became apparent (Heikki ended up making
 some changes to HeapTupleSatisfiesDirty(), as well as sketching a
 design for what you might call a "super delete", where xmin can be set
 to InvalidTransactionId for speculatively-inserted heap tuples). After
 all, it wasn't as if we could abort a subxact to release locks, which
 is what the UPSERT looping subxact pattern does. I think it's fair
 to say that that design became more complicated than initially
 anticipated [4] [5].

 Anyway, the greater point here is that fundamentally, AFAICT Heikki
 and I were in agreement. Once you buy into the idea that we must avoid
 holding on to value locks of whatever form - as Heikki evidently did
 - then exactly what form they take is ultimately only a detail.
 Granted, it's a very important detail, but a detail nonetheless. It
 can be discussed entirely independently of all of this new stuff, and
 thank goodness for that.

 If anyone finds my (virtually unchanged) page heavyweight lock based
 value locking approach objectionable, I ask that the criticism be
 framed in a way that makes a sharp distinction between each of the
 following:

 1. You don't accept that value locks must be easily released in the
 event of a conflict. Is anyone in this camp? It's far from obvious to
 me what side of this question Andres is on at this stage, for example.
 Robert might have something to say here too.

 2. Having taken into account the experience of myself and Heikki, and
 all that is implied by taking that approach ***while avoiding
 unprincipled deadlocks***, you continue to believe that an approach
 based on speculative heap insertion, or some alternative scheme is
 better than what I have done to the nbtree code here, or you otherwise
 dislike something about the proposed value locking scheme. You accept
 that value locks must be released and released easily in the event of
 a conflict, but like Heikki you just don't like what I've done to get
 there.

 Since we can (I believe) talk about the value locking aspect and the
 rest of the patch independently, we should do so...unless you're in
 camp 1, in which case I guess that we'll have to thrash it out.

I'm trying to understand and help out with pushing this patch forwards
to completion.

Basically, I have absolutely no idea whether I object to or agree with
1) and don't know where to look to find out. We need a clear
exposition of design and the alternatives.

My approach would be to insert an index tuple for that value into the
index, but with the leaf ituple marked with an xid rather than a ctid.
If someone tries to insert into the index they would see this and wait
for the inserting transaction to end. The inserting transaction would
then resolve what happens in the heap (insert/update) and later
repoint the index tuple to the inserted/updated row version. I don't
see the need for page level locking since it would definitely result
in deadlocks (e.g. SQLServer).
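
To make that concrete, one hypothetical shape for such a promise index
tuple, stashing the xid where the ctid normally lives. This is purely a
sketch of the scheme described above, not code from any patch; a real
design would also need a flag bit to tell promise tuples from ordinary
ones, WAL logging, and the later repointing step:

    #include "postgres.h"
    #include "access/itup.h"

    /* Sketch: store the inserter's xid in the line pointer fields. */
    static void
    itup_mark_promise(IndexTuple itup, TransactionId xid)
    {
        BlockIdSet(&itup->t_tid.ip_blkid, (BlockNumber) (xid >> 16));
        itup->t_tid.ip_posid = (OffsetNumber) (xid & 0xFFFF);
    }

    static TransactionId
    itup_promise_xid(IndexTuple itup)
    {
        return ((TransactionId)
                BlockIdGetBlockNumber(&itup->t_tid.ip_blkid) << 16) |
               (TransactionId) itup->t_tid.ip_posid;
    }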

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-25 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 On 09/25/2014 10:26 AM, Andres Freund wrote:
 On 2014-09-25 10:25:24 -0700, Josh Berkus wrote:
 If Heikki says it's ready, I'll test.  So far he's said that it wasn't
 done yet.

 http://www.postgresql.org/message-id/541c242e.3030...@vmware.com

 Yeah, and that didn't include some of Tom's bug fixes apparently, per
 the succeeding message.  Which is why I asked Heikki if he was done, to
 which he has not replied.

I took a quick look at the two patches Heikki posted.  I find the
separate offsets array approach unappealing.  It takes more space
than the other approaches, and that space will be filled with data
that we already know will not be at all compressible.  Moreover,
AFAICS we'd have to engrave the stride on stone tablets, which as
I already mentioned I'd really like to not do.

The offsets-and-lengths patch seems like the approach we ought to
compare to my patch, but it looks pretty unfinished to me: AFAICS it
includes logic to understand offsets sprinkled into a mostly-lengths
array, but no logic that would actually *store* any such offsets,
which means it's going to act just like my patch for performance
purposes.
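
For readers following along, the intuition behind a mostly-lengths
entry array is easy to sketch. Lengths compress well because repeated
element sizes recur, and storing an absolute offset every STRIDE
entries caps random access at STRIDE-1 additions. The names below are
made up; this is not code from either patch:

    #include "postgres.h"

    #define STRIDE   32
    #define HAS_OFF  0x80000000u        /* entry holds an absolute offset */
    #define FIELD(e) ((e) & 0x7FFFFFFFu)

    /*
     * Sketch: offset of element i's start, summing lengths backwards
     * until an entry that stores an absolute end offset is found.
     */
    static uint32
    element_start(const uint32 *entries, int i)
    {
        uint32  off = 0;
        int     j;

        for (j = i - 1; j >= 0; j--)
        {
            off += FIELD(entries[j]);   /* length, or absolute end offset */
            if (entries[j] & HAS_OFF)
                break;                  /* hit an anchor: stop summing */
        }
        return off;
    }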

In the interests of pushing this forward, I will work today on
trying to finish and review Heikki's offsets-and-lengths patch
so that we have something we can do performance testing on.
I doubt that the performance testing will tell us anything we
don't expect, but we should do it anyway.

regards, tom lane




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-25 Thread Josh Berkus
On 09/25/2014 11:22 AM, Tom Lane wrote:
 In the interests of pushing this forward, I will work today on
 trying to finish and review Heikki's offsets-and-lengths patch
 so that we have something we can do performance testing on.
 I doubt that the performance testing will tell us anything we
 don't expect, but we should do it anyway.

OK.  I'll spend some time trying to get Socorro with JSONB working so
that I'll have a second test case.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-25 Thread Peter Geoghegan
On Thu, Sep 25, 2014 at 11:17 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Basically, I have absolutely no idea whether I object to or agree with
 1) and don't know where to look to find out. We need a clear
 exposition of design and the alternatives.

 My approach would be to insert an index tuple for that value into the
 index, but with the leaf ituple marked with an xid rather than a ctid.
 If someone tries to insert into the index they would see this and wait
 for the inserting transaction to end. The inserting transaction would
 then resolve what happens in the heap (insert/update) and later
 repoint the index tuple to the inserted/updated row version. I don't
 see the need for page level locking since it would definitely result
 in deadlocks (e.g. SQLServer).

The page level locks are only used to prevent concurrent insertion for
as long as it takes to get consensus to proceed among unique indexes,
and to actually insert a heap tuple. They're all released before we
lock the tuple for update, should we take that path (yes, really).
This is consistent with the behavior of other systems, I think. That's
my whole reason for preferring to do things that way. If you have a
"promise tuples" approach - be it what you outline here, or what
Heikki prototyped with heap tuple insertion, or any other - then you
need a way to *release* those value locks in the event of a
conflict/needing to update, before locking/updating. Otherwise, you
get deadlocks. This is an issue I highlighted when it came up with
Heikki's prototype.

AFAICT, any scheme for value locking needs to strongly consider the
need to *release* value locks inexpensively. Whatever else they do,
they cannot persist for the duration of the transaction IMV.

Does that make sense? If not, my next suggestion is applying an
earlier revision of Heikki's prototype, and seeing for yourself how it
can be made to deadlock in an unprincipled/impossible to prevent way
[1]. You've quite rightly highlighted the existing subxact looping
pattern as something that this needs to be better than in every way.
This is one important way in which we might fail to live up to that
standard.

[1] http://www.postgresql.org/message-id/52b4aaf0.5090...@vmware.com
-- 
Peter Geoghegan




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-25 Thread Tom Lane
BTW, it seems like there is consensus that we ought to reorder the items
in a jsonb object to have keys first and then values, independently of the
other issues under discussion.  This means we *will* be breaking on-disk
compatibility with 9.4beta2, which means pg_upgrade will need to be taught
to refuse an upgrade if the database contains any jsonb columns.  Bruce,
do you have time to crank out a patch for that?
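
For illustration, the reordered layout would put the i-th pair of an
nPairs-pair object at entry slots i (key) and i + nPairs (value), so a
key search stays within the contiguous run of keys and never touches
value entries. A hypothetical sketch with made-up names:

    #include "postgres.h"

    typedef uint32 JEntry;              /* stand-in for the real entry word */

    static void
    object_pair(const JEntry *children, int nPairs, int i,
                JEntry *key, JEntry *value)
    {
        *key = children[i];             /* keys:   slots 0 .. nPairs-1        */
        *value = children[i + nPairs];  /* values: slots nPairs .. 2*nPairs-1 */
    }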

regards, tom lane




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-25 Thread Peter Geoghegan
On Thu, Sep 25, 2014 at 7:35 AM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Sep 25, 2014 at 10:12 AM, Simon Riggs si...@2ndquadrant.com wrote:
 IMHO it is impossible to know if any of the other code is correct
 until we have a clear and stable vision of what the command is
 supposed to perform.

 +1.

 The inner workings are less important than what the feature does.

 +1.

 FWIW, the row available at the end of all BEFORE triggers is clearly
 the object we should be manipulating, not the original VALUES()
 clause. Otherwise this type of INSERT would behave differently from
 normal INSERTs. Which would likely violate RLS, if nothing else.

 +1.

I agree with all of this. I'm glad that my opinion on how a
CONFLICTING() expression interacts with BEFORE triggers is accepted,
too.


-- 
Peter Geoghegan




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-25 Thread Bruce Momjian
On Thu, Sep 25, 2014 at 02:39:37PM -0400, Tom Lane wrote:
 BTW, it seems like there is consensus that we ought to reorder the items
 in a jsonb object to have keys first and then values, independently of the
 other issues under discussion.  This means we *will* be breaking on-disk
 compatibility with 9.4beta2, which means pg_upgrade will need to be taught
 to refuse an upgrade if the database contains any jsonb columns.  Bruce,
 do you have time to crank out a patch for that?

Yes, I can do that easily.  Tell me when you want it --- I just need a
catalog version number to trigger on.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-25 Thread Robert Haas
On Thu, Sep 25, 2014 at 2:05 PM, Peter Geoghegan p...@heroku.com wrote:
 On Thu, Sep 25, 2014 at 9:21 AM, Robert Haas robertmh...@gmail.com wrote:
 The top issue on my agenda is figuring out a way to get rid of the
 extra SortSupport object.

 Really? I'm surprised. Clearly the need to restart heap tuple copying
 from scratch, in order to make the datum1 representation consistent,
 rather than abandoning datum1 for storing abbreviated keys or pointers
 entirely, is a very important aspect of whether or not we should change
 that. In turn, that's something that's going to (probably
 significantly) affect the worst case.

 Do you have an opinion on that?

I haven't looked at that part of the patch in detail yet, so... not
really.  But I don't see why you'd ever need to restart heap tuple
copying.  At most you'd need to re-extract datum1 from the tuples you
have already copied.  To find out how much that optimization buys, you
should use tuples with many variable-length columns (say, 50)
preceding the text column you're sorting on. I won't be surprised if
that turns out to be expensive enough to be worth worrying about, but
I have not benchmarked it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-25 Thread Peter Geoghegan
On Thu, Sep 25, 2014 at 9:20 AM, Robert Haas robertmh...@gmail.com wrote:
 I've never been a fan of putting the index name in there.

Me neither. Although I do understand Kevin's concern about the user's
intent surrounding which unique index to merge on.

 I agree
 that's stuff that a DML statement shouldn't need to know about.  What
 I've advocated for in the past is specifying the list of columns that
 should be used to determine whether to insert or update.  If you have
 a match on those columns, update the row; else insert.  Any other
 unique indexes stand or fall as may be.

 I still think that idea has merit.

As I've said, my problem with that idea is the corner cases. Consider
the possible ambiguity. Could DML queries in production start failing
(ambiguous unique index specification) because the DBA created a new
unique index on attributes that somewhat overlap the attributes of the
unique index that the DML author actually meant? What about the
effects of BEFORE triggers, and their interaction with partial unique
indexes? If you can describe an exact behavior that overcomes these
issues, then I'll give serious consideration to implementing it. As
things stand, to be perfectly honest it sounds like a footgun to me.
There are interactions that make getting it right very ticklish.

I don't want to make a unique index specification mandatory because
that's ugly - that's the only reason, TBH. However, while what you
describe here accomplishes the same thing without being ugly, it is
potentially very surprising. Naming the unique index directly has the
great advantage of very clearly demonstrating user intent.

-- 
Peter Geoghegan




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-25 Thread Andres Freund
On 2014-09-25 14:46:18 -0400, Bruce Momjian wrote:
 On Thu, Sep 25, 2014 at 02:39:37PM -0400, Tom Lane wrote:
  BTW, it seems like there is consensus that we ought to reorder the items
  in a jsonb object to have keys first and then values, independently of the
  other issues under discussion.  This means we *will* be breaking on-disk
  compatibility with 9.4beta2, which means pg_upgrade will need to be taught
  to refuse an upgrade if the database contains any jsonb columns.  Bruce,
  do you have time to crank out a patch for that?
 
 Yes, I can do that easily.  Tell me when you want it --- I just need a
 catalog version number to trigger on.

Do you plan to make it conditional on jsonb being used in the database?
That'd not be bad to reduce the pain for testers that haven't used jsonb.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-25 Thread Robert Haas
On Thu, Sep 25, 2014 at 2:17 PM, Simon Riggs si...@2ndquadrant.com wrote:
 1. You don't accept that value locks must be easily released in the
 event of a conflict. Is anyone in this camp? It's far from obvious to
 me what side of this question Andres is on at this stage, for example.
 Robert might have something to say here too.

 2. Having taken into account the experience of myself and Heikki, and
 all that is implied by taking that approach ***while avoiding
 unprincipled deadlocks***, you continue to believe that an approach
 based on speculative heap insertion, or some alternative scheme is
 better than what I have done to the nbtree code here, or you otherwise
 dislike something about the proposed value locking scheme. You accept
 that value locks must be released and released easily in the event of
 a conflict, but like Heikki you just don't like what I've done to get
 there.

 Since we can (I believe) talk about the value locking aspect and the
 rest of the patch independently, we should do so...unless you're in
 camp 1, in which case I guess that we'll have to thrash it out.

 I'm trying to understand and help out with pushing this patch forwards
 to completion.

 Basically, I have absolutely no idea whether I object to or agree with
 1) and don't know where to look to find out. We need a clear
 exposition of design and the alternatives.

I laughed when I read this, because I think a lot of the discussion on
this topic has been unnecessarily muddled by jargon.

 My approach would be to insert an index tuple for that value into the
 index, but with the leaf ituple marked with an xid rather than a ctid.
 If someone tries to insert into the index they would see this and wait
 for the inserting transaction to end. The inserting transaction would
 then resolve what happens in the heap (insert/update) and later
 repoint the index tuple to the inserted/updated row version. I don't
 see the need for page level locking since it would definitely result
 in deadlocks (e.g. SQLServer).

I think that something like this might work, but the devil is in the
details.  Suppose two people try to upsert into the same table at the
same time.  There's one index.  If the transactions search that index
for conflicts first, neither sees any conflicting tuples, and both
proceed.  That's no good.  OK, so suppose each transaction inserts the
special index tuple which you mention, to lock out concurrent inserts
of that value, and then searches for already-existing conflicts.  Each
sees the other's tuple, and they deadlock.  That's no good, either.

Also, I think there are other cases where we think we're going to
insert, so we put the special index tuple in there, but then we decide
to update, so we don't need the promise tuple any more, but other
sessions are potentially still waiting for our XID to terminate even
though there's no conflict any more.  I'm having a hard time bringing
the details of those cases to mind ATM, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-25 Thread Peter Geoghegan
On Thu, Sep 25, 2014 at 11:53 AM, Robert Haas robertmh...@gmail.com wrote:
 I haven't looked at that part of the patch in detail yet, so... not
 really.  But I don't see why you'd ever need to restart heap tuple
 copying.  At most you'd need to re-extract datum1 from the tuples you
 have already copied.

Well, okay, technically it's not restarting heap tuple copying, but
it's about the same thing. The point is that you have to stream a
significant chunk of the memtuples array through memory again. Not to
mention having the infrastructure to do that and pick up where we left
off, which is significantly more code, all to make comparetup_heap() a
bit clearer (i.e. making it so that it won't have to think about an
extra sortsupport state).

 To find out how much that optimization buys, you
 should use tuples with many variable-length columns (say, 50)
 preceding the text column you're sorting on. I won't be surprised if
 that turns out to be expensive enough to be worth worrying about, but
 I have not benchmarked it.

Sorry, but I don't follow. I don't think the pertinent question is if
it's a noticeable cost. I think the pertinent question is if it's
worth it. Doing something about it necessitates a lot of extra memory
access. Not doing something about it hardly affects the amount of
memory access required, perhaps not at all.

-- 
Peter Geoghegan




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-25 Thread Bruce Momjian
On Thu, Sep 25, 2014 at 09:00:07PM +0200, Andres Freund wrote:
 On 2014-09-25 14:46:18 -0400, Bruce Momjian wrote:
  On Thu, Sep 25, 2014 at 02:39:37PM -0400, Tom Lane wrote:
   BTW, it seems like there is consensus that we ought to reorder the items
   in a jsonb object to have keys first and then values, independently of the
   other issues under discussion.  This means we *will* be breaking on-disk
   compatibility with 9.4beta2, which means pg_upgrade will need to be taught
   to refuse an upgrade if the database contains any jsonb columns.  Bruce,
   do you have time to crank out a patch for that?
  
  Yes, I can do that easily.  Tell me when you want it --- I just need a
  catalog version number to trigger on.
 
 Do you plan to make it conditional on jsonb being used in the database?
 That'd not be bad to reduce the pain for testers that haven't used jsonb.

Yes, I already have code that scans pg_attribute looking for columns
with problematic data types, outputs them to a file, and then throws an
error.
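
Roughly, such a check looks like this on the pg_upgrade side; a sketch
that assumes pg_upgrade's executeQueryOrDie()/pg_fatal() helpers and
leaves the per-database loop, version gating, and file output elided:

    PGresult *res = executeQueryOrDie(conn,
        "SELECT n.nspname, c.relname, a.attname "
        "FROM pg_catalog.pg_class c, pg_catalog.pg_namespace n, "
        "     pg_catalog.pg_attribute a "
        "WHERE c.oid = a.attrelid AND NOT a.attisdropped AND "
        "      a.atttypid = 'pg_catalog.jsonb'::pg_catalog.regtype AND "
        "      c.relnamespace = n.oid AND "
        "      n.nspname NOT IN ('pg_catalog', 'information_schema')");

    if (PQntuples(res) != 0)
        pg_fatal("Your installation contains jsonb columns; "
                 "see the generated file for a list.\n");
    PQclear(res);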

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] jsonb format is pessimal for toast compression

2014-09-25 Thread Alvaro Herrera
Bruce Momjian wrote:

   3.  9.3 multi-xact bugs spooked us into being more careful

Uh.  Multixact changes in 9.3 were infinitely more invasive than the
jsonb changes will ever be.  a) they touched basic visibility design and
routines, which are complex, understood by very few people, and have remained
mostly unchanged for ages; b) they changed on-disk format for an
underlying support structure, requiring pg_upgrade to handle the
conversion; c) they added new catalog infrastructure to keep track of
required freezing; d) they introduced new uint32 counters subject to
wraparound; e) they introduced a novel user of slru.c with 5-char long
filenames; f) they messed with tuple locking protocol and EvalPlanQual
logic for traversing update chains.  Maybe I'm forgetting others.

JSONB has none of these properties.  As far as I can see, the only hairy
issue here (other than getting Josh Berkus to actually test the proposed
patches) is that JSONB is changing on-disk format; but we're avoiding
most issues there by dictating that people with existing JSONB databases
need to pg_dump them, i.e. there is no conversion step being written for
pg_upgrade.

It's good to be careful; it's even better to be more careful.  I too
have learned a lesson there.

Anyway I have no opinion on the JSONB stuff, other than considering that
ignoring performance for large arrays and large objects seems to run
counter to the whole point of JSONB in the first place (and of course
failing to compress is part of that, too.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-25 Thread Peter Geoghegan
On Thu, Sep 25, 2014 at 12:11 PM, Robert Haas robertmh...@gmail.com wrote:
 I think that something like this might work, but the devil is in the
 details.  Suppose two people try to upsert into the same table at the
 same time.  There's one index.  If the transactions search that index
 for conflicts first, neither sees any conflicting tuples, and both
 proceed.  That's no good.  OK, so suppose each transaction inserts the
 special index tuple which you mention, to lock out concurrent inserts
 of that value, and then searches for already-existing conflicts.  Each
 sees the other's tuple, and they deadlock.  That's no good, either.

I'm very glad that you share my concern about deadlocks like this.

 Also, I think there are other cases where we think we're going to
 insert, so we put the special index tuple in there, but then we decide
 to update, so we don't need the promise tuple any more, but other
 sessions are potentially still waiting for our XID to terminate even
 though there's no conflict any more.  I'm having a hard time bringing
 the details of those cases to mind ATM, though.

Well, you might have a promise tuple in a unique index on attributes
not appearing in the UPDATE's targetlist, for one. You have the other
session waiting (doesn't have to be an upserter) just because we
*thought about* inserting a value as part of an upsert. That's pretty
bad.
-- 
Peter Geoghegan




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-25 Thread Peter Geoghegan
On Thu, Sep 25, 2014 at 7:12 AM, Simon Riggs si...@2ndquadrant.com wrote:
 The way forwards, in my view, is to define precisely the behaviour we
 wish to have. That definition will include the best current mechanism
 for running an UPSERT using INSERT/UPDATE/loops and comparing that
 against what is being provided here. We will then have a functional
 test of equivalence of the approaches, and a basis for making a
 performance test that shows that performance is increased without any
 loss of concurrency.

That sounds very reasonable.  While I'm sure that what I have here can
decisively beat the xact looping pattern in terms of performance as
measured by pgbench, the real performance advantage is that this
approach doesn't burn through XIDs. That was a concern that Andres
highlighted in relation to using the subxact looping pattern with
BDR's multi-master replication conflict resolution.

-- 
Peter Geoghegan




Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2014-09-25 Thread Jeff Janes
On Thu, Sep 25, 2014 at 10:00 AM, Jeff Janes jeff.ja...@gmail.com wrote:

 On Wed, Sep 24, 2014 at 2:48 AM, Dilip kumar dilip.ku...@huawei.com
 wrote:

  On 24 August 2014 11:33, Amit Kapila wrote



 Thanks for your comments, I have worked on both the review comment lists
 sent on 19 August and 24 August.



 Latest patch is attached with the mail.


 Hi Dilip,

 I think you have an off-by-one error in the index into the array of file
 handles.


Actually the problem is that the socket for the master connection was not
getting initialized; see my one-line addition here.

   connSlot = (ParallelSlot *) pg_malloc(concurrentCons *
                                         sizeof(ParallelSlot));
   connSlot[0].connection = conn;
+  connSlot[0].sock = PQsocket(conn);


However, I don't think it is good to just ignore errors from the select
call (like EBADF) and go into a busy loop instead, so there are more
changes needed than this.
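
The kind of handling meant here, sketched outside vacuumdb's actual
code (the function shape and names are assumptions): retry on EINTR,
but treat anything else, EBADF in particular, as fatal rather than
silently looping.

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/select.h>

    static int
    wait_on_sockets(int maxfd, fd_set *workset)
    {
        for (;;)
        {
            fd_set  rmask = *workset;
            int     rc = select(maxfd + 1, &rmask, NULL, NULL, NULL);

            if (rc >= 0)
            {
                *workset = rmask;       /* report which fds are ready */
                return rc;
            }
            if (errno == EINTR)
                continue;               /* interrupted by a signal: retry */
            fprintf(stderr, "select() failed: %s\n", strerror(errno));
            exit(1);                    /* e.g. EBADF: do not busy-loop */
        }
    }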

Also, cancelling the run (by hitting ctrl-C in the shell that invoked it)
does not seem to work on Linux.  I get a message that says "Cancel request
sent", but then it continues to finish the job anyway.

Cheers,

Jeff


Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-25 Thread Gregory Smith

On 9/25/14, 2:02 PM, Peter Eisentraut wrote:
But having the same parameter setting mean different things in 
different versions is the path to complete madness. 


Could we go so far as to remove support for unitless time settings 
eventually?  The fact that people are setting raw numbers in the 
configuration file and have to know the unit to understand what they 
just did has never been something I like.





Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-25 Thread Simon Riggs
On 25 September 2014 20:11, Robert Haas robertmh...@gmail.com wrote:

 My approach would be to insert an index tuple for that value into the
 index, but with the leaf ituple marked with an xid rather than a ctid.
 If someone tries to insert into the index they would see this and wait
 for the inserting transaction to end. The inserting transaction would
 then resolve what happens in the heap (insert/update) and later
 repoint the index tuple to the inserted/updated row version. I don't
 see the need for page level locking since it would definitely result
 in deadlocks (e.g. SQLServer).

 I think that something like this might work, but the devil is in the
 details.  Suppose two people try to upsert into the same table at the
 same time.  There's one index.  If the transactions search that index
 for conflicts first, neither sees any conflicting tuples, and both
 proceed.  That's no good.  OK, so suppose each transaction inserts the
 special index tuple which you mention, to lock out concurrent inserts
 of that value, and then searches for already-existing conflicts.  Each
 sees the other's tuple, and they deadlock.  That's no good, either.

The test index is unique, so our to-be-inserted value exists on only
one page, hence page locking applies while we insert it. The next
person to insert waits for the page lock and then sees the test tuple.

The page lock lasts only for the duration of the insertion of the
ituple, not for the whole operation.

 Also, I think there are other cases where we think we're going to
 insert, so we put the special index tuple in there, but then we decide
 to update, so we don't need the promise tuple any more, but other
 sessions are potentially still waiting for our XID to terminate even
 though there's no conflict any more.  I'm having a hard time bringing
 the details of those cases to mind ATM, though.

We make the decision to INSERT or UPDATE based upon what we find in
the test index. If a value is there already, we assume it's an UPDATE
and go to update the row this points to. If it has been deleted, we
loop back and try again/error. If the value is not present, we insert
the test tuple and progress as an INSERT, then loop back later to set
the ctid. There is no "don't need the promise tuple any more" case. We
would use the PK, identity or first unique index as the test index.
There is a case where an UPSERT conflicts with an INSERT causing the
latter to abort.

Anyway, this is why we need the design more clearly exposed, so you
can tell me I'm wrong by showing me the URL of it done right.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-25 Thread Simon Riggs
On 25 September 2014 19:59, Peter Geoghegan p...@heroku.com wrote:
 On Thu, Sep 25, 2014 at 9:20 AM, Robert Haas robertmh...@gmail.com wrote:
 I've never been a fan of putting the index name in there.

 Me neither. Although I do understand Kevin's concern about the user's
 intent surrounding which unique index to merge on.

The use case cited is real. My solution of using an after trigger
works, yet without adding specific functionality to this command. So we
can achieve what users want without complicating things here.

If we do decide we really want it, let's add it as a later patch.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Review of GetUserId() Usage

2014-09-25 Thread Stephen Frost
* Peter Eisentraut (pete...@gmx.net) wrote:
 On 9/24/14 4:58 PM, Stephen Frost wrote:
  * Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
  I think the case for pgstat_get_backend_current_activity() and
  pg_stat_get_activity and the other pgstatfuncs.c callers is easy to make
  and seems acceptable to me; but I would leave pg_signal_backend out of
  that discussion, because it has a potentially harmful side effect.  By
  requiring SET ROLE you add an extra layer of protection against
  mistakes.  (Hopefully, pg_signal_backend() is not a routine thing for
  well-run systems, which means human intervention, and therefore the room
  for error isn't insignificant.)
  
  While I certainly understand where you're coming from, I don't really
  buy into it.  Yes, cancelling a query (the only thing normal users can
  do anyway- they can't terminate backends) could mean the loss of any
  in-progress work, but it's not like 'rm' and I don't see that it needs
  to require extra hoops for individuals to go through.
 
 It would be weird if it were inconsistent: some things require role
 membership, some things require SET ROLE.  Try explaining that.

I agree.  We already have that distinction, through inherit vs.
noinherit.  I don't think it makes sense to have it also for individual
commands which we feel might not be as safe.  You could still go
delete all their data w/o a set role if you wanted...

Thanks,

Stephen




Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-09-25 Thread Robert Haas
On Thu, Sep 25, 2014 at 3:17 PM, Peter Geoghegan p...@heroku.com wrote:
 To find out how much that optimization buys, you
 should use tuples with many variable-length columns (say, 50)
 preceding the text column you're sorting on. I won't be surprised if
 that turns out to be expensive enough to be worth worrying about, but
 I have not benchmarked it.

 Sorry, but I don't follow. I don't think the pertinent question is if
 it's a noticeable cost. I think the pertinent question is if it's
 worth it. Doing something about it necessitates a lot of extra memory
 access. Not doing something about it hardly affects the amount of
 memory access required, perhaps not at all.

I think you're mincing words.  If you go back and fix datum1, you'll
spend a bunch of effort doing that.  If you don't, you'll pay a cost
on every comparison to re-find the relevant column inside each tuple.
You can compare those costs in a variety of cases, including the one I
mentioned, where the latter cost will be relatively high.
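
To illustrate the second cost in isolation (a sketch with assumed
names, not tuplesort.c's code): heap_getattr() can use cached attribute
offsets only up to the first NULL or variable-width column, so with 50
varlena columns ahead of the sort key, every comparison has to walk all
50 of them to locate the key.

    #include "postgres.h"
    #include "access/htup_details.h"
    #include "utils/sortsupport.h"

    /* Compare two heap tuples on one attribute, re-extracting the
     * column each time instead of using a cached datum1. */
    static int
    compare_by_reextraction(HeapTuple ltup, HeapTuple rtup,
                            AttrNumber attno, TupleDesc tupDesc,
                            SortSupport ssup)
    {
        bool    isnull1, isnull2;
        Datum   d1 = heap_getattr(ltup, attno, tupDesc, &isnull1);
        Datum   d2 = heap_getattr(rtup, attno, tupDesc, &isnull2);

        return ApplySortComparator(d1, isnull1, d2, isnull2, ssup);
    }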

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] END_OF_RECOVERY shutdowns and ResetUnloggedRelations()

2014-09-25 Thread Andres Freund
Hi,

On 2014-09-24 17:06:05 +0530, Abhijit Menon-Sen wrote:
 Hi Andres, Robert.
 
 I've attached four patches here.

Cool. Will review.

 1. Move the call to ResetUnloggedRelations(UNLOGGED_RELATION_INIT) to
earlier in StartupXLOG.
 
 2. Inside that function, issue fsync()s for the main forks we create by
copying the _init fork.

These two imo should definitely be backpatched.

 3. A small fixup to add a const to typedef char *FileName, because the
earlier patch gave me warnings about discarding const-ness. This is
consistent with many other functions in fd.c that take const char *.

I'm happy to consider that one for master (although I seem to recall
having had a patch for it rejected?), but I don't think we want to do
that in the backbranches.

 4. Issue an fsync() on the data directory at startup if we need to
perform crash recovery.

I personally think this one should be backpatched too. Does anyone
believe differently?

 From d8726c06cdf11674661eac1d091cf7edd05c2a0c Mon Sep 17 00:00:00 2001
 From: Abhijit Menon-Sen a...@2ndquadrant.com
 Date: Wed, 24 Sep 2014 14:43:18 +0530
 Subject: Call ResetUnloggedRelations(UNLOGGED_RELATION_INIT) earlier
 
 We need to call this after recovery, but not after the END_OF_RECOVERY
 checkpoint is written. If we crash after that checkpoint, for example
 because of ENOSPC in PreallocXlogFiles, the checkpoint has already set
 the ControlFile-state to DB_SHUTDOWNED, so we don't enter recovery
 again at startup.
 
 Because we did call ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP)
 in the first cleanup, this leaves the database with a bunch of _init
 forks for unlogged relations, but no main forks, leading to scary
 errors.
 
 See thread from 20140912112246.ga4...@alap3.anarazel.de for details.

With an explanation. Shiny!

 ---
  src/backend/access/transam/xlog.c | 16 
  1 file changed, 8 insertions(+), 8 deletions(-)
 
 diff --git a/src/backend/access/transam/xlog.c 
 b/src/backend/access/transam/xlog.c
 index 46eef5f..218f7fb 100644
 --- a/src/backend/access/transam/xlog.c
 +++ b/src/backend/access/transam/xlog.c
 @@ -6863,6 +6863,14 @@ StartupXLOG(void)
   ShutdownWalRcv();
  
   /*
 +  * Reset initial contents of unlogged relations.  This has to be done
 +  * AFTER recovery is complete so that any unlogged relations created
 +  * during recovery also get picked up.
 +  */
 + if (InRecovery)
 + ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
 +

+

 * Also recovery shouldn't be regarded as successful if the reset fails -
 * e.g. because of ENOSPC.

 +
 + /*
 +  * copy_file() above has already called pg_flush_data() on the
 +  * files it created. Now we need to fsync those files, because
 +  * a checkpoint won't do it for us while we're in recovery.
 +  */

+

* Doing this in a separate pass is advantageous for performance reasons
* because it allows the kernel to perform all the flushes at once.

 + /*
 +  * If we need to perform crash recovery, we issue an fsync on the
 +  * data directory to try to ensure that any data written before the
 +  * crash are flushed to disk. Otherwise a power failure in the near
 +  * future might mean that earlier unflushed writes are lost, but the
 +  * more recent data written to disk from here on are persisted.
 +  */
 +
 + if (ControlFile->state != DB_SHUTDOWNED &&
 +     ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
 +     fsync_fname(data_directory, true);
 +
   if (ControlFile-state == DB_SHUTDOWNED)
   {
   /* This is the expected case, so don't be chatty in standalone 
 mode */
 -- 

Unless I'm missing something, this isn't sufficient. We need to fsync the
files in the data directory, not just the toplevel directory?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-25 Thread Simon Riggs
On 25 September 2014 20:38, Peter Geoghegan p...@heroku.com wrote:
 On Thu, Sep 25, 2014 at 7:12 AM, Simon Riggs si...@2ndquadrant.com wrote:
 The way forwards, in my view, is to define precisely the behaviour we
 wish to have. That definition will include the best current mechanism
 for running an UPSERT using INSERT/UPDATE/loops and comparing that
 against what is being provided here. We will then have a functional
 test of equivalence of the approaches, and a basis for making a
 performance test that shows that performance is increased without any
 loss of concurrency.

 That sounds very reasonable.

So I promise not to discuss locking until we get the first things done.

My suggested approach to get this committed is...

A. UPDATE/INSERT privilege infrastructure.
Add tests to it, make it separately committable, so we can get that done.
Submit to Oct CF; get that done early.

B. Agree command semantics by producing these things
* Explanatory documentation (Ch6.4 Data Manipulation - Upsert)
* SQL Reference Documentation (INSERT)
* Test cases for feature
* Test cases for concurrency
* Test cases for pgbench

All of the above, as a separate committable patch. I hate the fact
that you have written no user-facing documentation for this feature.
How can anyone tell whether the tests you've written are correct, or
even consistent with a particular definition of correctness?
Submit as patch for review only to Oct 15 CF
We then agree what is required for further work
At this stage, poll the Django and Rails communities for acceptance
and early warning of these features. Listen.

C. Internal weirdness
Submit C based upon earlier agreed B, submit to Dec 15 CF, major patch
deadline, so we can fine tune for last CF.
Then Heikki rewrites half your patch in a better way, you thank him
and then we commit. All done.

 While I'm sure that what I have here can
 decisively beat the xact looping pattern in terms of performance as
 measured by pgbench, the real performance advantage is that this
 approach doesn't burn through XIDs. That was a concern that Andres
 highlighted in relation to using the subxact looping pattern with
 BDR's multi-master replication conflict resolution.

But we're still discussing SQL semantics. So first things first, then
loop back around, hoping our design has not been concurrently
deleted...

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] proposal: rounding up time value less than its unit.

2014-09-25 Thread Stephen Frost
* Gregory Smith (gregsmithpg...@gmail.com) wrote:
 On 9/25/14, 2:02 PM, Peter Eisentraut wrote:
 But having the same parameter setting mean different things in
 different versions is the path to complete madness.
 
 Could we go so far as to remove support for unitless time settings
 eventually?  The fact that people are setting raw numbers in the
 configuration file and have to know the unit to understand what they
 just did has never been something I like.

I could certainly get behind that idea...  Though I do understand that
people will complain about backwards compatibility, etc, etc.

Thanks,

Stephen




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-25 Thread Peter Geoghegan
On Thu, Sep 25, 2014 at 1:21 PM, Simon Riggs si...@2ndquadrant.com wrote:
 The test index is unique, so our to-be-inserted value exists on only
 one page, hence page locking applies while we insert it. The next
 person to insert waits for the page lock and then sees the test tuple.

 The page lock lasts only for the duration of the insertion of the
 ituple, not for the whole operation.

(by page lock, I take it you mean buffer lock - converting that into a
page hwlock is what I do).

This is where it gets quite complicated. What happens if row locking
on upsert finds a conflicting update changing uniquely-constrained
attributes? Sure, a vanilla non-HOT update will fail on inserting a
unique index tuple, but *it* can still cause us a row-level conflict,
and *it* can only fail (with a dup violation) when we commit/abort.
But now we're obligated to wait on it to get the row lock, and it's
obligated to wait on us to get the promise tuple lock, or any other
sort of value lock that hasn't already been released when we go to
row lock. Deadlock.

You cannot get away with failing to release the promise tuple/value
lock if you want to maintain useful guarantees.

It doesn't need to be a vanilla non-HOT update.  That's just the
simplest example I can think of.

 Also, I think there are other cases where we think we're going to
 insert, so we put the special index tuple in there, but then we decide
 to update, so we don't need the promise tuple any more, but other
 sessions are potentially still waiting for our XID to terminate even
 though there's no conflict any more.  I'm having a hard time bringing
 the details of those cases to mind ATM, though.

 We make the decision to INSERT or UPDATE based upon what we find in
 the test index. If a value if there already, we assume its an UPDATE
 and go to update the row this points to. If it has been deleted we
 loop back and try again/error.

Sure, you can throw an error, and that makes things a lot easier. It
also implies that the implementation is inferior to the subxact
looping pattern, which you've already implied is a thing we must beat
in every way. Frankly, I think it's a cop-out to just throw an error,
and I don't think it'll end up being some theoretical risk. It'll
happen often if it is allowed to happen at all. Allowing it to happen
almost defeats the purpose of the feature - the big appeal of the
feature is that it makes guarantees about the outcome.

-- 
Peter Geoghegan




Re: [HACKERS] Immediate standby promotion

2014-09-25 Thread Simon Riggs
On 25 September 2014 18:30, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-09-25 18:18:09 +0100, Simon Riggs wrote:
 On 25 September 2014 16:29, Andres Freund and...@2ndquadrant.com wrote:
  I think that's not really related. Such a promotion doesn't cause data
  loss in the sense of losing data a *clueful* operator wanted to
  keep. Yes, it can be used wrongly, but it's far from alone in that.

 Yes it does cause data loss. The clueful operator has no idea where
  they are, so there is no "used rightly" in that case.

 What? There definitely are cases where you don't need to know that to
 the T. Just think of the - quite frequently happening - need to promote
 a standby to run tests or reporting queries that can't be run on a
 standby.

What do they do with the standby afterwards?

Perhaps for testing, but I'd hope that Business Intelligence is done
by freezing databases at known target times. So at least you can say,
"using a database snapshot of 9am, we had the following results".

We seem to be trying to justify something that is dangerous and will
destroy data for incautious users. Of course it has uses, but that's
not the point; it's the danger that is the problem, not the lack of
use. We go to a lot of trouble to avoid footguns elsewhere across many
years, so I can't see why you'd want to have the --footgun option
added here. recovery-target = 'vague'


 Sure, you shouldn't use it if you expect a very specific set of the data
 being there, but that's not always necessary. And that's why it should
 never, ever be the default.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] END_OF_RECOVERY shutdowns and ResetUnloggedRelations()

2014-09-25 Thread Abhijit Menon-Sen
At 2014-09-25 22:41:18 +0200, and...@2ndquadrant.com wrote:

  * Also recovery shouldn't be regarded successful if the reset fails -
  * e.g. because of ENOSPC.

OK.

 * Doing this in a separate pass is advantageous for performance reasons
 * because it allows the kernel to perform all the flushes at once.

OK.

 Unless I miss something this isn't sufficient. We need to fsync the
 files in the data directory, not just the toplevel directory?

No, of course you're right. So a separate function that does the moral
equivalent of find $PGDATA -exec fsync_fname …?
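
Something like this, perhaps (a rough standalone sketch of the idea
only, not the actual patch; walk_and_fsync is an invented name). Two
caveats are noted in the comments: skipping symlinks also skips the
tablespace links under pg_tblspc, which real code must handle
explicitly, and per the comment quoted earlier a real implementation
would split this into two passes, first initiating writeback
everywhere and only then fsyncing:

#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static void
walk_and_fsync(const char *path)
{
    DIR        *dir = opendir(path);
    struct dirent *de;
    int         fd;

    if (dir == NULL)
        return;                 /* real code must report this error */

    while ((de = readdir(dir)) != NULL)
    {
        char        subpath[4096];
        struct stat st;

        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        snprintf(subpath, sizeof(subpath), "%s/%s", path, de->d_name);

        /* lstat: don't follow symlinks (e.g. the pg_tblspc entries) */
        if (lstat(subpath, &st) != 0)
            continue;

        if (S_ISDIR(st.st_mode))
            walk_and_fsync(subpath);    /* recurse into subdirectory */
        else if (S_ISREG(st.st_mode))
        {
            fd = open(subpath, O_RDONLY);
            if (fd >= 0)
            {
                fsync(fd);              /* flush the file's contents */
                close(fd);
            }
        }
    }
    closedir(dir);

    /* finally flush the directory entry itself */
    fd = open(path, O_RDONLY);
    if (fd >= 0)
    {
        fsync(fd);
        close(fd);
    }
}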

Will resubmit with the additional comments.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] END_OF_RECOVERY shutdowns and ResetUnloggedRelations()

2014-09-25 Thread Andres Freund
On 2014-09-26 02:34:06 +0530, Abhijit Menon-Sen wrote:
 At 2014-09-25 22:41:18 +0200, and...@2ndquadrant.com wrote:
  Unless I miss something this isn't sufficient. We need to fsync the
  files in the data directory, not just the toplevel directory?
 
 No, of course you're right. So a separate function that does the moral
 equivalent of find $PGDATA -exec fsync_fname …?

Probably will require some care to deal correctly with tablespaces.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-09-25 Thread Peter Geoghegan
On Thu, Sep 25, 2014 at 1:48 PM, Simon Riggs si...@2ndquadrant.com wrote:
 A. UPDATE/INSERT privilege infrastructure.
 Add tests to it, make it separately committable, so we can get that done.
 Submit to Oct CF; get that done early.

Makes sense. As long as we assume that we want a unified syntax like
this - that is, that we need something vaguely insert-update or
update-insertish - then we need this. Unfortunately, we cannot add
regression tests for this without almost the full patch set.

 B. Agree command semantics by producing these things
 * Explanatory documentation (Ch6.4 Data Manipulation - Upsert)
 * SQL Reference Documentation (INSERT)
 * Test cases for feature
 * Test cases for concurrency
 * Test cases for pgbench

Okay. I do have stress tests, which are separately maintained, in case
you missed that:

https://github.com/petergeoghegan/upsert

 All of the above, as a separate committable patch. I hate the fact
 that you have written no user facing documentation for this feature.
 How can anyone tell whether the tests you've written are correct or
 even consistent to a particular definition of correctness?

I'd hoped that the commit messages and my discussion of the feature
were adequate.

 Submit as patch for review only to Oct 15 CF
 We then agree what is required for further work
 At this stage, poll the Django and Rails communities for acceptance
 and early warning of these features. Listen.

I know an original founder of the Django project quite well - Jacob
Kaplan-Moss (a co-worker - the guy who keynoted pgOpen in its second
year). He is very interested in this effort.

 C. Internal weirdness
 Submit C based upon earlier agreed B, submit to Dec 15 CF, major patch
 deadline, so we can fine tune for last CF.
 Then Heikki rewrites half your patch in a better way, you thank him
 and then we commit. All done.

I don't have a problem with Heikki or anyone else rewriting the value
locking part of the patch, provided it meets my requirements for such
a mechanism. Since Heikki already agreed that that standard should be
imposed, he'd hardly take issue with it now.

However, the fact is that once you actually make something like
promise tuples meet that standard, at the very least it becomes a lot
messier than you'd think. Heikki's final prototype "super deleted"
tuples by setting their xmin to InvalidTransactionId. We weren't sure
that that doesn't break some random other heapam code. Consider this,
for example:

https://github.com/postgres/postgres/blob/REL9_4_STABLE/src/backend/executor/execMain.c#L1961

If you think about it for a while, that one looks safe in the face of
setting xmin to InvalidTransactionId the way the later prototype patch
did, but there are other places where that is less clear. In short, it
becomes something that we have to worry about forever, because "xmin
cannot change without the tuple in the slot changing" is clearly an
invariant for certain purposes. It might accidentally fail to fail
right now, but I'm not comfortable with it.
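
To make the style of check concrete, here is a paraphrased sketch of
the kind of test such code performs (written against the 9.4 tree;
tuple_still_matches is an invented name, not an exact quote of the
line linked above):

#include "postgres.h"
#include "access/htup_details.h"
#include "access/transam.h"

/*
 * Does the buffer still hold the row version we expected?  Code like
 * this assumes a tuple's xmin is immutable, so a mismatch must mean
 * the slot was recycled for an unrelated tuple.  A tuple whose xmin
 * was later clobbered to InvalidTransactionId would give a false
 * negative here, silently changing the outcome.
 */
static bool
tuple_still_matches(HeapTupleHeader tup, TransactionId priorXmax)
{
    return TransactionIdEquals(HeapTupleHeaderGetXmin(tup), priorXmax);
}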

Now, I might be convinced that that's actually the way to go. I have
an open mind. But that will take discussion. I like that page
heavyweight locking is something that many systems do (even including
Oracle, I believe). Making big changes to nbtree is always something
that deserves to be met with skepticism, but it is nice to have an
implementation that lives entirely inside the AM.

Sorry, I forgot to not talk about locking.

 But we're still discussing SQL semantics. So first things first, then
 loop back around, hoping our design has not been concurrently
 deleted...

I hope the discussion can avoid unprincipled deadlocks.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


  1   2   >