Re: [HACKERS] new --maintenance-db options

2012-06-26 Thread Amit Kapila
From: pgsql-hackers-ow...@postgresql.org
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tom Lane
Robert Haas robertmh...@gmail.com writes:
 From pg_upgrade's perspective, it would
 be nice to have a flag that starts the server in some mode where
 nobody but pg_upgrade can connect to it and all connections are
 automatically allowed, but it's not exactly clear how to implement
 nobody but pg_upgrade can connect to it.

 The implementation I've wanted to see for some time is that you can
 start a standalone backend, but it speaks FE/BE protocol to its caller
 (preferably over pipes, so that there is no issue whatsoever of where
 you can securely put a socket or anything like that).  

Can't it be done by following the FE/BE protocol, but calling the server
APIs directly at the required places? That kind of implementation could be
more performant than adding any communication layer on top, which would be
beneficial for embedded databases.

 Making that
 happen might be a bit too much work if pg_upgrade were the only use
 case, but there are a lot of people who would like to use PG as an
 embedded database, and this might be close enough for such use-cases.

Seeing PG run as an embedded database would be interesting for many people
using PG.
Another use case for embedded databases is that they allow additional remote
connections as well, to monitor the operations in the database. However, that
can be done in a later version of the implementation.

With Regards,
Amit Kapila.




Re: [HACKERS] proof concept - access to session variables on client side

2012-06-26 Thread Magnus Hagander
On Tue, Jun 26, 2012 at 7:06 AM, Pavel Stehule pavel.steh...@gmail.com wrote:
 Hello

 I worked on a simple patch that enables access from the server side to
 client-side data. It adds two new hooks to libpq - one for returning the
 local context, a second for setting the local context.

 The motivation is integrating the capabilities of the psql console with a
 stronger language - plpgsql. A second target is enabling psql to save the
 result of some server-side process. It improves the vars feature in psql.

 pavel ~/src/postgresql/src $ cat test.sql
 \echo value of external paremeter is :myvar

 do $$
 begin
  -- we can take any session variable on client side
  -- it is safe against to SQL injection
  raise notice 'external parameter accessed from plpgsql is %',
 hgetvar('myvar');

  -- we can change this session variable and finish transaction
  perform hsetvar('myvar', 'Hello, World');
 end;
 $$ language plpgsql;

 \echo new value of session variable is :myvar

 cat test.sql | psql postgres -v myvar=Hello
 value of external paremeter is Hello
 NOTICE:  external parameter accessed from plpgsql is Hello
 DO
 new value of session variable is Hello, World

 This is just a proof of concept - there should be better integration with
 PL languages, using a cache for reads on the server side, ...

 Comments?

Why not just use a custom GUC variable instead? E.g. psql could issue SET
psql.myvar='Hello, World', and then you'd need no changes at all in the
backend. Maybe having a shorthand interface for accessing GUCs in psql
would help make it easier, but do we really need a whole new variable
concept?
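
A minimal sketch of that approach, assuming a server new enough to accept an
arbitrary two-part custom GUC name without custom_variable_classes (true as
of 9.2); current_setting() and set_config() are the stock accessors:

SET psql.myvar = 'Hello';

do $$
begin
  -- read the client-provided value on the server side
  raise notice 'myvar is %', current_setting('psql.myvar');
  -- write it back; false = keep the new value for the rest of the session
  perform set_config('psql.myvar', 'Hello, World', false);
end;
$$ language plpgsql;

SHOW psql.myvar;   -- now: Hello, World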

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] proof concept - access to session variables on client side

2012-06-26 Thread Pavel Stehule
2012/6/26 Magnus Hagander mag...@hagander.net:
 On Tue, Jun 26, 2012 at 7:06 AM, Pavel Stehule pavel.steh...@gmail.com 
 wrote:
 Hello

 I worked on a simple patch that enables access from the server side to
 client-side data. It adds two new hooks to libpq - one for returning the
 local context, a second for setting the local context.

 The motivation is integrating the capabilities of the psql console with a
 stronger language - plpgsql. A second target is enabling psql to save the
 result of some server-side process. It improves the vars feature in psql.

 pavel ~/src/postgresql/src $ cat test.sql
 \echo value of external paremeter is :myvar

 do $$
 begin
  -- we can take any session variable on client side
  -- it is safe against to SQL injection
  raise notice 'external parameter accessed from plpgsql is %',
 hgetvar('myvar');

  -- we can change this session variable and finish transaction
  perform hsetvar('myvar', 'Hello, World');
 end;
 $$ language plpgsql;

 \echo new value of session variable is :myvar

 cat test.sql | psql postgres -v myvar=Hello
 value of external paremeter is Hello
 NOTICE:  external parameter accessed from plpgsql is Hello
 DO
 new value of session variable is Hello, World

 This is just a proof of concept - there should be better integration with
 PL languages, using a cache for reads on the server side, ...

 Comments?

 Why not just use a custom GUC variable instead? E.g. psql could issue SET
 psql.myvar='Hello, World', and then you'd need no changes at all in the
 backend. Maybe having a shorthand interface for accessing GUCs in psql
 would help make it easier, but do we really need a whole new variable
 concept?

GUC variables don't help with access to psql's command-line
parameters from DO PL code.

Regards

Pavel




 --
  Magnus Hagander
  Me: http://www.hagander.net/
  Work: http://www.redpill-linpro.com/



Re: [HACKERS] [PATCH] lock_timeout and common SIGALRM framework

2012-06-26 Thread Boszormenyi Zoltan

2012-06-26 06:59 keltezéssel, Alvaro Herrera írta:

I cleaned up the framework patch a bit.  My version's attached.  Mainly,
returning false for failure in some code paths that are only going to
have the caller elog(FATAL) is rather pointless -- it seems much better
to just have the code itself do the elog(FATAL).  In any case, I
searched for reports of those error messages being reported in the wild
and saw none.


OK. The cleanups are always good.

One nitpick, though. Your version doesn't contain timeout.h,
and compilation fails because of it. :-) You might have forgotten
to do git commit -a.


There are other things but they are in the nitpick department.  (A
reference to ->check_timeout remains that needs to be fixed too).


Yes, it's called ->timeout_func() now.


There is one thing that still bothers me on this whole framework patch,
but I'm not sure it's easily fixable.  Basically the API to timeout.c is
the whole list of timeouts and their whole behaviors.  If you want to
add a new one, you can't just call into the entry points, you have to
modify timeout.c itself (as well as timeout.h as well as whatever code
it is that you want to add timeouts to).  This may be good enough, but I
don't like it.  I think it boils down to "proctimeout.h is cheating".

The interface I would actually like to have is something that lets the
modules trying to get a timeout register the timeout-checking function
as a callback.  API-wise, it would be much cleaner.  However, I'm not
really sure that code-wise it would be any cleaner at all.  In fact I
think it'd be rather cumbersome.  However, if you could give that idea
some thought, it'd be swell.


Well, I can make the registration interface similar to how LWLocks
are treated, but that doesn't avoid modification of the base_timeouts
array in case a new internal use case arises. Say:

#define USER_TIMEOUTS    4

int    n_timeouts = TIMEOUT_MAX;
static timeout_params base_timeouts[TIMEOUT_MAX + USER_TIMEOUTS];

and register_timeout() adds data after TIMEOUT_MAX.
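
A sketch of how register_timeout() might fill the slots after TIMEOUT_MAX.
Apart from base_timeouts, TIMEOUT_MAX and USER_TIMEOUTS (quoted above), every
name here is invented for illustration:

typedef bool (*timeout_check_fn) (void);

typedef struct timeout_params
{
    int              index;          /* slot in base_timeouts[] */
    timeout_check_fn timeout_func;   /* called from the SIGALRM handler */
} timeout_params;

int
register_timeout(timeout_check_fn func)
{
    if (n_timeouts >= TIMEOUT_MAX + USER_TIMEOUTS)
        elog(FATAL, "no free timeout slots");

    base_timeouts[n_timeouts].index = n_timeouts;
    base_timeouts[n_timeouts].timeout_func = func;
    return n_timeouts++;    /* handle the caller later uses to arm the timeout */
}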


I haven't looked at the second patch at all yet.


This is the whole point of the first patch. But as I said above, a new
internal timeout source is a new internal use case - one that touches a
lot of places and cannot simply be handled by registering a new timeout
source from outside.

--
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
     http://www.postgresql.at/




Re: [HACKERS] proof concept - access to session variables on client side

2012-06-26 Thread Magnus Hagander
On Tue, Jun 26, 2012 at 9:50 AM, Pavel Stehule pavel.steh...@gmail.com wrote:
 2012/6/26 Magnus Hagander mag...@hagander.net:
 On Tue, Jun 26, 2012 at 7:06 AM, Pavel Stehule pavel.steh...@gmail.com 
 wrote:
 Hello

 I worked on a simple patch that enables access from the server side to
 client-side data. It adds two new hooks to libpq - one for returning the
 local context, a second for setting the local context.

 The motivation is integrating the capabilities of the psql console with a
 stronger language - plpgsql. A second target is enabling psql to save the
 result of some server-side process. It improves the vars feature in psql.

 pavel ~/src/postgresql/src $ cat test.sql
 \echo value of external paremeter is :myvar

 do $$
 begin
  -- we can take any session variable on client side
  -- it is safe against to SQL injection
  raise notice 'external parameter accessed from plpgsql is %',
 hgetvar('myvar');

  -- we can change this session variable and finish transaction
  perform hsetvar('myvar', 'Hello, World');
 end;
 $$ language plpgsql;

 \echo new value of session variable is :myvar

 cat test.sql | psql postgres -v myvar=Hello
 value of external paremeter is Hello
 NOTICE:  external parameter accessed from plpgsql is Hello
 DO
 new value of session variable is Hello, World

 This is just a proof of concept - there should be better integration with
 PL languages, using a cache for reads on the server side, ...

 Comments?

 Why not just use a custom GUC variable instead? E.g. psql could issue SET
 psql.myvar='Hello, World', and then you'd need no changes at all in the
 backend. Maybe having a shorthand interface for accessing GUCs in psql
 would help make it easier, but do we really need a whole new variable
 concept?

 GUC variables don't help with access to psql's command-line
 parameters from DO PL code.

But with a small change to psql they could, without the need for a
whole new type of variable. For example, psql could set all those
variables as psql.commandlinevarname, which could then be accessed
from the DO PL code just fine.
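
A sketch of how that could look; the forwarding itself is the hypothetical
part (psql does not do this today), while the SET and the DO block work as
shown on a 9.2-era server:

$ psql postgres -v myvar=Hello
-- psql would issue something like this on connect:
SET psql.myvar = 'Hello';

-- ...after which server-side code can read it:
do $$
begin
  raise notice 'command-line myvar is %', current_setting('psql.myvar');
end;
$$ language plpgsql;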

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] proof concept - access to session variables on client side

2012-06-26 Thread Pavel Stehule
2012/6/26 Magnus Hagander mag...@hagander.net:
 On Tue, Jun 26, 2012 at 9:50 AM, Pavel Stehule pavel.steh...@gmail.com 
 wrote:
 2012/6/26 Magnus Hagander mag...@hagander.net:
 On Tue, Jun 26, 2012 at 7:06 AM, Pavel Stehule pavel.steh...@gmail.com 
 wrote:
 Hello

 I worked on a simple patch that enables access from the server side to
 client-side data. It adds two new hooks to libpq - one for returning the
 local context, a second for setting the local context.

 The motivation is integrating the capabilities of the psql console with a
 stronger language - plpgsql. A second target is enabling psql to save the
 result of some server-side process. It improves the vars feature in psql.

 pavel ~/src/postgresql/src $ cat test.sql
 \echo value of external paremeter is :myvar

 do $$
 begin
  -- we can take any session variable on client side
  -- it is safe against to SQL injection
  raise notice 'external parameter accessed from plpgsql is %',
 hgetvar('myvar');

  -- we can change this session variable and finish transaction
  perform hsetvar('myvar', 'Hello, World');
 end;
 $$ language plpgsql;

 \echo new value of session variable is :myvar

 cat test.sql | psql postgres -v myvar=Hello
 value of external paremeter is Hello
 NOTICE:  external parameter accessed from plpgsql is Hello
 DO
 new value of session variable is Hello, World

 This is just a proof of concept - there should be better integration with
 PL languages, using a cache for reads on the server side, ...

 Comments?

 Why not just use a custom GUC variable instead? E.g. psql could issue SET
 psql.myvar='Hello, World', and then you'd need no changes at all in the
 backend. Maybe having a shorthand interface for accessing GUCs in psql
 would help make it easier, but do we really need a whole new variable
 concept?

 GUC variables don't help with access to psql's command-line
 parameters from DO PL code.

 But with a small change to psql they could, without the need for a
 whole new type of variable. For example, psql could set all those
 variable as psql.commandlinevarname, which could then be accessed
 from the DO PL code just fine.

Yes, that is a possibility too. It has different issues - it could send
unwanted variables - maybe some compromise is optimal.




 --
  Magnus Hagander
  Me: http://www.hagander.net/
  Work: http://www.redpill-linpro.com/



Re: [HACKERS] REVIEW: Optimize referential integrity checks (todo item)

2012-06-26 Thread Vik Reykja
On Wed, Jun 20, 2012 at 2:19 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 I've marked this patch committed, although in the end there was nothing
 left of it ;-)


Thank you, Dean and Tom!

I'm sorry for not participating in this thread; I've been away for the past
five weeks and have much catching up to do.


Re: [HACKERS] WIP Patch: Selective binary conversion of CSV file foreign tables

2012-06-26 Thread Etsuro Fujita
Hi Kaigai-san,

 -Original Message-
 From: Kohei KaiGai [mailto:kai...@kaigai.gr.jp]
 Sent: Monday, June 25, 2012 9:49 PM
 To: Etsuro Fujita
 Cc: Robert Haas; pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] WIP Patch: Selective binary conversion of CSV file
 foreign tables
 
 Fujita-san,
 
 The revised patch is almost good for me.
 The only point I noticed is the check for redundant or duplicated options
 that I pointed out in the previous post.
 So, if you agree with the attached patch, I'd like to hand this patch over
 to the committers.

OK, I agree with you.  Thanks for the revision!

Besides the revision, I modified check_selective_binary_conversion() to run
heap_close() in the whole-row-reference case.  Attached is an updated version of
the patch.

Thanks.

Best regards,
Etsuro Fujita

 All I changed is:
  --- a/src/backend/commands/copy.c
  +++ b/src/backend/commands/copy.c
  @@ -122,6 +122,11 @@ typedef struct CopyStateData
  @@ -217,7 +221,7 @@ index 98bcb2f..0143d81 100644
         }
  +      else if (strcmp(defel->defname, "convert_binary") == 0)
  +      {
 -+          if (cstate->convert_binary)
 ++          if (cstate->convert_selectively)
  +              ereport(ERROR,
  +                      (errcode(ERRCODE_SYNTAX_ERROR),
  +                       errmsg("conflicting or redundant options")));
 
 Thanks,
 
 2012/6/20 Etsuro Fujita fujita.ets...@lab.ntt.co.jp:
  Hi KaiGai-san,
 
  Thank you for the review.
 
  -Original Message-
  From: pgsql-hackers-ow...@postgresql.org
  [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kohei KaiGai
  Sent: Wednesday, June 20, 2012 1:26 AM
  To: Etsuro Fujita
  Cc: Robert Haas; pgsql-hackers@postgresql.org
  Subject: Re: [HACKERS] WIP Patch: Selective binary conversion of CSV
  file foreign tables
 
  Hi Fujita-san,
 
  Could you rebase this patch towards the latest tree?
  It was unavailable to apply the patch cleanly.
 
  Sorry, I updated the patch.  Please find attached an updated version
  of the patch.
 
  I looked over the patch, then noticed a few points.
 
  At ProcessCopyOptions(), defel->arg can be NIL, can't it?
  If so, cstate->convert_binary is not a suitable flag for checking the
  redundant option. It seems to me cstate->convert_selectively is the more
  suitable flag to check.
 
  +       else if (strcmp(defel->defname, "convert_binary") == 0)
  +       {
  +           if (cstate->convert_binary)
  +               ereport(ERROR,
  +                       (errcode(ERRCODE_SYNTAX_ERROR),
  +                        errmsg("conflicting or redundant options")));
  +           cstate->convert_selectively = true;
  +           if (defel->arg == NULL || IsA(defel->arg, List))
  +               cstate->convert_binary = (List *) defel->arg;
  +           else
  +               ereport(ERROR,
  +                       (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  +                        errmsg("argument to option \"%s\" must be a list of column names",
  +                               defel->defname)));
  +       }
 
  Yes, defel->arg can be NIL.  defel->arg is a List structure listing
  all the columns that need to be converted to binary representation, which
  is NIL for the case where no columns need to be converted.  For example,
  defel->arg is NIL for SELECT COUNT(*).  In this case, while
  cstate->convert_selectively is set to true, no columns are converted at
  NextCopyFrom().  The most efficient case!  In short,
  cstate->convert_selectively represents whether to do selective binary
  conversion at NextCopyFrom(), and cstate->convert_binary represents all
  the columns to be converted at NextCopyFrom(), which can be NIL.
 
  At NextCopyFrom(), this routine computes default values if configured.
  In cases where these values are not referenced, it might be possible to
  skip unnecessary calculations.
  Would it be possible to add logic to avoid constructing cstate->defmap
  for unreferenced columns at BeginCopyFrom()?
 
  I think that we don't need to add the above logic because file_fdw does
  BeginCopyFrom() with attnamelist = NIL, in which case BeginCopyFrom()
  doesn't construct cstate->defmap at all.
 
  I fixed a bug plus some minor optimization in
  check_binary_conversion(), which is renamed to
  check_selective_binary_conversion() in the updated version, and now
  file_fdw gives up selective binary conversion in the following
  cases:
 
   a) BINARY format
   b) CSV/TEXT format and whole row reference
   c) CSV/TEXT format and all the user attributes needed
 
 
  Best regards,
  Etsuro Fujita
 
  Thanks,
 
  2012/5/11 Etsuro Fujita fujita.ets...@lab.ntt.co.jp:
   -Original Message-
   From: Robert Haas [mailto:robertmh...@gmail.com]
   Sent: Friday, May 11, 2012 1:36 AM
   To: Etsuro Fujita
   Cc: pgsql-hackers@postgresql.org
   Subject: Re: [HACKERS] WIP Patch: Selective binary conversion of
   CSV file foreign tables
  
   On Tue, May 8, 2012 at 7:26 AM, Etsuro Fujita
   fujita.ets...@lab.ntt.co.jp
   wrote:
I would like to propose 

Re: [HACKERS] Backport of fsync queue compaction

2012-06-26 Thread Robert Haas
On Tue, Jun 19, 2012 at 5:39 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jun 19, 2012 at 5:33 PM, Greg Smith g...@2ndquadrant.com wrote:
 In January of 2011 Robert committed 7f242d880b5b5d9642675517466d31373961cf98
 to try and compact the fsync queue when clients find it full.  There's no
 visible behavior change, just a substantial performance boost possible in
 the rare but extremely bad situations where the background writer stops
 doing fsync absorption.  I've been running that in production at multiple
 locations since practically the day it hit this mailing list, with backports
 all the way to 8.3 being common (and straightforward to construct).  I've
 never seen a hint of a problem with this new code.

 I've been in favor of back-porting this for a while, so you'll get no
 argument from me.

 Anyone disagree?

Hearing no disagreement, I went ahead and did this.  I didn't take
Greg Smith's suggestion of adding a log message when/if the fsync
compaction logic fails to make any headway, because (1) the proposed
message didn't follow message style guidelines and I couldn't think of
a better one that did and (2) I'm not sure it's worth creating extra
translation work in the back-branches for such a marginal case.  We
can revisit this if people feel strongly about it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] lock_timeout and common SIGALRM framework

2012-06-26 Thread Robert Haas
On Tue, Jun 26, 2012 at 3:59 AM, Boszormenyi Zoltan z...@cybertec.at wrote:
 Well, I can make the registration interface similar to how LWLocks
 are treated, but that doesn't avoid modification of the base_timeouts
 array in case a new internal use case arises. Say:

 #define USER_TIMEOUTS    4

 int    n_timeouts = TIMEOUT_MAX;
 static timeout_params base_timeouts[TIMEOUT_MAX + USER_TIMEOUTS];

Since timeouts - unlike lwlocks - do not need to touch shared memory,
there's no need for a hard-coded limit here.  You can just allocate
the array using MemoryContextAlloc(TopMemoryContext, ...) and enlarge
it as necessary.  To avoid needing to modify the base_timeouts array,
you can just have internal callers push their entries into the array
during process startup using the same function call that an external
module would use.
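
A minimal sketch of that allocation pattern; timeout_params and the capacity
bookkeeping are assumptions, while MemoryContextAlloc() and repalloc() are
the real allocator calls:

static timeout_params *all_timeouts = NULL;
static int  n_timeouts = 0;
static int  capacity = 0;

void
RegisterTimeout(timeout_params entry)
{
    if (all_timeouts == NULL)
    {
        /* lives for the whole backend, so allocate in TopMemoryContext */
        capacity = 8;
        all_timeouts = MemoryContextAlloc(TopMemoryContext,
                                          capacity * sizeof(timeout_params));
    }
    else if (n_timeouts >= capacity)
    {
        /* repalloc keeps the chunk in its original memory context */
        capacity *= 2;
        all_timeouts = repalloc(all_timeouts,
                                capacity * sizeof(timeout_params));
    }
    all_timeouts[n_timeouts++] = entry;
}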

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-26 Thread Andres Freund
On Monday, June 25, 2012 08:50:54 PM Kevin Grittner wrote:
 Andres Freund and...@2ndquadrant.com wrote:
  We most particularly *don't* want DDL to replicate automatically,
  because the schema changes are deployed along with related
  software changes, and we like to pilot any changes for at least a
  few days.  Depending on the release, the rollout may take a
 couple of months, or we may slam it out everywhere a few days after
  the first pilot deployment.
  
  That's sensible for your use-case - but I do not think it's the
  appropriate behaviour for anything that is somewhat
  out-of-the-box...

 Right.  We currently consider the issues involved in a change and
 script it by hand.  I think we want to continue to do that.  The
 point was that, given the variety of timings and techniques for
 distributing schema changes, maybe is was only worth doing that
 automatically for the most constrained and controlled cases.
Agreed.

  So you could certainly punt all of this for any release as far as
  Wisconsin Courts are concerned.  We need to know table and column
  names, before and after images, and some application-supplied
  metadata.
  
  I am not sure we're going to get all that into 9.3.
 
 Sure, that was more related to why I was questioning how much these
 use cases even *could* integrate -- whether it even paid to
 *consider* these use cases at this point.  Responses from Robert and
 you have pretty much convinced me that I was being overly
 pessimistic on that.
I think it's an important question to ask; otherwise we might just end up with
infrastructure unusable for all our goals... Or usable but unfinished
infrastructure because it's too complex to build in a sensible time.

 One fine point regarding before and after images -- if a value
 doesn't change in an UPDATE, there's no reason to include it in both
 the BEFORE and AFTER tuple images, as long as we have the null
 column bitmaps -- or some other way of distinguishing unchanged from
 NULL.  (This could be especially important when the unchanged column
 was a 50 MB bytea.)  It doesn't matter to me which way this is
 optimized -- in our trigger-based system we chose to keep the full
 BEFORE tuple and just show AFTER values which were distinct from the
 corresponding BEFORE values, but it would be trivial to adapt the
 code to the other way.
I don't think doing that is worth the trouble in the first incarnation. There
is enough non-trivial detail hidden in there that I wouldn't worry about it
until the rest of the infrastructure exists. That part of the code will
definitely be version-specific, so I see no problem improving on it
incrementally.

 Sorry for that bout of pessimism.
Oh, no reason for that. I have some doubts about that myself, so...

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] [PATCH] lock_timeout and common SIGALRM framework

2012-06-26 Thread Boszormenyi Zoltan

2012-06-26 13:50 keltezéssel, Robert Haas írta:

On Tue, Jun 26, 2012 at 3:59 AM, Boszormenyi Zoltan z...@cybertec.at wrote:

Well, I can make the registration interface similar to how LWLocks
are treated, but that doesn't avoid modification of the base_timeouts
array in case a new internal use case arises. Say:

#define USER_TIMEOUTS    4

int    n_timeouts = TIMEOUT_MAX;
static timeout_params base_timeouts[TIMEOUT_MAX + USER_TIMEOUTS];

Since timeouts - unlike lwlocks - do not need to touch shared memory,
there's no need for a hard-coded limit here.  You can just allocate
the array using MemoryContextAlloc(TopMemoryContext, ...) and enlarge
it as necessary.  To avoid needing to modify the base_timeouts array,
you can just have internal callers push their entries into the array
during process startup using the same function call that an external
module would use.


I know, but it doesn't feel right to register static functionality.

--
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
     http://www.postgresql.at/




Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics

2012-06-26 Thread Satoshi Nagayasu
Hi all,

I've modified the pg_stat_lwlocks patch to be able to work with
the latest PostgreSQL Git code.

This patch provides:
  pg_stat_lwlocks   New system view to show lwlock statistics.
  pg_stat_get_lwlocks() New function to retrieve lwlock statistics.
  pg_stat_reset_lwlocks()   New function to reset lwlock statistics.

Please try it out.

Regards,

2012/06/26 5:29, Satoshi Nagayasu wrote:
 Hi all,
 
 I've been working on a new system view, pg_stat_lwlocks, to observe
 LWLocks, and have just completed my 'proof-of-concept' code that can work
 with version 9.1.
 
 Now, I'd like to know the possibility of this feature for future
 release.
 
 With this patch, a DBA can easily determine bottlenecks around lwlocks.
 --
 postgres=# SELECT * FROM pg_stat_lwlocks ORDER BY time_ms DESC LIMIT 10;
   lwlockid | calls  | waits | time_ms
  ----------+--------+-------+---------
         49 | 193326 | 32096 |   23688
          8 |   3305 |   133 |    1335
          2 |     21 |     0 |       0
          4 | 135188 |     0 |       0
          5 |  57935 |     0 |       0
          6 |    141 |     0 |       0
          7 |  24580 |     1 |       0
          3 |   3282 |     0 |       0
          1 |     41 |     0 |       0
          9 |      3 |     0 |       0
  (10 rows)
 
 postgres=#
 --
 
 In this view,
'lwlockid' column represents LWLockId used in the backends.
'calls' represents how many times LWLockAcquire() was called.
'waits' represents how many times LWLockAcquire() needed to wait
within it before lock acquisition.
'time_ms' represents how long LWLockAcquire() totally waited on
a lwlock.
 
 And lwlocks that use a LWLockId range, such as BufMappingLock or
 LockMgrLock, would be grouped and summed up in a single record.
 For example, lwlockid 49 in the above view represents LockMgrLock
 statistics.
 
 Now, I know there are some considerations.
 
 (1) Performance
 
 I've measured LWLock performance both with and without the patch,
 and confirmed that this patch does not affect the LWLock performance
 at all.
 
pgbench scores with the patch:
  tps = 900.906658 (excluding connections establishing)
  tps = 908.528422 (excluding connections establishing)
  tps = 903.900977 (excluding connections establishing)
  tps = 910.470595 (excluding connections establishing)
  tps = 909.685396 (excluding connections establishing)
 
pgbench scores without the patch:
  tps = 909.096785 (excluding connections establishing)
  tps = 894.868712 (excluding connections establishing)
  tps = 910.074669 (excluding connections establishing)
  tps = 904.022770 (excluding connections establishing)
  tps = 895.673830 (excluding connections establishing)
 
 Of course, this experiment was not I/O bound, and the cache hit ratio
 was >99.9%.
 
 (2) Memory space
 
 In this patch, I added three new members to the LWLock structure
 as uint64 counters to collect statistics.
 
It means that those members must be held in the shared memory,
but I'm not sure whether it's appropriate.
 
 I think another possible option is holding those statistics
 values in local (backend) process memory, and sending them through
 the stats collector process (like other statistics values).
 
 (3) LWLock names (or labels)
 
 Now, pg_stat_lwlocks view shows the LWLockId itself. But an LWLockId is
 not easy for a DBA to map to an actual lock type.
 
So, I want to show LWLock names (or labels), like 'WALWriteLock'
or 'LockMgrLock', but how should I implement it?
 
 Any comments?
 
 Regards,


-- 
Satoshi Nagayasu sn...@uptime.jp
Uptime Technologies, LLC. http://www.uptime.jp
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 7cc1d41..f832b45 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ CREATE VIEW pg_stat_bgwriter AS
 pg_stat_get_buf_alloc() AS buffers_alloc,
 pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 
+CREATE VIEW pg_stat_lwlocks AS
+SELECT
+S.lwlockid,
+S.calls,
+S.waits,
+S.time_ms
+FROM pg_stat_get_lwlocks() AS S;
+
 CREATE VIEW pg_user_mappings AS
 SELECT
 U.oid   AS umid,
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 95d4b37..2a2c197 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -32,6 +32,7 @@
 #include "storage/proc.h"
 #include "storage/spin.h"
 
+#include <sys/time.h>
 
 /* We use the ShmemLock spinlock to protect LWLockAssign */
 extern slock_t *ShmemLock;
@@ -46,6 +47,11 @@ typedef struct LWLock
 PGPROC *head;   /* head of list of waiting PGPROCs */
 PGPROC *tail;   /* tail of list of waiting PGPROCs */

Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-26 Thread Simon Riggs
On 25 June 2012 17:42, Kevin Grittner kevin.gritt...@wicourts.gov wrote:

 This is clearly going to depend on the topology.  You would
 definitely want to try to replicate the DDL for the case on which
 Simon is focused (which seems to me to be essentially physical
 replication of catalogs with logical replication of data changes
 from any machine to all others).

Just to remove any doubt: I'm not trying to support a single use case.

The overall proposals include a variety of design patterns. Each of
those covers many reasons for doing it, but they end up with the same
architecture.

1) Single master replication, with options not possible with physical
2) Multimaster
3) Many to One: data aggregation
4) Online upgrade

I don't think it will be possible to support all of those in one
release. Each has different challenges.

3 and 4 will not be worked on until 9.4, unless someone else is
willing to work on them. That isn't meant to be harsh, just an
explanation of practical reality that I hope people can accept without
needing to argue it.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] patch: avoid heavyweight locking on hash metapage

2012-06-26 Thread Robert Haas
On Mon, Jun 25, 2012 at 11:05 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Mon, Jun 18, 2012 at 5:42 PM, Robert Haas robertmh...@gmail.com wrote:

 Hmm.  That was actually a gloss I added on existing code to try to
 convince myself that it was safe; I don't think that the changes I
 made make that any more or less safe than it was before.

 Right, sorry.  I thought there was some strength reduction going on
 there as well.

 Thanks for the various explanations, they address my concerns.  I see
 that v2 applies over v1.

 I've verified performance improvements using 8 cores with my proposed
 pgbench -P benchmark, with a scale that fits in shared_buffers.
 It brings it most of the way, but not quite, up to the btree performance.


 I've marked this as ready for committer.

Thanks for the review; committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] lock_timeout and common SIGALRM framework

2012-06-26 Thread Robert Haas
On Tue, Jun 26, 2012 at 8:03 AM, Boszormenyi Zoltan z...@cybertec.at wrote:
 I know, but it doesn't feel right to register static functionality.

We do it elsewhere.  The overhead is pretty minimal compared to other
things we already do during startup, and avoiding the need for the
array to have a fixed size seems worth it, IMHO.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] new --maintenance-db options

2012-06-26 Thread Tom Lane
Amit Kapila amit.kap...@huawei.com writes:
 [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tom Lane
 The implementation I've wanted to see for some time is that you can
 start a standalone backend, but it speaks FE/BE protocol to its caller
 (preferably over pipes, so that there is no issue whatsoever of where
 you can securely put a socket or anything like that).  

 Can't it be done by following the FE/BE protocol, but calling the
 server APIs directly at the required places?

That wouldn't be easier, nor cleaner, and it would open us up to
client-induced database corruption (from failure to follow APIs, crashes
in the midst of an operation, memory stomps, etc).  We decided long ago
that we would never support truly embedded operation in the sense of PG
executing in the client's process/address space.  I like the design
suggested above because it has many of the good properties of an
embedded database (in particular, no need to manage or contact a server)
but still keeps the client code at arm's length.
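
A conceptual sketch of that pipe-based setup: the client forks a single-user
backend child and the two share a socketpair, so no filesystem socket ever
exists. Everything here is hypothetical - in particular the
--fe-be-protocol switch, since no such startup mode exists today:

#include <sys/socket.h>
#include <unistd.h>

int
spawn_standalone_backend(const char *datadir)
{
    int         sv[2];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    switch (fork())
    {
        case -1:
            return -1;
        case 0:
            /* child: the backend would speak FE/BE protocol on sv[1] */
            close(sv[0]);
            execlp("postgres", "postgres", "--fe-be-protocol",
                   "-D", datadir, (char *) NULL);
            _exit(1);
        default:
            /* parent: hand sv[0] to a libpq that can adopt an open socket */
            close(sv[1]);
            return sv[0];
    }
}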

regards, tom lane



Re: [HACKERS] [PATCH] lock_timeout and common SIGALRM framework

2012-06-26 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Jun 26, 2012 at 8:03 AM, Boszormenyi Zoltan z...@cybertec.at wrote:
 I know, but it doesn't feel right to register static functionality.

 We do it elsewhere.  The overhead is pretty minimal compared to other
 things we already do during startup, and avoiding the need for the
 array to have a fixed-size seems worth it, IMHO.

It's not even clear that the array needs to be dynamically resizable (yet).
Compare for instance syscache invalidation callbacks, which have so far
gotten along fine with a fixed-size array to hold registrations.  It's
foreseeable that we'll need something better someday, but that day
hasn't arrived, and might never arrive.
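
For reference, the fixed-size registration pattern Tom points to looks like
this from a caller's perspective (the callback body and the PROCOID choice
are placeholders; CacheRegisterSyscacheCallback() is the real API):

#include "utils/inval.h"
#include "utils/syscache.h"

static void
my_proc_callback(Datum arg, int cacheid, uint32 hashvalue)
{
    /* drop module-local state derived from pg_proc entries */
}

/* ... during module initialization: */
CacheRegisterSyscacheCallback(PROCOID, my_proc_callback, (Datum) 0);

Internally those registrations land in a fixed-size array, and overflowing
it is simply a FATAL error.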

I agree with the feeling that this patch isn't done if the core
timeout code has to know specifically about each usage.  We have that
situation already.

regards, tom lane



Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-26 Thread Robert Haas
On Mon, Jun 25, 2012 at 3:17 PM, Andres Freund and...@2ndquadrant.com wrote:
 I suppose the main reason we haven't done it already is that it
 increases the period of time during which we're using 2X the disk
 space.
 I find that an acceptable price if it's optional. Making it so doesn't seem
 to be a problem for me.

+1.

 I think there is absolutely nothing wrong with doing extra things in
 ALTER TABLE when logical replication is enabled.  We've got code
 that's conditional on Hot Standby being enabled in many places in the
 system; why should logical replication be any different?  If we set
 the bar for logical replication at the system can't do anything
 differently when logical replication is enabled then I cheerfully
 submit that we are doomed.  You've already made WAL format changes to
 support logging the pre-image of the tuple, which is a hundred times
 more likely to cause a performance problem than any monkeying around
 we might want to do in ALTER TABLE.

 I am deeply skeptical that we need to look inside of transactions that
 do full-table rewrites.  But even if we do, I don't see that what I'm
 proposing precludes it.  For example, I think we could have ALTER
 TABLE emit WAL records specifically for logical replication that allow
 us to disentangle which tuple descriptor to use at which point in the
 transaction.  I don't see that that would even be very difficult to
 set up.
 Sorry, I was imprecise above: I have no problem doing some changes during
 ALTER TABLE if logical rep is enabled. I am worried though that to make that
 robust you would need loads of places that emit additional information:
 * ALTER TABLE
 * ALTER FUNCTION
 * ALTER OPERATOR
 * ALTER/CREATE CAST
 * TRUNCATE
 * CLUSTER
 * ...

 I have the feeling that if we wanted to do that, the amount of required
 information would be rather high, and it would end up being essentially the
 catalog. Just having an intermediate tupledesc doesn't help that much if you
 have e.g. record_out doing type lookups of its own.

 There is also the issue you have talked about before, that a user type might
 depend on values in other tables. Unless we're ready to break at least
 transactional behaviour for those (for now...), I don't see how decoding
 outside of the transaction is ever going to be valid. I wouldn't have a big
 problem declaring that as broken for now...

I've been thinking about this a lot.  My thinking's still evolving
somewhat, but consider the following case.  A user defines a type T
with an input function I and an output function O.  They create a
table which uses type T and insert a bunch of data, which is duly
parsed using I; then, they replace I with a new input function I' and
O with a new output function O'.  Now, clearly, if we process the
inserts using the catalogs that were in effect at the time the inserts
were done, we could theoretically get different output than if we use
the catalogs that were in effect after the I/O functions were
replaced.  But is the latter output wrong, or merely different?  My
first thought when we started talking about this was "it's wrong", but
the more I think about it, the less convinced I am...

...because it can't possibly be right to suppose that it's impossible
to decode heap tuples using any catalog contents other than the ones
that were in effect at the time the tuples got inserted.  If that were
true, then we wouldn't be able to read a table after adding or
dropping a column, which of course we can.  It seems to me that it
might be sufficient to guarantee that we'll decode using the same
*types* that were in effect at the time the inserts happened.  If the
user yanks the rug out from under us by changing the type definition,
maybe we simply define that as a situation in which they get to keep
both pieces.  After all, if you replace the type definition in a way
that makes sensible decoding of the table impossible, you've pretty
much shot yourself in the foot whether logical replication enters the
picture or not.

In the enum case, for example, we go to great pains to make sure that
the table contents are always decodable not only under the current
version of SnapshotNow, but also any successor version.  We do that by
prohibiting ALTER TYPE .. ADD VALUE from running inside a transaction
block - because if we inserted a row into pg_enum and then inserted
dependent rows into some user table, a rollback could leave us with
rows that we can't decode.  For the same reason, we don't allow ALTER
TYPE .. DROP VALUE.  I think that we can infer a general principle
from this: while I/O functions may refer to catalog contents, they may
not do so in a way that could be invalidated by subsequent commits or
rollbacks.  If they did, they'd be breaking the ability of subsequent
SELECT statements to read the table.
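
The guard described above is easy to see from psql; this is the 9.1/9.2-era
behavior for enums:

CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');

BEGIN;
ALTER TYPE mood ADD VALUE 'meh';
ERROR:  ALTER TYPE ... ADD cannot run inside a transaction block
ROLLBACK;

-- Outside a transaction block the addition is allowed, because the
-- pg_enum row is then guaranteed to be committed before any table
-- row can reference it.
ALTER TYPE mood ADD VALUE 'meh';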

An interesting case that is arguably an exception to this rule is the
regwhatever types, which will cheerfully output their value as an OID
if it can't be decoded to text, but will bail on 

Re: [HACKERS] pgsql_fdw in contrib

2012-06-26 Thread Kohei KaiGai
Harada-san,

I checked your patch, and had the impression that it includes many
improvements over the previous revision that I looked at in the last
commit fest.

However, I noticed several points to be revised, or investigated.

* It seems to me the expected results of the regression test are not
  attached, even though test cases were included. Please add them.

* cleanup_connection() does not close the connection in the case where
  this callback is invoked due to an error under a sub-transaction.
  It probably causes problematic behavior.

  See the following steps to reproduce the matter:

postgres=# BEGIN;
BEGIN
postgres=# SELECT * FROM ft3;
 a |  b
---+-
 1 | aaa
 2 | bbb
 3 | ccc
 4 | ddd
 5 | eee
 6 | fff
(6 rows)

postgres=# SAVEPOINT sv_01;
SAVEPOINT

postgres=# SELECT * FROM ft3 WHERE 1 / (a - 4) > 0;  -- should be an
error on the remote transaction
ERROR:  could not execute remote query
DETAIL:  ERROR:  division by zero

HINT:  SELECT a, b FROM public.t1 WHERE (((1 OPERATOR(pg_catalog./) (a
OPERATOR(pg_catalog.-) 4)) OPERATOR(pg_catalog.>) 0))

postgres=# ROLLBACK TO SAVEPOINT sv_01;
ROLLBACK

postgres=# SELECT * FROM ft3;
ERROR:  could not execute EXPLAIN for cost estimation
DETAIL:  ERROR:  current transaction is aborted, commands ignored
until end of transaction block

HINT:  SELECT a, b FROM public.t1

  Once we get an error under the remote transaction, it eventually raises
  an error on the local side, and then cleanup_connection() should be invoked.
  But it ignores the error because of the sub-transaction; thus the remote
  transaction, which is already aborted, is kept around.
  I'd like to suggest two remedies:
  1. connections are closed even on errors under a sub-transaction;
  2. close the connection in execute_query() if PQexecParams() returns an
  error, prior to raising a local error.

  * Regarding deparseSimpleSql(): it pulls the attributes being referenced
    from baserestrictinfo and reltargetlist using pull_var_clause().
    Would it be possible to use local_conds instead of baserestrictinfo?
    That way we could avoid fetching variables consumed only on the
    remote side; all we need to transfer are the variables referenced in
    the targetlist and the local clauses.
    In addition, is pull_var_clause() reasonable for listing all the
    attributes referenced in both expression trees? It seems
    pull_varattnos() is the more useful API in this situation.

  * Regarding deparseRelation(): it scans simple_rte_array to fetch the
    RangeTblEntry with the relation id of the target foreign table.
    It does not match the correct entry if the same foreign table appears
    in this list twice or more, as in a self-join. I'd like to recommend
    using simple_rte_array[baserel->relid] instead; see the sketch after
    this list.
    In addition, it checks whether relkind is RELKIND_FOREIGN_TABLE or
    not. It seems to me this check should be an Assert(), since it is a
    bug if pgsql_fdw routines are called for regular relations.

  * Regarding deparseVar(): it is unnecessary to check the relkind of
    the relation being referenced by the Var node, isn't it?

  * Regarding deparseBoolExpr(): the compiler raised a warning because
    op can be used without being initialized.

  * Even though it is harmless, sortConditions() is a misleading function
    name. How about categorizeConditions() or screeningConditions()?
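
A fragment of the suggested deparseRelation() fix, assuming the PlannerInfo
*root and RelOptInfo *baserel that the FDW planning callbacks receive
(planner_rt_fetch() is the stock accessor for root->simple_rte_array):

    /* fetch the RTE by range-table index rather than scanning for a
     * matching relation OID, which misfires on self-joins */
    RangeTblEntry *rte = planner_rt_fetch(baserel->relid, root);

    /* pgsql_fdw should never be called for anything but a foreign table */
    Assert(rte->rtekind == RTE_RELATION &&
           get_rel_relkind(rte->relid) == RELKIND_FOREIGN_TABLE);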

Thanks for your great work.

2012/6/14 Shigeru HANADA shigeru.han...@gmail.com:
 I'd like to propose pgsql_fdw, FDW for PostgreSQL, as a contrib module
 in core, again.  This patch is basically a rebased version of the patches
 proposed in the 9.2 development cycle, and contains some additional changes
 such as care about volatile variables for PG_CATCH blocks.  In
 addition, I combined the old three patches (pgsql_fdw core functionality,
 push-down support, and analyze support) into one patch for ease of
 review. (I think these features are necessary for pgsql_fdw use.)

 After the development for the 9.2 cycle, pgsql_fdw can gather statistics
 from the remote side by providing a sampling function to the analyze module
 in the backend core, use them to estimate the selectivity of WHERE clauses
 that will be pushed down, and retrieve query results from the remote side
 with a custom row processor which prevents memory exhaustion from huge
 result sets.

 It would be possible to add some more features, such as ORDER BY
 push-down with index information support, without changing the existing
 APIs, but adding a relatively simple pgsql_fdw first and enhancing it later
 seems better.  In addition, once pgsql_fdw has been merged, it would help
 in implementing proofs of concept of SQL/MED-related features.

 Regards,
 --
 Shigeru HANADA






-- 
KaiGai Kohei kai...@kaigai.gr.jp



Re: [HACKERS] [PATCH 01/16] Overhaul walsender wakeup handling

2012-06-26 Thread Robert Haas
On Fri, Jun 22, 2012 at 12:35 PM, Andres Freund and...@2ndquadrant.com wrote:
 Can you elaborate on that a bit?  What scenarios did you play around
 with, and what does win mean in this context?
 I had two machines connected locally and setup HS and my prototype between
 them (not at once obviously).
 The patch reduced the average latency between the two nodes (measured by
 'ticker' rows arriving in a table on the standby), the jitter in latency, and
 the amount of load I had to put on the master before the standby couldn't keep
 up anymore.

 I played with different loads:
 * multiple concurrent ~50MB COPYs
 * multiple concurrent ~50MB COPYs, pgbench
 * pgbench

 All three had a ticker running concurrently with synchronous_commit=off
 (because it shouldn't cause any difference in the replication pattern itself).

 The difference in average lag and cutoff was smallest with just pgbench
 running alone and biggest with COPY running alone. High jitter was most
 visible with just pgbench running alone, but that's likely just because the
 average lag was smaller.

OK, that sounds pretty promising.  I'd like to run a few performance
tests on this just to convince myself that it doesn't lead to a
significant regression in other scenarios.  Assuming that doesn't turn
up anything major, I'll go ahead and commit this.

Can you provide a rebased version?  It seems that one of the hunks in
xlog.c no longer applies.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP Patch: Selective binary conversion of CSV file foreign tables

2012-06-26 Thread Kohei KaiGai
2012/6/26 Etsuro Fujita fujita.ets...@lab.ntt.co.jp:
 Hi Kaigai-san,

 -Original Message-
 From: Kohei KaiGai [mailto:kai...@kaigai.gr.jp]
 Sent: Monday, June 25, 2012 9:49 PM
 To: Etsuro Fujita
 Cc: Robert Haas; pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] WIP Patch: Selective binary conversion of CSV file
 foreign tables

 Fujita-san,

 The revised patch is almost good for me.
 The only point I noticed is the check for redundant or duplicated options
 that I pointed out in the previous post.
 So, if you agree with the attached patch, I'd like to hand this patch over
 to the committers.

 OK, I agree with you.  Thanks for the revision!

 Besides the revision, I modified check_selective_binary_conversion() to run
 heap_close() in the whole-row-reference case.  Attached is an updated
 version of the patch.

Sorry, I overlooked this code path. It seems fair enough to me.

So, I'd like to hand this patch over to the committers.

Thanks,

 Thanks.

 Best regards,
 Etsuro Fujita

  All I changed is:
   --- a/src/backend/commands/copy.c
   +++ b/src/backend/commands/copy.c
   @@ -122,6 +122,11 @@ typedef struct CopyStateData
   @@ -217,7 +221,7 @@ index 98bcb2f..0143d81 100644
          }
   +      else if (strcmp(defel->defname, "convert_binary") == 0)
   +      {
  -+          if (cstate->convert_binary)
  ++          if (cstate->convert_selectively)
   +              ereport(ERROR,
   +                      (errcode(ERRCODE_SYNTAX_ERROR),
   +                       errmsg("conflicting or redundant options")));

 Thanks,

 2012/6/20 Etsuro Fujita fujita.ets...@lab.ntt.co.jp:
  Hi KaiGai-san,
 
  Thank you for the review.
 
  -Original Message-
  From: pgsql-hackers-ow...@postgresql.org
  [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kohei KaiGai
  Sent: Wednesday, June 20, 2012 1:26 AM
  To: Etsuro Fujita
  Cc: Robert Haas; pgsql-hackers@postgresql.org
  Subject: Re: [HACKERS] WIP Patch: Selective binary conversion of CSV
  file foreign tables
 
  Hi Fujita-san,
 
  Could you rebase this patch towards the latest tree?
  It was unavailable to apply the patch cleanly.
 
  Sorry, I updated the patch.  Please find attached an updated version
  of the patch.
 
  I looked over the patch, then noticed a few points.
 
   At ProcessCopyOptions(), defel->arg can be NIL, can't it?
   If so, cstate->convert_binary is not a suitable flag for checking the
   redundant option. It seems to me cstate->convert_selectively is the more
   suitable flag to check.
 
   +       else if (strcmp(defel->defname, "convert_binary") == 0)
   +       {
   +           if (cstate->convert_binary)
   +               ereport(ERROR,
   +                       (errcode(ERRCODE_SYNTAX_ERROR),
   +                        errmsg("conflicting or redundant options")));
   +           cstate->convert_selectively = true;
   +           if (defel->arg == NULL || IsA(defel->arg, List))
   +               cstate->convert_binary = (List *) defel->arg;
   +           else
   +               ereport(ERROR,
   +                       (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
   +                        errmsg("argument to option \"%s\" must be a list of column names",
   +                               defel->defname)));
   +       }
 
   Yes, defel->arg can be NIL.  defel->arg is a List structure listing
   all the columns that need to be converted to binary representation, which
   is NIL for the case where no columns need to be converted.  For example,
   defel->arg is NIL for SELECT COUNT(*).  In this case, while
   cstate->convert_selectively is set to true, no columns are converted at
   NextCopyFrom().  The most efficient case!  In short,
   cstate->convert_selectively represents whether to do selective binary
   conversion at NextCopyFrom(), and cstate->convert_binary represents all
   the columns to be converted at NextCopyFrom(), which can be NIL.
 
   At NextCopyFrom(), this routine computes default values if configured.
   In cases where these values are not referenced, it might be possible to
   skip unnecessary calculations.
   Would it be possible to add logic to avoid constructing cstate->defmap
   for unreferenced columns at BeginCopyFrom()?
 
   I think that we don't need to add the above logic because file_fdw does
   BeginCopyFrom() with attnamelist = NIL, in which case BeginCopyFrom()
   doesn't construct cstate->defmap at all.
 
   I fixed a bug plus some minor optimization in
   check_binary_conversion(), which is renamed to
   check_selective_binary_conversion() in the updated version, and now
   file_fdw gives up selective binary conversion in the following
   cases:
 
   a) BINARY format
   b) CSV/TEXT format and whole row reference
   c) CSV/TEXT format and all the user attributes needed
 
 
  Best regards,
  Etsuro Fujita
 
  Thanks,
 
  2012/5/11 Etsuro Fujita fujita.ets...@lab.ntt.co.jp:
   -Original Message-
   From: Robert Haas [mailto:robertmh...@gmail.com]
   Sent: Friday, May 11, 2012 1:36 AM
   To: Etsuro Fujita
   Cc: pgsql-hackers@postgresql.org
   Subject: Re: [HACKERS] 

Re: [HACKERS] [PATCH 01/16] Overhaul walsender wakeup handling

2012-06-26 Thread Andres Freund
On Tuesday, June 26, 2012 04:01:26 PM Robert Haas wrote:
 On Fri, Jun 22, 2012 at 12:35 PM, Andres Freund and...@2ndquadrant.com 
wrote:
  Can you elaborate on that a bit?  What scenarios did you play around
  with, and what does win mean in this context?
  
  I had two machines connected locally and setup HS and my prototype
  between them (not at once obviously).
  The patch reduced the average latency between the two nodes (measured by
  'ticker' rows arriving in a table on the standby), the jitter in latency
  and the amount of load I had to put on the master before the standby
  couldn't keep up anymore.
  
  I played with different loads:
  * multiple concurrent ~50MB COPYs
  * multiple concurrent ~50MB COPYs, pgbench
  * pgbench
  
  All three had a ticker running concurrently with synchronous_commit=off
  (because it shouldn't cause any difference in the replication pattern
  itself).
  
  The difference in average lag and cutoff was smallest with just pgbench
  running alone and biggest with COPY running alone. High jitter was most
  visible with just pgbench running alone, but that's likely just because
  the average lag was smaller.
 
 OK, that sounds pretty promising.  I'd like to run a few performance
 tests on this just to convince myself that it doesn't lead to a
 significant regression in other scenarios.  Assuming that doesn't turn
 up anything major, I'll go ahead and commit this.
Independent testing would be great; it's definitely possible that I overlooked
something, although I obviously don't think so ;).

 Can you provide a rebased version?  It seems that one of the hunks in
 xlog.c no longer applies.
Will do so. Not sure if I can finish it today though; I am in the midst of
redoing the ilist and xlogreader patches. I guess tomorrow will suffice
otherwise...

Thanks!

Andres


-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Reporting WAL file containing checkpoint's REDO record in pg_controldata's result

2012-06-26 Thread Fujii Masao
On Wed, Mar 28, 2012 at 10:08 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Mar 28, 2012 at 12:30 AM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:

 Excerpts from Fujii Masao's message of mar mar 27 06:40:34 -0300 2012:

 Anyway, should I add this patch into the next CF? Or is anyone planning
 to commit the patch for 9.2?

 I think the correct thing to do here is add to next CF, and if some
 committer has enough interest in getting it quickly it can be grabbed
 from there and committed into 9.2.

 Yep! I've added it to next CF.

Attached is the revised version of the patch.

Regards,

-- 
Fujii Masao


pg_controldata_walfilename_v4.patch
Description: Binary data



Re: [HACKERS] libpq compression

2012-06-26 Thread Robert Haas
On Mon, Jun 25, 2012 at 4:25 PM, k...@rice.edu k...@rice.edu wrote:
 On Mon, Jun 25, 2012 at 09:45:26PM +0200, Florian Pflug wrote:
 On Jun25, 2012, at 21:21 , Dimitri Fontaine wrote:
  Magnus Hagander mag...@hagander.net writes:
  Or that it takes less code/generates cleaner code...
 
  So we're talking about some LZO things such as snappy from google, and
  that would be another run time dependency IIUC.
 
  I think it's time to talk about fastlz:
 
   http://fastlz.org/
   http://code.google.com/p/fastlz/source/browse/trunk/fastlz.c
 
   551 lines of C code under MIT license, works also under windows
 
  I guess it would be easy (and safe) enough to embed in our tree should
  we decide going this way.

 Agreed. If we extend the protocol to support compression, and not rely
 on SSL, then let's pick one of these LZ77-style compressors, and let's
 integrate it into our tree.

 We should then also make it possible to enable compression only for
 the server - client direction. Since those types of compressions are
 usually pretty easy to decompress, that reduces the amount to work
 non-libpq clients have to put in to take advantage of compression.

 Here is the benchmark list from the Google lz4 page:

 Name            Ratio   C.speed D.speed
 LZ4 (r59)       2.084   330      915
 LZO 2.05 1x_1   2.038   311      480
 QuickLZ 1.5 -1  2.233   257      277
 Snappy 1.0.5    2.024   227      729
 LZF             2.076   197      465
 FastLZ          2.030   190      420
 zlib 1.2.5 -1   2.728    39      195
 LZ4 HC (r66)    2.712    18     1020
 zlib 1.2.5 -6   3.095    14      210

 lz4 absolutely dominates on compression/decompression speed. While fastlz
 is faster than zlib(-1) on compression, lz4 is almost 2X faster still.

At the risk of making everyone laugh at me, has anyone tested pglz?  I
observe that if the compression ratio and performance are good, we
might consider using it for this purpose, too, which would avoid
having to add dependencies.  Conversely, if they are bad, and we
decide to support another algorithm, we might consider also using that
other algorithm, at least optionally, for the purposes for which we
now use pglz.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] empty backup_label

2012-06-26 Thread Robert Haas
On Sun, Jun 24, 2012 at 5:33 PM, David Kerr d...@mr-paradox.net wrote:
 Howdy,

 We're using NetApp's flexclones whenever we need to move our DB between 
 machines.

 One specific case where we do that is when we're creating a new streaming 
 replication target.

 The basic steps we're using are:
 pg_start_backup();
 flex clone within the netapp
 pg_stop_backup();

 The problem I'm seeing is that periodically the backup_label is empty, which 
 means
 I can't start the new standby.

 I believe that since the NetApp stuff is all happening within the SAN this 
 file hasn't been
 fsynced to disk by the time we take the snapshot.

 One option would be to do a sync prior to the clone, however that seems 
 kind of like a
 heavy operation, and it's slightly more complicated to script. (having to 
 have a user
 account on the system to sudo rather than just connecting to the db to issue 
 the
 pg_start_backup(...) )

 Another option is to add pg_fsync(fileno(fp)) after the fflush() when 
 creating the file (I'm not
 sure if fsync implies fflush or not, if it does you could just replace it.)

 I think this type of snapshot is fairly common, I've been doing them since 
 2000 with EMC,
 I'm sure that most SAN vendors support it.

This seems like a good idea to me.  Actually, I'm wondering if we
shouldn't back-patch this.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
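
To make the proposed fix concrete, here is a minimal sketch of the
fflush-then-fsync pattern discussed above.  Note that fsync(2) operates on the
underlying file descriptor and does not drain stdio buffers, so it complements
fflush(3) rather than replacing it.  The function name and error handling
below are illustrative only, not the actual xlog.c code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Write a small label file and force it to stable storage. */
    static void
    write_label_file(const char *path, const char *contents)
    {
        FILE *fp = fopen(path, "w");

        if (fp == NULL)
            exit(1);
        fputs(contents, fp);
        if (fflush(fp) != 0 ||          /* drain the stdio buffer */
            fsync(fileno(fp)) != 0 ||   /* push the kernel buffer to disk */
            fclose(fp) != 0)
            exit(1);
    }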


Re: [HACKERS] GiST subsplit question

2012-06-26 Thread Robert Haas
On Wed, May 30, 2012 at 4:35 PM, Alexander Korotkov
aekorot...@gmail.com wrote:
 So, do we demote that message to a DEBUG1? Or do we make it more clear
 what the authors of a specific picksplit are supposed to do to avoid
 that problem? Or am I misunderstanding something?


 +1 for demoting the message to DEBUG1. I think it shouldn't be so noisy; it just
 indicates something could be improved.
 Also I think we definitely need to document secondary split. Now there's no
 chance to understand it without reverse engineering it from the code.

I'm happy to go demote the message if we have consensus on that, but
somebody else is going to need to provide the doc patch.  Any takers?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How to avoid base backup in automated failover

2012-06-26 Thread Robert Haas
On Mon, Jun 4, 2012 at 9:14 AM, chinnaobi chinna...@gmail.com wrote:
 Recently I was writing an application to implement automated failover with
 env: two Windows 2008 R2 servers, network-attached storage, asynchronous replication,
 WAL archive on primary enabled.

 Is there any way to avoid always starting the standby server from a base backup
 in automated failover? The database is growing huge, and I can't keep doing
 a base backup every day.

 Please suggest a solution.

The usual solution is to configure the standby as a warm or hot
standby, so that logs are continuously replayed there.  Then if the
master dies, you only have to wait for replication to catch up before
promoting.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] libpq compression

2012-06-26 Thread Euler Taveira
On 26-06-2012 12:23, Robert Haas wrote:
 At the risk of making everyone laugh at me, has anyone tested pglz?  I
 observe that if the compression ratio and performance are good, we
 might consider using it for this purpose, too, which would avoid
 having to add dependencies.  Conversely, if they are bad, and we
 decide to support another algorithm, we might consider also using that
 other algorithm, at least optionally, for the purposes for which we
 now use pglz.
 
I'll remember to test it too.


-- 
   Euler Taveira de Oliveira - Timbira   http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] empty backup_label

2012-06-26 Thread Magnus Hagander
On Tue, Jun 26, 2012 at 5:24 PM, Robert Haas robertmh...@gmail.com wrote:
 On Sun, Jun 24, 2012 at 5:33 PM, David Kerr d...@mr-paradox.net wrote:
 Howdy,

 We're using NetApp's flexclones whenever we need to move our DB between 
 machines.

 One specific case where we do that is when we're creating a new streaming 
 replication target.

 The basic steps we're using are:
 pg_start_backup();
 flex clone within the netapp
 pg_stop_backup();

 The problem I'm seeing is that periodically the backup_label is empty, which 
 means
 I can't start the new standby.

 I believe that since the NetApp stuff is all happening within the SAN this 
 file hasn't been
 fsynced to disk by the time we take the snapshot.

 One option would be to do a sync prior to the clone, however that seems 
 kind of like a
 heavy operation, and it's slightly more complicated to script. (having to 
 have a user
 account on the system to sudo rather than just connecting to the db to issue 
 the
 pg_start_backup(...) )

 Another option is to add pg_fsync(fileno(fp)) after the fflush() when 
 creating the file (I'm not
 sure if fsync implies fflush or not, if it does you could just replace it.)

 I think this type of snapshot is fairly common, I've been doing them since 
 2000 with EMC,
 I'm sure that most SAN vendors support it.

 This seems like a good idea to me.  Actually, I'm wondering if we
 shouldn't back-patch this.

 Thoughts?

Certainly can't hurt.

I guess any other files that are lost this way will be recreated by
WAL recovery - or is there something else that might be at risk of
similar treatment?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] libpq compression

2012-06-26 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Jun 25, 2012 at 4:25 PM, k...@rice.edu k...@rice.edu wrote:
 Here is the benchmark list from the Google lz4 page:
 
  Name            Ratio   C.speed D.speed
  LZ4 (r59)       2.084   330      915
  LZO 2.05 1x_1   2.038   311      480
  QuickLZ 1.5 -1  2.233   257      277
  Snappy 1.0.5    2.024   227      729
  LZF             2.076   197      465
  FastLZ          2.030   190      420
  zlib 1.2.5 -1   2.728    39      195
  LZ4 HC (r66)    2.712    18     1020
  zlib 1.2.5 -6   3.095    14      210

 lz4 absolutely dominates on compression/decompression speed. While fastlz
 is faster than zlib(-1) on compression, lz4 is almost 2X faster still.

 At the risk of making everyone laugh at me, has anyone tested pglz?

Another point here is that those Google numbers (assuming that they
apply to our use-cases, a point not in evidence) utterly fail to make
the argument that zlib is not the thing to use.  zlib is beating all
the others on compression ratio, often by 50%.  Before making any
comparisons, you have to make some assumptions about what the network
speed is ... and unless it's pretty damn fast relative to your CPU speed
getting the better compression ratio is going to be more attractive than
saving some cycles.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
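
Tom's point can be made concrete with a back-of-envelope model: if compression
overlaps transmission, shipping a payload takes roughly max(compression time,
wire time for the compressed bytes).  The sketch below plugs in the ratio and
speed figures from the table quoted above; the payload size and network speeds
are assumptions chosen purely for illustration:

    #include <math.h>
    #include <stdio.h>

    /* Pipelined model: the sender compresses while the wire drains. */
    static double
    xfer_seconds(double mb, double ratio, double comp_mb_s, double net_mb_s)
    {
        return fmax(mb / comp_mb_s, (mb / ratio) / net_mb_s);
    }

    int
    main(void)
    {
        const double mb = 1000.0;           /* 1 GB payload */
        const double nets[] = {2.0, 125.0}; /* slow WAN vs. gigabit LAN */

        for (int i = 0; i < 2; i++)
            printf("net %6.1f MB/s: raw %5.0fs  lz4 %5.0fs  zlib-6 %5.0fs\n",
                   nets[i],
                   mb / nets[i],
                   xfer_seconds(mb, 2.084, 330.0, nets[i]),
                   xfer_seconds(mb, 3.095, 14.0, nets[i]));
        return 0;
    }

On the 2 MB/s link the better zlib ratio wins (about 162s vs. 240s for lz4);
on the 125 MB/s link zlib's 14 MB/s compression becomes the bottleneck (about
71s vs. 4s), which is exactly the network-versus-CPU-speed dependency
described above.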


Re: [HACKERS] Schema version management

2012-06-26 Thread Robert Haas
On Tue, May 22, 2012 at 11:31 PM, Joel Jacobson j...@trustly.com wrote:
 This is true, which means some users won't be able to use the feature,
 because they are using an ancient OS or have function names with slashes.
 Hm, is it even possible to have function names with slashes?

Sure.  If you quote the function name, you can put anything you want
in there.  Note that Windows disallows a whole bunch of special
characters in filenames, while UNIX-like systems tend to disallow only
slash.

 I suppose you have a lot more experience with what Postgres installations exist
 in the world. Do you think it's common for databases to have problematic
 non-ASCII characters in object names?

 Is it a project policy that all features of all standard tools must be
 useful for all users on all platforms and all databases? Or is it acceptable
 if some features are only usable for, say, 90% of the users?

There are cases where we permit features that only work on some
platforms, but it's rare.  Usually, we do this only when the platform
lacks some API that exists elsewhere.  For example, collations and
prefetching are not supported on Windows because the UNIX APIs we use
don't exist there.

In this case, it seems like you could work around the problem by, say,
URL-escaping any characters that can't be used in an unquoted
identifier.  Of course that might make the file name long enough to
hit the platform-specific file name limit.  Not sure what to do about
that.  The basic idea you're proposing here has been proposed a number
of times before, but it's always fallen down over questions of (1)
what to do with very long object names or those containing special
characters and (2) objects (like functions) for which schema+name is
not a unique identifier.

I don't think either of these problems ought to be a complete
show-stopper.  It seems to me that the trade-off is that when object
names are long, contain special characters, or are overloaded, we'll
have to munge the names in some way to prevent collisions.  That could
mean that the names are not 100% stable, which would possibly produce
some annoyance if you're using a VCS to track changes, but maybe
that's an acceptable trade-off, because it shouldn't happen very
often.  If we could guarantee that identifiers less than 64
characters which are not overloaded and contain no special characters
requiring quoting end up in an eponymous file, I think that would be
good enough to make most of our users pretty happy.  In other cases, I
think the point would just be to make it work (with a funny name)
rather than fail.

 \i /home/postgres/database/gluepay-split/public/TYPE/dblink_pkey_results.sql
 \i /home/postgres/database/gluepay-split/public/TYPE/r_matchedwithdrawal.sql
 \i 
 /home/postgres/database/gluepay-split/public/TYPE/r_unapprovedwithdrawal.sql
 \i 
 /home/postgres/database/gluepay-split/public/TYPE/ukaccountvalidationchecktype.sql
 \i /home/postgres/database/gluepay-split/aml/FUNCTION/check_name.sql
 \i /home/postgres/database/gluepay-split/aml/FUNCTION/describe_entityid.sql
 \i /home/postgres/database/gluepay-split/aml/FUNCTION/get_linkid.sql
 \i /home/postgres/database/gluepay-split/aml/FUNCTION/set_address.sql
 -- ... all the objects ..
 \i 
 /home/postgres/database/gluepay-split/public/FK_CONSTRAINT/workershistory_workerid_fkey.sql
 \i 
 /home/postgres/database/gluepay-split/public/FK_CONSTRAINT/workershistory_workerstatusid_fkey.sql
 \i 
 /home/postgres/database/gluepay-split/public/FK_CONSTRAINT/workershistory_workertypeid_fkey.sql

It would be better to use \ir here rather than hard-code path names, I
think.  Then you'd only need to require that all the files be in the
same directory, rather than requiring them to be at a certain
hard-coded place in the filesystem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
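
A minimal sketch of the escaping Robert suggests: characters legal in an
unquoted identifier pass through and everything else is percent-encoded, so a
function named my/func would land in my%2Ffunc.sql.  The function name and
buffer handling here are assumptions for illustration:

    #include <ctype.h>
    #include <stdio.h>

    /* Map an SQL identifier onto a filesystem-safe file name component. */
    static void
    escape_ident_for_filename(const char *ident, char *out, size_t outlen)
    {
        size_t n = 0;

        for (const unsigned char *p = (const unsigned char *) ident; *p; p++)
        {
            if (islower(*p) || isdigit(*p) || *p == '_')
            {
                if (n + 1 < outlen)
                    out[n++] = (char) *p;   /* legal in an unquoted identifier */
            }
            else if (n + 4 <= outlen)       /* room for "%XX" plus terminator */
                n += (size_t) sprintf(out + n, "%%%02X", (unsigned) *p);
        }
        out[n] = '\0';
    }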


Re: [HACKERS] empty backup_label

2012-06-26 Thread David Kerr
On Tue, Jun 26, 2012 at 05:33:42PM +0200, Magnus Hagander wrote:
- On Tue, Jun 26, 2012 at 5:24 PM, Robert Haas robertmh...@gmail.com wrote:
-  On Sun, Jun 24, 2012 at 5:33 PM, David Kerr d...@mr-paradox.net wrote:
-  Howdy,
- 
-  We're using NetApp's flexclones whenever we need to move our DB between 
machines.
- 
-  One specific case where we do that is when we're creating a new streaming 
replication target.
- 
-  The basic steps we're using are:
-  pg_start_backup();
-  flex clone within the netapp
-  pg_stop_backup();
- 
-  The problem I'm seeing is that periodically the backup_label is empty, 
which means
-  I can't start the new standby.
- 
-  I believe that since the NetApp stuff is all happening within the SAN this 
file hasn't been
-  fsynced to disk by the time we take the snapshot.
- 
-  One option would be to do a sync prior to the clone, however that seems 
kind of like a
-  heavy operation, and it's slightly more complicated to script. (having to 
have a user
-  account on the system to sudo rather than just connecting to the db to 
issue the
-  pg_start_backup(...) )
- 
-  Another option is to add pg_fsync(fileno(fp)) after the fflush() when 
creating the file (I'm not
-  sure if fsync implies fflush or not, if it does you could just replace it.)
- 
-  I think this type of snapshot is fairly common, I've been doing them since 
2000 with EMC,
-  I'm sure that most SAN vendors support it.
- 
-  This seems like a good idea to me.  Actually, I'm wondering if we
-  shouldn't back-patch this.
- 
-  Thoughts?
- 
- Certainly can't hurt.
- 
- I guess any other files that are lost this way will be recreated by
- WAL recovery - or is there something else that might be at risk of
- similar treatment?

The only other file I've run into is pgstats.stat, which I think is ok to get 
blown away.
There certainly could be others, though.

Dave

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] empty backup_label

2012-06-26 Thread Robert Haas
On Tue, Jun 26, 2012 at 11:33 AM, Magnus Hagander mag...@hagander.net wrote:
 On Tue, Jun 26, 2012 at 5:24 PM, Robert Haas robertmh...@gmail.com wrote:
 On Sun, Jun 24, 2012 at 5:33 PM, David Kerr d...@mr-paradox.net wrote:
 Howdy,

 We're using NetApp's flexclones whenever we need to move our DB between 
 machines.

 One specific case where we do that is when we're creating a new streaming 
 replication target.

 The basic steps we're using are:
 pg_start_backup();
 flex clone within the netapp
 pg_stop_backup();

 The problem I'm seeing is that periodically the backup_label is empty, 
 which means
 I can't start the new standby.

 I believe that since the NetApp stuff is all happening within the SAN this 
 file hasn't been
 fsynced to disk by the time we take the snapshot.

 One option would be to do a sync prior to the clone, however that seems 
 kind of like a
 heavy operation, and it's slightly more complicated to script. (having to 
 have a user
 account on the system to sudo rather than just connecting to the db to 
 issue the
 pg_start_backup(...) )

 Another option is to add pg_fsync(fileno(fp)) after the fflush() when 
 creating the file (I'm not
 sure if fsync implies fflush or not, if it does you could just replace it.)

 I think this type of snapshot is fairly common, I've been doing them since 
 2000 with EMC,
 I'm sure that most SAN vendors support it.

 This seems like a good idea to me.  Actually, I'm wondering if we
 shouldn't back-patch this.

 Thoughts?

 Certainly can't hurt.

 I guess any other files that are lost this way will be recreated by
 WAL recovery - or is there something else that might be at risk of
 similar treatment?

I can't think of anything.  pg_start_backup does a checkpoint, which
in theory oughta be enough to make sure everything that matters hits
the platter.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] hot standby PSQL 9.1 Windows 2008 Servers

2012-06-26 Thread Robert Haas
On Tue, May 22, 2012 at 12:15 PM, chinnaobi chinna...@gmail.com wrote:
 You mean the primary that is going to switch its role to standby might
 not have sent all the WAL records to the standby, and if it is switched to
 standby it has more WAL records than the standby which now serves as
 primary. Is that it?

Yes, that is possible.  Or the standby might have received all the WAL
records but not be caught up in terms of replaying them.

 It is actually the standby server which has to be restored from archive when
 it is switching to primary, right? Not the primary which is switching to
 standby?

If you want to promote a standby, you can just do it (pg_ctl promote).
 If you have a master that you want to demote to a standby, you've got
to resync it to whatever the current master is.  I understand repmgr
has some tooling to help automate that, although I have not played
with it myself.  In any event rsync can be a big help in reducing the
resync time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] lock_timeout and common SIGALRM framework

2012-06-26 Thread Alvaro Herrera
Excerpts from Boszormenyi Zoltan's message of Tue Jun 26 03:59:06 -0400 2012:
 
 On 2012-06-26 06:59, Alvaro Herrera wrote:
  I cleaned up the framework patch a bit.  My version's attached.  Mainly,
  returning false for failure in some code paths that are only going to
  have the caller elog(FATAL) is rather pointless -- it seems much better
  to just have the code itself do the elog(FATAL).  In any case, I
  searched for reports of those error messages being reported in the wild
  and saw none.
 
 OK. The cleanups are always good.
 
 One nitpick, though. Your version doesn't contain the timeout.h
 and compilation fails because of it. :-) You might have forgotten
 to do git commit -a.

Oops.  Attached. It's pretty much the same you had, except some bools
are changed to void.

  There is one thing that still bothers me on this whole framework patch,
  but I'm not sure it's easily fixable.  Basically the API to timeout.c is
  the whole list of timeouts and their whole behaviors.  If you want to
  add a new one, you can't just call into the entry points, you have to
  modify timeout.c itself (as well as timeout.h as well as whatever code
  it is that you want to add timeouts to).  This may be good enough, but I
  don't like it.  I think it boils down to proctimeout.h is cheating.
 
  The interface I would actually like to have is something that lets the
  modules trying to get a timeout register the timeout-checking function
  as a callback.  API-wise, it would be much cleaner.  However, I'm not
  really sure that code-wise it would be any cleaner at all.  In fact I
  think it'd be rather cumbersome.  However, if you could give that idea
  some thought, it'd be swell.
 
 Well, I can make the registration interface similar to how LWLocks
 are treated, but that doesn't avoid modification of the base_timeouts
 array in case a new internal use case arises. Say:
 
  #define USER_TIMEOUTS    4
 
  int    n_timeouts = TIMEOUT_MAX;
 static timeout_params base_timeouts[TIMEOUT_MAX + USER_TIMEOUTS];
 
 and register_timeout() adds data after TIMEOUT_MAX.

Well, I don't expect that we're going to have many external uses.  My
point about registration is so that internal callers use it as well.  I
was thinking we could do something like xact.c does, which is to have
somewhere in proc.c a call like this:

TimeoutRegister(DEADLOCK_TIMEOUT, 1, true, CheckDeadLock, GetCurrentTimestamp);

at process startup (the magic value 1 is the priority.  Maybe there's a
better way to handle this).  That way, timeout.c or timeout.h do not
need to know anything about proc.c at all; we don't need proctimeout.h
at all.  The only thing it (the timeout module) needs to know is that
there is a symbolic constant named DEADLOCK_TIMEOUT.  Similarly for
statement timeout, etc.

When you call enable_timeout you first have to ensure that
DEADLOCK_TIMEOUT has been registered; and if it's not, die on an
Assert().  That way you ensure that there are no bugs where you try to
enable a timeout that hasn't been registered.

(If we later find the need to let external modules add timeouts, which I
find unlikely, we can have TimeoutRegister return TimeoutName and have
this value be dynamically allocated.  But for now I think that would be
useless complication).

The difference with lwlocks is that each lwlock is just a number.  The
lwlock.c module doesn't need to know anything about how each lwlock is
actually used.  It's not the same thing here -- which is why I think it
would be better to have all modules, even the hardcoded ones, use a
registering interface.

... Now, it could be argued that it would be even better to have
TimeoutRegister not take the TimeoutName argument, and instead generate
it dynamically and pass it back to the caller -- that way you don't need
the enum in timeout.h.  The problem I have with that idea is that it
would force the caller modules to have a global variable to keep track
of that, which seems worse to me.

  I haven't looked at the second patch at all yet.
 
 This is the whole point of the first patch.

I know that.  But I want to get the details of the framework right
before we move on to add more stuff to it.

 But as I said above for
 registering a new timeout source, it's a new internal use case.
  One that touches a lot of places and cannot simply be done
 by registering a new timeout source.

Sure.  That's going to be the case for any other timeout we want to add
(which is why I think we don't really need dynamic timeouts).

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


1-timeout-framework-v9a.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
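
A minimal sketch of the registration interface under discussion, with the
fixed-size array and the register-before-enable Assert; the names and fields
are illustrative, not the committed timeout.c API:

    #include <assert.h>
    #include <stdbool.h>

    typedef enum TimeoutName
    {
        DEADLOCK_TIMEOUT,
        STATEMENT_TIMEOUT,
        TIMEOUT_MAX                     /* size of the fixed array */
    } TimeoutName;

    typedef void (*timeout_handler) (void);

    static struct
    {
        bool            registered;
        int             priority;
        timeout_handler handler;
    } timeouts[TIMEOUT_MAX];

    /* Called once per timeout source at process startup. */
    void
    TimeoutRegister(TimeoutName id, int priority, timeout_handler handler)
    {
        assert(!timeouts[id].registered);   /* forbid double registration */
        timeouts[id].registered = true;
        timeouts[id].priority = priority;
        timeouts[id].handler = handler;
    }

    void
    enable_timeout(TimeoutName id, int delay_ms)
    {
        assert(timeouts[id].registered);    /* must register before enabling */
        /* ... arm the shared SIGALRM machinery for `id` here ... */
        (void) delay_ms;
    }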


Re: [HACKERS] [v9.3] Row-Level Security

2012-06-26 Thread Robert Haas
On Thu, Jun 14, 2012 at 11:43 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 In the previous discussion, we planned to add a syntax option to
 clarify the command type to fire the RLS policy, such as FOR UPDATE.
 But current my opinion is, it is not necessary. For example, we can
 reference the contents of rows being updated / deleted using
 RETURNING clause. So, it does not make sense to have different
 RLS policy at UPDATE / DELETE from SELECT.

I agree.  That doesn't make sense, and we don't need to support it.

 On the other hand, I'm not 100% sure about my design to restrict
 rows to be updated and deleted. Similarly, it expands the target
 relation of UPDATE or DELETE statement into a sub-query with
 condition. ExecModifyTable() pulls a tuple from the sub-query,
 instead of regular table, so it seems to me working at least, but
 I didn't try all the possible cases of course.

I don't think there's any reason why that shouldn't work.  DELETE ..
USING already allows ModifyTable to be scanning a join product, so
this is not much different.

 Of course, here is some limitations, to keep the patch size reasonable
 level to review.
 - The permission to bypass RLS policy was under discussion.
  If and when we should provide a special permission to bypass RLS
  policy, the OR has_superuser_privilege() shall be replaced by
  OR has_table_privilege(tableoid, 'RLSBYPASS').

I think you're missing the point.  Everyone who has commented on this
issue is in favor of having some check that causes the RLS predicate
*not to get added in the first place*.  Adding a modified predicate is
not the same thing.  First, the query planner might not be smart
enough to optimize away the clause even when the predicate holds,
causing an unnecessary performance drain.  Second, there's too much
danger of being able to set a booby-trap for the superuser this way.
Suppose that the RLS subsystem replaces f_malicious() by f_malicious()
OR has_superuser_privilege().  Now the superuser comes along with the
nightly pg_dump run and the query optimizer executes SELECT * FROM
nuts WHERE f_malicious() OR has_superuser_privilege().  The query
optimizer notes that the cost of f_malicious() is very low and decides
to execute that before has_superuser_privilege().  Oops.  I think it's
just not acceptable to handle this by clause-munging: we need to not
add the clause in the first place.

Comments on the patch itself:

1. Please do not abbreviate rowlevel to rowlv or RowLevel to RowLv or
ROWLEVEL to ROWLV.  That makes it harder to read and harder to grep.
Spell it out.

2. Since the entirety of ATExecSetRowLvSecurity is conditional on
whether clause != NULL, you might as well split it into two functions,
one for each case.

3. The fact that you've had to hack preprocess_targetlist and
adjust_appendrel_attrs_mutator suggests to me that the insertion of
the generated subquery is happening at the wrong phase of the process.
We don't need those special cases for views, so it seems like we
shouldn't need them here, either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] lock_timeout and common SIGALRM framework

2012-06-26 Thread Boszormenyi Zoltan

On 2012-06-26 18:12, Alvaro Herrera wrote:

Excerpts from Boszormenyi Zoltan's message of Tue Jun 26 03:59:06 -0400 2012:

On 2012-06-26 06:59, Alvaro Herrera wrote:

I cleaned up the framework patch a bit.  My version's attached.  Mainly,
returning false for failure in some code paths that are only going to
have the caller elog(FATAL) is rather pointless -- it seems much better
to just have the code itself do the elog(FATAL).  In any case, I
searched for reports of those error messages being reported in the wild
and saw none.

OK. The cleanups are always good.

One nitpick, though. Your version doesn't contain the timeout.h
and compilation fails because of it. :-) You might have forgotten
to do git commit -a.

Oops.  Attached. It's pretty much the same you had, except some bools
are changed to void.


There is one thing that still bothers me on this whole framework patch,
but I'm not sure it's easily fixable.  Basically the API to timeout.c is
the whole list of timeouts and their whole behaviors.  If you want to
add a new one, you can't just call into the entry points, you have to
modify timeout.c itself (as well as timeout.h as well as whatever code
it is that you want to add timeouts to).  This may be good enough, but I
don't like it.  I think it boils down to proctimeout.h is cheating.

The interface I would actually like to have is something that lets the
modules trying to get a timeout register the timeout-checking function
as a callback.  API-wise, it would be much cleaner.  However, I'm not
really sure that code-wise it would be any cleaner at all.  In fact I
think it'd be rather cumbersome.  However, if you could give that idea
some thought, it'd be swell.

Well, I can make the registration interface similar to how LWLocks
are treated, but that doesn't avoid modification of the base_timeouts
array in case a new internal use case arises. Say:

#define USER_TIMEOUTS    4

int    n_timeouts = TIMEOUT_MAX;
static timeout_params base_timeouts[TIMEOUT_MAX + USER_TIMEOUTS];

and register_timeout() adds data after TIMEOUT_MAX.

Well, I don't expect that we're going to have many external uses.  My
point about registration is so that internal callers use it as well.  I
was thinking we could do something like xact.c does, which is to have
somewhere in proc.c a call like this:

TimeoutRegister(DEADLOCK_TIMEOUT, 1, true, CheckDeadLock, GetCurrentTimestamp);

at process startup (the magic value 1 is the priority.  Maybe there's a
better way to handle this).  That way, timeout.c or timeout.h do not
need to know anything about proc.c at all; we don't need proctimeout.h
at all.  The only thing it (the timeout module) needs to know is that
there is a symbolic constant named DEADLOCK_TIMEOUT.  Similarly for
statement timeout, etc.

When you call enable_timeout you first have to ensure that
DEADLOCK_TIMEOUT has been registered; and if it's not, die on an
Assert().  That way you ensure that there are no bugs where you try to
enable a timeout that hasn't been registered.


Currently, TimeoutName (as an index value) and priority are the same
semantically.

I would also add an Assert into register_timeout() to avoid double registration
of the same timeout, to prevent overriding the callback function pointer. If that's in
place, the TimeoutName value may still work as is: an index into 
base_timeouts[].


(If we later find the need to let external modules add timeouts, which I
find unlikely, we can have TimeoutRegister return TimeoutName and have
this value be dynamically allocated.  But for now I think that would be
useless complication).

The difference with lwlocks is that each lwlock is just a number.


Strictly speaking, so is TimeoutName.


   The
lwlock.c module doesn't need to know anything about how each lwlock is
actually used.  It's not the same thing here -- which is why I think it
would be better to have all modules, even the hardcoded ones, use a
registering interface.


OK.


... Now, it could be argued that it would be even better to have
TimeoutRegister not take the TimeoutName argument, and instead generate
it dynamically and pass it back to the caller -- that way you don't need
the enum in timeout.h.


This was what I had in mind at first ...


   The problem I have with that idea is that it
would force the caller modules to have a global variable to keep track
of that, which seems worse to me.


... and realized this as well.

So, should I keep the enum TimeoutName? Are global variables for
keeping dynamically assigned values preferred over the enum?
Currently we have 5 timeout sources in total, 3 of them are used by
regular backends, the remaining 2 are used by replication standby.
We can have a fixed size array (say with 8 or 16 elements) for future use
and this would be plenty.

Opinions?




I haven't looked at the second patch at all yet.

This is the whole point of the first patch.

I know that.  But I want to get the details of the framework right
before we move on to add more stuff to it.

Re: [HACKERS] [PATCH] lock_timeout and common SIGALRM framework

2012-06-26 Thread Alvaro Herrera

Excerpts from Boszormenyi Zoltan's message of Tue Jun 26 12:43:34 -0400 2012:

 So, should I keep the enum TimeoutName? Are global variables for
 keeping dynamically assigned values preferred over the enum?
 Currently we have 5 timeout sources in total, 3 of them are used by
 regular backends, the remaining 2 are used by replication standby.
 We can have a fixed size array (say with 8 or 16 elements) for future use
 and this would be plenty.
 
 Opinions?

My opinion is that the fixed size array is fine.

I'll go set the patch waiting on author.  Also, remember to review
some other people's patches.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] why roll-your-own s_lock? / improving scalability

2012-06-26 Thread Nils Goroll
Hi,

I am currently trying to understand what looks like really bad scalability of
9.1.3 on a 64-core 512GB RAM system: the system runs OK when at 30% usr, but only
marginal amounts of additional load seem to push it to 70% and the application
becomes highly unresponsive.

My current understanding basically matches the issues being addressed by various
9.2 improvements, well summarized in
http://wiki.postgresql.org/images/e/e8/FOSDEM2012-Multi-CPU-performance-in-9.2.pdf

An additional aspect is that, in order to address the latent risk of data loss /
corruption with WBCs and async replication, we have deliberately moved the db
from a similar system with WB cached storage to ssd based storage without a WBC,
which, by design, has (in the best WBC case) approx. 100x higher latencies, but
much higher sustained throughput.


On the new system, even at an acceptable 30% user load, oprofile makes
significant lock contention apparent:

opreport --symbols --merge tgid -l /mnt/db1/hdd/pgsql-9.1/bin/postgres


Profiling through timer interrupt
samples  %        image name               symbol name
30240    27.9720  postgres                 s_lock
5069      4.6888  postgres                 GetSnapshotData
3743      3.4623  postgres                 AllocSetAlloc
3167      2.9295  libc-2.12.so             strcoll_l
2662      2.4624  postgres                 SearchCatCache
2495      2.3079  postgres                 hash_search_with_hash_value
2143      1.9823  postgres                 nocachegetattr
1860      1.7205  postgres                 LWLockAcquire
1642      1.5189  postgres                 base_yyparse
1604      1.4837  libc-2.12.so             __strcmp_sse42
1543      1.4273  libc-2.12.so             __strlen_sse42
1156      1.0693  libc-2.12.so             memcpy

Unfortunately I don't have profiling data for the high-load / contention
condition yet, but I fear the picture will be worse and pointing in the same
direction.

<pure speculation>
In particular, the _impression_ is that lock contention could also be related to
I/O latencies, making me fear that cases could exist where spin locks are being
held while blocking on IO.
</pure speculation>


Looking at the code, it appears to me that the roll-your-own s_lock code cannot
handle a couple of cases, for instance it will also spin when the lock holder is
not running at all or blocking on IO (which could even be implicit, e.g. for a
page flush). These issues have long been addressed by adaptive mutexes and 
futexes.

Also, the s_lock code tries to be somehow adaptive using spins_per_delay (when
having spun for long (but not blocked), spin even longer in future), which
appears to me to have the potential of becoming highly counter-productive.


Now that the scene is set, here's the simple question: Why all this? Why not
simply use posix mutexes which, on modern platforms, will map to efficient
implementations like adaptive mutexes or futexes?

Thanks, Nils

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: Improve DROP FUNCTION hint

2012-06-26 Thread Robert Haas
On Mon, Jun 11, 2012 at 11:12 AM, Robert Haas robertmh...@gmail.com wrote:
 On Sat, Jun 9, 2012 at 11:42 AM, Dean Rasheed dean.a.rash...@gmail.com 
 wrote:
 Hi,

 Attached is a small patch to improve the HINT message produced by
 CREATE OR REPLACE FUNCTION when the new function definition conflicts
 with the old definition. With this patch the hint now includes the
 function's name and signature as a directly pasteable SQL command. So,
 for example, instead of

 psql:functions.sql:70: ERROR:  cannot change return type of existing function
 HINT:  Use DROP FUNCTION first.

 it now says

 psql:functions.sql:70: ERROR:  cannot change return type of existing function
 HINT:  Use DROP FUNCTION foo(integer,integer) first.

 which saves having to open the file, find the function and then type
 in the DROP statement manually.

 +1.

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics

2012-06-26 Thread Josh Berkus

 To implement it, a new array can be added in the local process memory
 to hold lwlock statistics, and counters can be updated both in the shared
 memory and the local process memory at once. Then, the session can
 retrieve 'per-session' statistics from the local process memory
 via some dedicated function.

That would be way cool from a diagnostics perspective.  Not sure what
impact it has, though.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
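
A minimal sketch of the dual-counter idea quoted above.  In PostgreSQL the
shared array would live in the shared memory segment; it is declared as an
ordinary static here only to keep the sketch self-contained, and all names
are assumptions:

    #include <stdatomic.h>
    #include <stdint.h>

    #define NUM_LWLOCKS 64              /* illustrative size */

    /* Cluster-wide counters; would be placed in shared memory. */
    static _Atomic uint64_t shared_wait_counts[NUM_LWLOCKS];

    /* Backend-local counters, private to this process. */
    static uint64_t local_wait_counts[NUM_LWLOCKS];

    /* Bump both counters at the point where a wait is recorded. */
    static inline void
    count_lwlock_wait(int lockid)
    {
        atomic_fetch_add(&shared_wait_counts[lockid], 1);
        local_wait_counts[lockid]++;    /* read back via a per-session function */
    }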


Re: [HACKERS] GiST subsplit question

2012-06-26 Thread Jeff Davis
On Tue, 2012-06-26 at 11:28 -0400, Robert Haas wrote:
 On Wed, May 30, 2012 at 4:35 PM, Alexander Korotkov
 aekorot...@gmail.com wrote:
  So, do we demote that message to a DEBUG1? Or do we make it more clear
  what the authors of a specific picksplit are supposed to do to avoid
  that problem? Or am I misunderstanding something?
 
 
  +1 for demoting the message to DEBUG1. I think it shouldn't be so noisy; it just
  indicates something could be improved.
  Also I think we definitely need to document secondary split. Now there's no
  chance to understand it without reverse engineering it from the code.
 
 I'm happy to go demote the message if we have consensus on that, but
 somebody else is going to need to provide the doc patch.  Any takers?

I was planning to do that, but it won't be for a few days at least. If
someone else wants to do it sooner, feel free.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proof concept - access to session variables on client side

2012-06-26 Thread David Fetter
On Tue, Jun 26, 2012 at 10:12:52AM +0200, Pavel Stehule wrote:
 2012/6/26 Magnus Hagander mag...@hagander.net:
  On Tue, Jun 26, 2012 at 9:50 AM, Pavel Stehule pavel.steh...@gmail.com 
  wrote:
  2012/6/26 Magnus Hagander mag...@hagander.net:
  On Tue, Jun 26, 2012 at 7:06 AM, Pavel Stehule pavel.steh...@gmail.com 
  wrote:
  Hello
 
  I worked on simple patch, that enable access from server side to
  client side data. It add two new hooks to libpq - one for returning of
  local context, second for setting of local context.
 
  A motivation is integration of possibilities of psql console together
  with stronger language - plpgsql. Second target is enabling
  possibility to save a result of some server side process in psql. It
  improve vars feature in psql.
 
  pavel ~/src/postgresql/src $ cat test.sql
  \echo value of external paremeter is :myvar
 
  do $$
  begin
   -- we can take any session variable on client side
   -- it is safe against to SQL injection
   raise notice 'external parameter accessed from plpgsql is %',
  hgetvar('myvar');
 
   -- we can change this session variable and finish transaction
   perform hsetvar('myvar', 'Hello, World');
  end;
  $$ language plpgsql;
 
  \echo new value of session variable is :myvar
 
  cat test.sql | psql postgres -v myvar=Hello
  value of external paremeter is Hello
  NOTICE:  external parameter accessed from plpgsql is Hello
  DO
  new value of session variable is Hello, World
 
  This is just proof concept - there should be better integration with
  pl languages, using cache for read on server side, ...
 
  Notices?
 
  Why not just use a custom GUC variable instead? E.g. you could have
  psql SET psql.myvar='Hello, World', and then you'd need no changes
  at all in the backend? Maybe have a shorthand interface for
  accessing GUCs in psql would help in making it easier, but do we
  really need a whole new variable concept?
 
  GUC variables doesn't help with access to psql's command line
  parameters from DO PL code.
 
  But with a small change to psql they could, without the need for a
  whole new type of variable. For example, psql could set all those
  variable as psql.commandlinevarname, which could then be accessed
  from the DO PL code just fine.
 
 yes, it is a possibility too.  It has different issues - it can send
 unwanted variables -

Could you expand on this just a bit?  Are you picturing something an
attacker could somehow use, or...?

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-26 Thread David Fetter
On Mon, Jun 25, 2012 at 01:50:54PM -0500, Kevin Grittner wrote:
 Andres Freund and...@2ndquadrant.com wrote:
  
  I am not sure were going to get all that into 9.3.
  
 Sure, that was more related to why I was questioning how much these
 use cases even *could* integrate -- whether it even paid to
 *consider* these use cases at this point.  Responses from Robert and
 you have pretty much convinced me that I was being overly
 pessimistic on that.
  
 One fine point regarding before and after images -- if a value
 doesn't change in an UPDATE, there's no reason to include it in both
 the BEFORE and AFTER tuple images, as long as we have the null
 column bitmaps -- or some other way of distinguishing unchanged from
 NULL.  (This could be especially important when the unchanged column
 was a 50 MB bytea.)

How about two bitmaps: one telling which columns are actually there,
the other with NULLs?

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
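
A minimal sketch of the two-bitmap layout David suggests: a presence bitmap
records which columns are included in the image at all, and the null bitmap
distinguishes SQL NULLs among those, so an unchanged 50 MB bytea can simply be
omitted.  The struct and field names are assumptions for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct ChangeImage
    {
        uint8_t    *present_bits;   /* bit i set: column i is in the image */
        uint8_t    *null_bits;      /* bit i set: column i is SQL NULL */
        /* ... packed datums for present, non-null columns follow ... */
    } ChangeImage;

    static inline bool
    bit_is_set(const uint8_t *bits, int i)
    {
        return (bits[i / 8] >> (i % 8)) & 1;
    }

    /* "Unchanged" is encoded as absence, not as a stored value. */
    static inline bool
    column_omitted(const ChangeImage *img, int i)
    {
        return !bit_is_set(img->present_bits, i);
    }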


Re: [HACKERS] why roll-your-own s_lock? / improving scalability

2012-06-26 Thread Merlin Moncure
On Tue, Jun 26, 2012 at 12:02 PM, Nils Goroll sl...@schokola.de wrote:
 Hi,

 I am currently trying to understand what looks like really bad scalability of
 9.1.3 on a 64-core 512GB RAM system: the system runs OK when at 30% usr, but 
 only
 only
 marginal amounts of additional load seem to push it to 70% and the application
 becomes highly unresponsive.

 My current understanding basically matches the issues being addressed by 
 various
 9.2 improvements, well summarized in
 http://wiki.postgresql.org/images/e/e8/FOSDEM2012-Multi-CPU-performance-in-9.2.pdf

 An additional aspect is that, in order to address the latent risk of data loss /
 corruption with WBCs and async replication, we have deliberately moved the db
 from a similar system with WB cached storage to ssd based storage without a 
 WBC,
 which, by design, has (in the best WBC case) approx. 100x higher latencies, 
 but
 much higher sustained throughput.


 On the new system, even at an acceptable 30% user load, oprofile makes
 significant lock contention apparent:

 opreport --symbols --merge tgid -l /mnt/db1/hdd/pgsql-9.1/bin/postgres


 Profiling through timer interrupt
 samples  %        image name               symbol name
 30240    27.9720  postgres                 s_lock
 5069      4.6888  postgres                 GetSnapshotData
 3743      3.4623  postgres                 AllocSetAlloc
 3167      2.9295  libc-2.12.so             strcoll_l
 2662      2.4624  postgres                 SearchCatCache
 2495      2.3079  postgres                 hash_search_with_hash_value
 2143      1.9823  postgres                 nocachegetattr
 1860      1.7205  postgres                 LWLockAcquire
 1642      1.5189  postgres                 base_yyparse
 1604      1.4837  libc-2.12.so             __strcmp_sse42
 1543      1.4273  libc-2.12.so             __strlen_sse42
 1156      1.0693  libc-2.12.so             memcpy

 Unfortunately I don't have profiling data for the high-load / contention
 condition yet, but I fear the picture will be worse and pointing in the same
 direction.

 <pure speculation>
 In particular, the _impression_ is that lock contention could also be related 
 to
 I/O latencies making me fear that cases could exist where spin locks are being
 held while blocking on IO.
 </pure speculation>


 Looking at the code, it appears to me that the roll-your-own s_lock code 
 cannot
 handle a couple of cases, for instance it will also spin when the lock holder 
 is
 not running at all or blocking on IO (which could even be implicit, e.g. for a
 page flush). These issues have long been addressed by adaptive mutexes and 
 futexes.

 Also, the s_lock code tries to be somehow adaptive using spins_per_delay (when
 having spun for long (but not blocked), spin even longer in future), which
 appears to me to have the potential of becoming highly counter-productive.


 Now that the scene is set, here's the simple question: Why all this? Why not
 simply use posix mutexes which, on modern platforms, will map to efficient
 implementations like adaptive mutexes or futexes?

Well, that would introduce a backend dependency on pthreads, which is
unpleasant.  Also you'd need to feature test via
_POSIX_THREAD_PROCESS_SHARED to make sure you can mutex between
processes (and configure your mutexes as such when you do).  There are
probably other reasons why this can't be done, but I personally don't
know of any.

Also, it's forbidden to do things like invoke i/o in the backend while
holding only a spinlock. As to your larger point, it's an interesting
assertion -- some data to back it up would help.

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
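
For reference, a minimal sketch of the feature test and setup Merlin
describes: probe _POSIX_THREAD_PROCESS_SHARED, then initialize a mutex living
in shared memory with the PTHREAD_PROCESS_SHARED attribute.  Error handling
is abbreviated, and whether this would actually beat s_lock is exactly the
open question:

    #include <pthread.h>
    #include <unistd.h>

    /* `m` must point into a shared memory segment visible to all backends. */
    int
    init_shared_mutex(pthread_mutex_t *m)
    {
    #if !defined(_POSIX_THREAD_PROCESS_SHARED) || _POSIX_THREAD_PROCESS_SHARED < 0
        return -1;                      /* platform cannot share mutexes */
    #else
        pthread_mutexattr_t attr;

        if (pthread_mutexattr_init(&attr) != 0)
            return -1;
        if (pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED) != 0)
            return -1;
        /* On Linux with NPTL this yields a futex-backed lock. */
        return pthread_mutex_init(m, &attr) == 0 ? 0 : -1;
    #endif
    }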


Re: [HACKERS] proof concept - access to session variables on client side

2012-06-26 Thread Pavel Stehule
2012/6/26 David Fetter da...@fetter.org:
 On Tue, Jun 26, 2012 at 10:12:52AM +0200, Pavel Stehule wrote:
 2012/6/26 Magnus Hagander mag...@hagander.net:
  On Tue, Jun 26, 2012 at 9:50 AM, Pavel Stehule pavel.steh...@gmail.com 
  wrote:
  2012/6/26 Magnus Hagander mag...@hagander.net:
  On Tue, Jun 26, 2012 at 7:06 AM, Pavel Stehule pavel.steh...@gmail.com 
  wrote:
  Hello
 
  I worked on simple patch, that enable access from server side to
  client side data. It add two new hooks to libpq - one for returning of
  local context, second for setting of local context.
 
  A motivation is integration of possibilities of psql console together
  with stronger language - plpgsql. Second target is enabling
  possibility to save a result of some server side process in psql. It
  improve vars feature in psql.
 
  pavel ~/src/postgresql/src $ cat test.sql
  \echo value of external paremeter is :myvar
 
  do $$
  begin
   -- we can take any session variable on client side
   -- it is safe against to SQL injection
   raise notice 'external parameter accessed from plpgsql is %',
  hgetvar('myvar');
 
   -- we can change this session variable and finish transaction
   perform hsetvar('myvar', 'Hello, World');
  end;
  $$ language plpgsql;
 
  \echo new value of session variable is :myvar
 
  cat test.sql | psql postgres -v myvar=Hello
  value of external paremeter is Hello
  NOTICE:  external parameter accessed from plpgsql is Hello
  DO
  new value of session variable is Hello, World
 
  This is just proof concept - there should be better integration with
  pl languages, using cache for read on server side, ...
 
  Notices?
 
  Why not just use a custom GUC variable instead? E.g. you could have
  psql SET psql.myvar='Hello, World', and then you'd need no changes
  at all in the backend? Maybe have a shorthand interface for
  accessing GUCs in psql would help in making it easier, but do we
  really need a whole new variable concept?
 
  GUC variables doesn't help with access to psql's command line
  parameters from DO PL code.
 
  But with a small change to psql they could, without the need for a
  whole new type of variable. For example, psql could set all those
  variable as psql.commandlinevarname, which could then be accessed
  from the DO PL code just fine.

 yes, it is a possibility too.  It has different issues - it can send
 unwanted variables -

 Could you expand on this just a bit?  Are you picturing something an
 attacker could somehow use, or...?

it is not a security issue - I just dislike sending the complete stack when
only one variable should be used.

Regards

Pavel


 Cheers,
 David.
 --
 David Fetter da...@fetter.org http://fetter.org/
 Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
 Skype: davidfetter      XMPP: david.fet...@gmail.com
 iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

 Remember to vote!
 Consider donating to Postgres: http://www.postgresql.org/about/donate

 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] why roll-your-own s_lock? / improving scalability

2012-06-26 Thread Tom Lane
Nils Goroll sl...@schokola.de writes:
 Now that the scene is set, here's the simple question: Why all this? Why not
 simply use posix mutexes which, on modern platforms, will map to efficient
 implementations like adaptive mutexes or futexes?

(1) They do not exist everywhere.
(2) There is absolutely no evidence to suggest that they'd make things better.

If someone cared to rectify (2), we could consider how to use them as an
alternative implementation.  But if you start with let's not support
any platforms that don't have this feature, you're going to get a cold
reception.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] why roll-your-own s_lock? / improving scalability

2012-06-26 Thread Nils Goroll
Hi Merlin,

 _POSIX_THREAD_PROCESS_SHARED

sure.

 Also, it's forbidden to do things like invoke i/o in the backend while
 holding only a spinlock. As to your larger point, it's an interesting
 assertion -- some data to back it up would help.

Let's see if I can get any. ATM I've only got indications, but no proof.

Nils

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] why roll-your-own s_lock? / improving scalability

2012-06-26 Thread Nils Goroll

 But if you start with let's not support any platforms that don't have this 
 feature

This will never be my intention.

Nils

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proof concept - access to session variables on client side

2012-06-26 Thread Merlin Moncure
On Tue, Jun 26, 2012 at 3:05 AM, Magnus Hagander mag...@hagander.net wrote:
 But with a small change to psql they could, without the need for a
 whole new type of variable. For example, psql could set all those
 variable as psql.commandlinevarname, which could then be accessed
 from the DO PL code just fine.

That's a really neat idea.

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proof concept - access to session variables on client side

2012-06-26 Thread Pavel Stehule
2012/6/26 Merlin Moncure mmonc...@gmail.com:
 On Tue, Jun 26, 2012 at 3:05 AM, Magnus Hagander mag...@hagander.net wrote:
 But with a small change to psql they could, without the need for a
 whole new type of variable. For example, psql could set all those
 variable as psql.commandlinevarname, which could then be accessed
 from the DO PL code just fine.

 That's a really neat idea.

yes, it can be a good idea - psql sends some status variables on start,
so it should be a small patch

Pavel


 merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Posix Shared Mem patch

2012-06-26 Thread Josh Berkus
Robert, all:

Last I checked, we had a reasonably acceptable patch to use mostly POSIX
shared memory with a very small SysV RAM partition.  Is there anything
keeping this from going into 9.3?  It would eliminate a major
configuration headache for our users.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] measuring spinning

2012-06-26 Thread Robert Haas
On Mon, Jun 18, 2012 at 9:36 PM, Robert Haas robertmh...@gmail.com wrote:
 On Sat, Jun 16, 2012 at 6:25 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 Well, this fell through the cracks, because I forgot to add it to the
 January CommitFest.  Here it is again, rebased.

 This applies and builds cleanly and passes make check (under enable-cassert).

 No tests or docs are needed for a patch of this nature.

 It does what it says, and we want it.

 I wondered if the change in the return signature of s_lock would have
 an effect on performance.  So I've run a series of pgbench -T 30 -P
 -c8 -j8, at a scale of 30 which fits in shared_buffers, using an
 Amazon c1.xlarge
 (8 cores).  I ran both HEAD, and HEAD+patch (without LWLOCK_STATS in
 both cases), in random ordering.  The patch was 0.37% slower, averaging
 298483 selects per second patched vs. 299582 for HEAD.  The difference is
 probably real (p value 0.042, one sided.) but is also pretty much
 negligible and could just be due to where the executable code falls in
 the cache lines which could move around with other changes to the
 code.

 I found this a bit surprising, so I retested on the IBM POWER7 box at
 concurrencies from 1 to 64 and found some test runs faster and some
 test runs slower - maybe a few more faster than slower.   I could do a
 more exact analysis, but I'm inclined to think that if there is an
 effect here, it's probably just in the noise, and that the real effect
 you measured was, as you say, the result of cache-line shifting or
 some other uninteresting artifact that could just as easily move back
 the other way on some other commit.  Really, s_lock should not be
 getting called often enough to matter much, and ignoring the return
 value of that function shouldn't cost anyone much.

 Two suggestions:

 In your original email you say "number of pg_usleep() calls that are
 required to acquire each LWLock", but nothing in the code says this.
 Just reading lwlock.c I would naively assume it is reporting the
 number of TAS spins, not the number of spin-delays (and in fact that
 is what I did assume until I read your email more carefully).  A
 comment somewhere in lwlock.c would be helpful.

 Yeah, or maybe I should just change the printout to say "spindelay"
 instead of "spin", and rename the variables likewise.

 Also in lwlock.c,

        if (sh_acquire_counts[i] || ex_acquire_counts[i] ||
 block_counts[i] || spin_counts[i])

 I don't think we can have spins (or blocks, for that matter) unless we
 have some acquires to have caused them, so the last two tests in that
 line seem to be noise.

 Perhaps so, but I think it's probably safe and clear to just follow
 the existing code pattern.

 Since my suggestions are minor, should I go ahead and mark this ready
 for committer?

 If you think it should be committed, yes.

So Jeff did that, and I've now committed the patch.

Thanks for the review.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Lazy hashaggregate when no aggregation is needed

2012-06-26 Thread Robert Haas
On Fri, Jun 22, 2012 at 10:12 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jun 19, 2012 at 5:41 AM, Etsuro Fujita
 fujita.ets...@lab.ntt.co.jp wrote:
 I'm confused by this remark, because surely the query planner does it this
 way only if there's no LIMIT.  When there is a LIMIT, we choose based on
 the startup cost plus the estimated fraction of the total cost we expect
 to pay based on dividing the LIMIT by the overall row count estimate.  Or
 is this not what you're talking about?

 I think that Ants is pointing the way of estimating costs in
 choose_hashed_grouping()/choose_hashed_distinct(), ie cost_agg() for
 cheapest_path + hashagg, where the costs are calculated based on the total
 cost only of cheapest_path.  I think that it might be good to do cost_agg()
 for the discussed case with the AGG_SORTED strategy, not the AGG_HASHED
 strategy.

 Well, Ants already made some adjustments to those functions; not sure
 if this means they need some more adjustment, but I don't see that
 there's a general problem with the costing algorithm around LIMIT.

Ants, do you intend to update this patch for this CommitFest?  Or at
all?  It seems nobody's too excited about this, so I'm not sure
whether it makes sense for you to put more work on it.  But please
advise as to your plans.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Alvaro Herrera

Excerpts from Josh Berkus's message of mar jun 26 15:49:59 -0400 2012:
 Robert, all:
 
 Last I checked, we had a reasonably acceptable patch to use mostly Posix
 Shared mem with a very small sysv ram partition.  Is there anything
 keeping this from going into 9.3?  It would eliminate a major
 configuration headache for our users.

I don't think that patch was all that reasonable.  It needed work, and
in any case it needs a rebase because it was pretty old.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.3] Row-Level Security

2012-06-26 Thread Kohei KaiGai
2012/6/26 Robert Haas robertmh...@gmail.com:
 Of course, there are some limitations, to keep the patch size at a
 reasonable level to review.
 - The permission to bypass RLS policy was under discussion.
  If and when we should provide a special permission to bypass RLS
  policy, the "OR has_superuser_privilege()" shall be replaced by
  "OR has_table_privilege(tableoid, 'RLSBYPASS')".

 I think you're missing the point.  Everyone who has commented on this
 issue is in favor of having some check that causes the RLS predicate
 *not to get added in the first place*.  Adding a modified predicate is
 not the same thing.  First, the query planner might not be smart
 enough to optimize away the clause even when the predicate holds,
 causing an unnecessary performance drain.  Second, there's too much
 danger of being able to set a booby-trap for the superuser this way.
 Suppose that the RLS subsystem replaces "f_malicious()" by "f_malicious()
 OR has_superuser_privilege()".  Now the superuser comes along with the
 nightly pg_dump run and the query optimizer executes SELECT * FROM
 nuts WHERE f_malicious() OR has_superuser_privilege().  The query
 optimizer notes that the cost of f_malicious() is very low and decides
 to execute that before has_superuser_privilege().  Oops.  I think it's
 just not acceptable to handle this by clause-munging: we need to not
 add the clause in the first place.

Here is a simple idea to avoid the second problematic scenario: assign
0 as the cost of has_superuser_privilege().  That allows keeping this
function cheaper than any possible malicious function, since
CreateFunction enforces a positive cost.

But the first point still remains.

As you pointed out before, one solution might be to handle superusers
and other users separately in the case of the simple query protocol,
which uses the same snapshot for the planner and executor stages.

How should we handle the issue?

During the previous discussion, Tom mentioned an idea of saving
prepared statements hashed with the user-id, to switch the query
plan depending on the user's privilege.
Even though I hesitated to buy this idea at that time, it might be
worth investigating in order to satisfy both security and performance;
it would generate multiple query plans to be chosen at executor
stage later.

What is your opinion?

 Comments on the patch itself:

 1. Please do not abbreviate rowlevel to rowlv or RowLevel to RowLv or
 ROWLEVEL to ROWLV.  That makes it harder to read and harder to grep.
 Spell it out.

OK,

 2. Since the entirety of ATExecSetRowLvSecurity is conditional on
 whether clause != NULL, you might as well split it into two functions,
 one for each case.

OK,

 3. The fact that you've had to hack preprocess_targetlist and
 adjust_appendrel_attrs_mutator suggests to me that the insertion of
 the generated subquery is happening at the wrong phase of the process.
 We don't need those special cases for views, so it seems like we
 shouldn't need them here, either.

The main reason why I had to patch them is the special-case handling
for references to system columns, which is unavailable for sub-queries.
But I'm not 100% sure about this implementation, so it needs
more investigation.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_terminate_backend for same-role

2012-06-26 Thread Robert Haas
On Thu, Mar 29, 2012 at 3:04 AM, Daniel Farina dan...@heroku.com wrote:
 On Mon, Mar 26, 2012 at 12:14 AM, Jeff Davis pg...@j-davis.com wrote:
 On Tue, 2012-03-20 at 01:38 -0700, Daniel Farina wrote:
 On Thu, Mar 15, 2012 at 9:39 PM, Fujii Masao masao.fu...@gmail.com wrote:
  On Fri, Mar 16, 2012 at 8:14 AM, Daniel Farina dan...@heroku.com wrote:
  Parallel to pg_cancel_backend, it'd be nice to allow the user to just
  outright kill a backend that they own (politely, with a SIGTERM),
  aborting any transactions in progress, including the idle transaction,
  and closing the socket.
 
  +1

 Here's a patch implementing the simple version, with no more guards
 against signal racing than have been seen previously.  The more
 elaborate variants to close those races is being discussed in a
 parallel thread, but I thought I'd get this simple version out there.

 Review:

 After reading through the threads, it looks like there was no real
 objection to this approach -- pid recycling is not something we're
 concerned about.

 I think you're missing a doc update though, in func.sgml:

 For the less restrictive <function>pg_cancel_backend</>, the role of an
 active backend can be found from
 the <structfield>usename</structfield> column of the
 <structname>pg_stat_activity</structname> view.

 Also, high-availability.sgml makes reference to pg_cancel_backend(), and
 it might be worthwhile to add an "...and pg_terminate_backend()" there.

 Other than that, it looks good to me.

 Good comments. Patch attached to address the doc issues.  The only
 iffy thing is that the paragraph "For the less restrictive..." I have
 opted to remove in its entirety.  I think the documents are already
 pretty clear about the same-user rule, and I'm not sure if this is the
 right place for a crash-course on attributes in pg_stat_activity (but
 maybe it is).

 ...and pg_terminate_backend seems exactly right.

 And I think now that the system post-patch doesn't have such a strange
 contrast between the ability to send SIGINT vs. SIGTERM, such a
 contrast may not be necessary.

 I'm also not sure what the policy is about filling paragraphs in the
 manual.  I filled one, which increases the sgml churn a bit. git
 (show|diff) --word-diff helps clean it up.

I went ahead and committed this.

I kinda think we should back-patch this into 9.2.  It doesn't involve
a catalog change, and would make the behavior consistent between the
two releases, instead of changing in 9.1 and then changing again in
9.2.  Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.3] Row-Level Security

2012-06-26 Thread Tom Lane
Kohei KaiGai kai...@kaigai.gr.jp writes:
 2012/6/26 Robert Haas robertmh...@gmail.com:
 I think you're missing the point.  Everyone who has commented on this
 issue is in favor of having some check that causes the RLS predicate
 *not to get added in the first place*.

 Here is a simple idea to avoid the second problematic scenario; that
 assign 0 as cost of has_superuser_privilege().

I am not sure which part of "this isn't safe" isn't getting through to
you.  Aside from the scenarios Robert mentioned, consider the
possibility that f_malicious() is marked immutable, so that the planner
is likely to call it (to replace the call with its value) before it will
ever think about whether has_superuser_privilege should be called first.

Please just do what everybody is asking for, and create a bypass that
does not require fragile, easily-broken-by-future-changes assumptions
about what the planner will do with a WHERE clause.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch submission: truncate trailing nulls from heap rows to reduce the size of the null bitmap

2012-06-26 Thread Robert Haas
On Wed, May 2, 2012 at 9:01 PM, Josh Berkus j...@agliodbs.com wrote:
 On 5/2/12 10:20 AM, Jameison Martin wrote:
 Attached are the following as per various requests:
       * test_results.txt: the performance benchmarking results,

       * TestTrailingNull.java: the performance benchmarking code, with a few 
 additional scenarios as per various requests

       * hardinfo_report.txt: some information about the hardware and OS of 
 the system on which the benchmarks were run, and

       * postgresql.conf: the postgresql.conf used when running benchmarks. 
 Note that the changes made to the vanilla postgresql.conf can be identified 
 by looking for the string 'jamie' in the file I attached (there aren't that 
 many)

 Nice, thanks.  I'll try some of my own tests when I get a chance; I have
 a really good use-case for this optimization.

Josh,

The CommitFest application lists you as the reviewer for this patch.
Are you (I hope) planning to review it?

I see you posted up a follow-up email asking Tom what he had in mind.
Personally, I don't think this needs incredibly complicated testing.
I think you should just test a workload involving inserting and/or
updating rows with lots of trailing NULL columns, and then another
workload with a table of similar width that... doesn't.  If we can't
find a regression - or, better, we find a win in one or both cases -
then I think we're done here.
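
For illustration, the core of that optimization can be sketched as
follows (a minimal sketch with assumed names, not the actual patch):
before forming a heap tuple, trailing NULL attributes are trimmed from
the stored attribute count, so the null bitmap only has to cover the
leading attributes.

#include <stdbool.h>

/* illustrative only: compute how many attributes actually need storing */
static int
effective_natts(const bool *isnull, int natts)
{
	while (natts > 0 && isnull[natts - 1])
		natts--;				/* trailing NULLs need no bitmap bits */
	return natts;
}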

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tuplesort memory usage: grow_memtuples

2012-06-26 Thread Robert Haas
On Sat, Mar 3, 2012 at 3:22 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 When sorting small tuples, the memtuples array can use a substantial
 fraction of the total per-tuple memory used. (In the case of pass by
 value, it is all of it)

 The way it grows leads to sub-optimal memory usage.
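
As a rough illustration of the problem (a sketch under assumed names,
not Jeff's actual patch): doubling the memtuples array can overshoot the
sort's memory budget and strand a large fraction of it, whereas a
memory-aware growth step clamps the new size to what actually fits.

#include <stddef.h>

/* cur_slots: current array size; slot_bytes: cost per slot;
 * avail_bytes: memory still available under the sort's budget */
static size_t
grow_memtuples_size(size_t cur_slots, size_t slot_bytes, size_t avail_bytes)
{
	size_t		wanted = cur_slots * 2;			/* classic doubling */
	size_t		fits = cur_slots + avail_bytes / slot_bytes;

	return (wanted <= fits) ? wanted : fits;	/* clamp to the budget */
}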

Greg, I see you signed up to review this on the 14th, but I don't see
a review yet.  Are you planning to post one soon?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Robert Haas
On Tue, Jun 26, 2012 at 4:29 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Excerpts from Josh Berkus's message of mar jun 26 15:49:59 -0400 2012:
 Robert, all:

 Last I checked, we had a reasonably acceptable patch to use mostly Posix
 Shared mem with a very small sysv ram partition.  Is there anything
 keeping this from going into 9.3?  It would eliminate a major
 configuration headache for our users.

 I don't think that patch was all that reasonable.  It needed work, and
 in any case it needs a rebase because it was pretty old.

Yep, agreed.

I'd like to get this fixed too, but it hasn't made it up to the top of
my list of things to worry about.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proof concept - access to session variables on client side

2012-06-26 Thread Tom Lane
Pavel Stehule pavel.steh...@gmail.com writes:
 it is not a security issue - I just dislike sending the complete stack
 when only one variable should be used.

That's a pretty darn weak argument.  If I read the patch correctly, what
you're proposing involves a dynamic fetch from the client at runtime,
which is going to be disastrous for performance.  Quite aside from the
network round trip involved, the fetch function would have to be marked
volatile (since it has client-visible side-effects, not to mention that
we don't know when the client might change the variable value); which
would really hurt any query involving it, and probably lead to yet more
round trips.

Pushing over the known values once at session start (and individual
values after updates) is likely to be vastly better-performant than
this.  Matters could be improved further by requiring variables to be
sent to the server to be explicitly marked, which seems like a good
idea anyway in case anybody has security concerns that they're not
going to let you airily dismiss.

Another thing I don't care for is the unannounced protocol extension.
This feature is just not interesting enough to justify breaking
client compatibility, but that's what it would do as proposed.
Clients that haven't heard of this 'v' message would probably
think they'd lost sync and drop the connection.

(BTW, the patch doesn't seem to include the added backend source file?)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Josh Berkus
On 6/26/12 2:13 PM, Robert Haas wrote:
 On Tue, Jun 26, 2012 at 4:29 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
 Excerpts from Josh Berkus's message of mar jun 26 15:49:59 -0400 2012:
 Robert, all:

 Last I checked, we had a reasonably acceptable patch to use mostly Posix
 Shared mem with a very small sysv ram partition.  Is there anything
 keeping this from going into 9.3?  It would eliminate a major
 configuration headache for our users.

 I don't think that patch was all that reasonable.  It needed work, and
 in any case it needs a rebase because it was pretty old.
 
 Yep, agreed.
 
 I'd like to get this fixed too, but it hasn't made it up to the top of
 my list of things to worry about.

Was there a post-AgentM version of the patch, which incorporated the
small SySV RAM partition?  I'm not finding it.


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proof concept - access to session variables on client side

2012-06-26 Thread Tom Lane
Merlin Moncure mmonc...@gmail.com writes:
 On Tue, Jun 26, 2012 at 3:05 AM, Magnus Hagander mag...@hagander.net wrote:
 But with a small change to psql they could, without the need for a
 whole new type of variable. For example, psql could set all those
 variable as psql.commandlinevarname, which could then be accessed
 from the DO PL code just fine.

 That's a really neat idea.

I do see a problem with this client-push idea, which is what happens if
psql sends a SET and later the active transaction gets rolled back.
psql does not have enough knowledge to be sure whether it lost the SET
or not.  It could hack things by always resending all variables after
any rollback, but ugh.

We could address that by inventing a non-transactional variant of SET,
perhaps.  Not sure it's worth the complication though --- I don't think
I want to have to define how that would interact with other variants
of SET in the same transaction ...

Another approach would be to define such variables as being truly
shared, in the spirit of last-update-wins multi master replication.
The backend sends over its values using the existing GUC_REPORT
mechanism.  So a rollback would cause the psql-side variable to revert
as well.  Not actually sure if that behavior would be more or less
useful than a simpler definition, but it's worth thinking about.

In this connection, there was some recent discussion in the jdbc list
of wanting clients to be able to set the GUC_REPORT flag on any GUC
variable, because the jdbc driver would like to track some settings we
have not seen fit to mark that way.  Not sure if anybody mentioned that
on -hackers yet, but it's coming.  If we had that ability then a
shared-variable behavior like this could be built entirely on the psql
side: the push part is just SET, and the pull part is GUC_REPORT.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Daniel Farina
On Tue, Jun 26, 2012 at 2:18 PM, Josh Berkus j...@agliodbs.com wrote:
 On 6/26/12 2:13 PM, Robert Haas wrote:
 On Tue, Jun 26, 2012 at 4:29 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
 Excerpts from Josh Berkus's message of mar jun 26 15:49:59 -0400 2012:
 Robert, all:

 Last I checked, we had a reasonably acceptable patch to use mostly Posix
 Shared mem with a very small sysv ram partition.  Is there anything
 keeping this from going into 9.3?  It would eliminate a major
 configuration headache for our users.

 I don't think that patch was all that reasonable.  It needed work, and
 in any case it needs a rebase because it was pretty old.

 Yep, agreed.

 I'd like to get this fixed too, but it hasn't made it up to the top of
 my list of things to worry about.

 Was there a post-AgentM version of the patch, which incorporated the
 small SySV RAM partition?  I'm not finding it.

On that, I used to be of the opinion that this is a good compromise (a
small amount of interlock space, plus mostly posix shmem), but I've
heard since then (I think via AgentM indirectly, but I'm not sure)
that there are cases where even the small SysV segment can cause
problems -- notably when other software tweaks shared memory settings
on behalf of a user, but only leaves just-enough for the software
being installed.  This is most likely on platforms that don't have a
high SysV shmem limit by default, so installers all feel the
prerogative to increase the limit, but there's no great answer for how
to compose a series of such installations.  It only takes one
installer that says "whatever, I'm just catenating stuff to
sysctl.conf that works for me" to sabotage Postgres' ability to start.

So there may be a benefit in finding a way to have no SysV memory at
all.  I wouldn't let perfect be the enemy of good to make progress
here, but it appears this was a witnessed real problem, so it may be
worth reconsidering if there is a way we can safely remove all SysV by
finding an alternative to the nattach mechanic.

-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Robert Haas
On Tue, Jun 26, 2012 at 5:18 PM, Josh Berkus j...@agliodbs.com wrote:
 On 6/26/12 2:13 PM, Robert Haas wrote:
 On Tue, Jun 26, 2012 at 4:29 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
 Excerpts from Josh Berkus's message of mar jun 26 15:49:59 -0400 2012:
 Robert, all:

 Last I checked, we had a reasonably acceptable patch to use mostly Posix
 Shared mem with a very small sysv ram partition.  Is there anything
 keeping this from going into 9.3?  It would eliminate a major
 configuration headache for our users.

 I don't think that patch was all that reasonable.  It needed work, and
 in any case it needs a rebase because it was pretty old.

 Yep, agreed.

 I'd like to get this fixed too, but it hasn't made it up to the top of
 my list of things to worry about.

 Was there a post-AgentM version of the patch, which incorporated the
 small SySV RAM partition?  I'm not finding it.

To my knowledge, no.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Josh Berkus

 On that, I used to be of the opinion that this is a good compromise (a
 small amount of interlock space, plus mostly posix shmem), but I've
 heard since then (I think via AgentM indirectly, but I'm not sure)
 that there are cases where even the small SysV segment can cause
 problems -- notably when other software tweaks shared memory settings
 on behalf of a user, but only leaves just-enough for the software
 being installed.  This is most likely on platforms that don't have a
 high SysV shmem limit by default, so installers all feel the
 prerogative to increase the limit, but there's no great answer for how
 to compose a series of such installations.  It only takes one
 installer that says "whatever, I'm just catenating stuff to
 sysctl.conf that works for me" to sabotage Postgres' ability to start.

Personally, I see this as rather an extreme case, and aside from AgentM
himself, have never run into it before.  Certainly it would be useful to
not need SysV RAM at all, but it's more important to get a working patch
for 9.3.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] why roll-your-own s_lock? / improving scalability

2012-06-26 Thread Martijn van Oosterhout
On Tue, Jun 26, 2012 at 01:46:06PM -0500, Merlin Moncure wrote:
 Well, that would introduce a backend dependency on pthreads, which is
 unpleasant.  Also you'd need to feature test via
 _POSIX_THREAD_PROCESS_SHARED to make sure you can mutex between
 processes (and configure your mutexes as such when you do).  There are
 probably other reasons why this can't be done, but I personally don't
 know of any.

And then you have fabulous things like:

https://git.reviewboard.kde.org/r/102145/
(OSX defines _POSIX_THREAD_PROCESS_SHARED but does not actually support
it.)

Seems not very well tested in any case.

It might be worthwhile testing futexes on Linux though, they are
specifically supported on any kind of shared memory (shm/mmap/fork/etc)
and quite well tested.
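
For concreteness, a minimal futex-backed lock on Linux might look like
the sketch below (illustrative only, and certainly not the actual s_lock
implementation; the lock word must be a 32-bit value living in memory
shared by all processes, 0 = free, 1 = locked):

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static void
futex_lock(atomic_int *lock)
{
	int			expected = 0;

	while (!atomic_compare_exchange_strong(lock, &expected, 1))
	{
		/* sleep in the kernel, but only while the word still reads 1 */
		syscall(SYS_futex, lock, FUTEX_WAIT, 1, NULL, NULL, 0);
		expected = 0;
	}
}

static void
futex_unlock(atomic_int *lock)
{
	atomic_store(lock, 0);
	/* naive: issues a wake on every unlock, even with no waiters */
	syscall(SYS_futex, lock, FUTEX_WAKE, 1, NULL, NULL, 0);
}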

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] pg_terminate_backend for same-role

2012-06-26 Thread Daniel Farina
On Tue, Jun 26, 2012 at 1:58 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Mar 29, 2012 at 3:04 AM, Daniel Farina dan...@heroku.com wrote:
 On Mon, Mar 26, 2012 at 12:14 AM, Jeff Davis pg...@j-davis.com wrote:
 On Tue, 2012-03-20 at 01:38 -0700, Daniel Farina wrote:
 On Thu, Mar 15, 2012 at 9:39 PM, Fujii Masao masao.fu...@gmail.com wrote:
  On Fri, Mar 16, 2012 at 8:14 AM, Daniel Farina dan...@heroku.com wrote:
  Parallel to pg_cancel_backend, it'd be nice to allow the user to just
  outright kill a backend that they own (politely, with a SIGTERM),
  aborting any transactions in progress, including the idle transaction,
  and closing the socket.
 
  +1

 Here's a patch implementing the simple version, with no more guards
 against signal racing than have been seen previously.  The more
 elaborate variants to close those races is being discussed in a
 parallel thread, but I thought I'd get this simple version out there.

 Review:

 After reading through the threads, it looks like there was no real
 objection to this approach -- pid recycling is not something we're
 concerned about.

 I think you're missing a doc update though, in func.sgml:

  For the less restrictive <function>pg_cancel_backend</>, the role of an
  active backend can be found from
  the <structfield>usename</structfield> column of the
  <structname>pg_stat_activity</structname> view.

 Also, high-availability.sgml makes reference to pg_cancel_backend(), and
  it might be worthwhile to add an "...and pg_terminate_backend()" there.

 Other than that, it looks good to me.

 Good comments. Patch attached to address the doc issues.  The only
  iffy thing is that the paragraph "For the less restrictive..." I have
  opted to remove in its entirety.  I think the documents are already
 pretty clear about the same-user rule, and I'm not sure if this is the
 right place for a crash-course on attributes in pg_stat_activity (but
 maybe it is).

 ...and pg_terminate_backend seems exactly right.

 And I think now that the system post-patch doesn't have such a strange
 contrast between the ability to send SIGINT vs. SIGTERM, such a
 contrast may not be necessary.

 I'm also not sure what the policy is about filling paragraphs in the
 manual.  I filled one, which increases the sgml churn a bit. git
 (show|diff) --word-diff helps clean it up.

 I went ahead and committed this.

 I kinda think we should back-patch this into 9.2.  It doesn't involve
 a catalog change, and would make the behavior consistent between the
 two releases, instead of changing in 9.1 and then changing again in
 9.2.  Thoughts?

I think that would be good.

It is at the level of pain where I would backpatch from day one, but
I think it would be a welcome change to people who couldn't justify
gnashing their teeth and distributing a tweaked Postgres like I would.
It saves me some effort, too.

-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Robert Haas
On Tue, Jun 26, 2012 at 5:44 PM, Josh Berkus j...@agliodbs.com wrote:

 On that, I used to be of the opinion that this is a good compromise (a
 small amount of interlock space, plus mostly posix shmem), but I've
 heard since then (I think via AgentM indirectly, but I'm not sure)
 that there are cases where even the small SysV segment can cause
 problems -- notably when other software tweaks shared memory settings
 on behalf of a user, but only leaves just-enough for the software
 being installed.  This is most likely on platforms that don't have a
 high SysV shmem limit by default, so installers all feel the
 prerogative to increase the limit, but there's no great answer for how
 to compose a series of such installations.  It only takes one
 installer that says "whatever, I'm just catenating stuff to
 sysctl.conf that works for me" to sabotage Postgres' ability to start.

 Personally, I see this as rather an extreme case, and aside from AgentM
 himself, have never run into it before.  Certainly it would be useful to
 not need SysV RAM at all, but it's more important to get a working patch
 for 9.3.

+1.

I'd sort of given up on finding a solution that doesn't involve system
V shmem anyway, but now that I think about it... what about using a
FIFO?  The man page for open on MacOS X says:

[ENXIO]    O_NONBLOCK and O_WRONLY are set, the file is a FIFO,
           and no process has it open for reading.

And Linux says:

  ENXIO  O_NONBLOCK | O_WRONLY is set, the named file is a  FIFO  and  no
 process has the file open for reading.  Or, the file is a device
 special file and no corresponding device exists.

And HP/UX says:

  [ENXIO]    O_NDELAY is set, the named file is a FIFO,
             O_WRONLY is set, and no process has the file open
             for reading.

So, what about keeping a FIFO in the data directory?  When the
postmaster starts up, it tries to open the file with O_NONBLOCK |
O_WRONLY (or O_NDELAY | O_WRONLY, if the platform has O_NDELAY rather
than O_NONBLOCK).  If that succeeds, it bails out.  If it fails with
anything other than ENXIO, it bails out.  If it fails with exactly
ENXIO, then it opens the pipe with O_RDONLY and arranges to pass the
file descriptor down to all of its children, so that a subsequent open
will fail if it or any of its children are still alive.

This might even be more reliable than what we do right now, because
our current system appears not to be robust against the removal of
postmaster.pid.
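
In code, the proposed check might look roughly like this (a sketch; the
path handling, retry logic, and the race Tom notes below are all left
out):

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns the read-side fd to keep open and pass down to children,
 * or -1 if another postmaster (or one of its children) is alive. */
static int
acquire_fifo_interlock(const char *fifo_path)
{
	int			fd = open(fifo_path, O_NONBLOCK | O_WRONLY);

	if (fd >= 0)
	{
		close(fd);				/* a reader exists: somebody is alive */
		return -1;
	}
	if (errno != ENXIO)
		return -1;				/* unexpected failure: bail out */

	/* ENXIO: no readers, hence no live postmaster; become the reader */
	return open(fifo_path, O_RDONLY | O_NONBLOCK);
}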

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Alvaro Herrera

Excerpts from Daniel Farina's message of mar jun 26 17:40:16 -0400 2012:

 On that, I used to be of the opinion that this is a good compromise (a
 small amount of interlock space, plus mostly posix shmem), but I've
 heard since then (I think via AgentM indirectly, but I'm not sure)
 that there are cases where even the small SysV segment can cause
 problems -- notably when other software tweaks shared memory settings
 on behalf of a user, but only leaves just-enough for the software
 being installed.

This argument is what killed the original patch.  If you want to get
anything done *at all* I think it needs to be dropped.  Changing shmem
implementation is already difficult enough --- you don't need to add the
requirement that the interlocking mechanism be changed simultaneously.
You (or whoever else) can always work on that as a followup patch.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-26 Thread Kevin Grittner
David Fetter da...@fetter.org wrote:
 On Mon, Jun 25, 2012 at 01:50:54PM -0500, Kevin Grittner wrote:
 
 One fine point regarding before and after images -- if a value
 doesn't change in an UPDATE, there's no reason to include it in
 both the BEFORE and AFTER tuple images, as long as we have the
 null column bitmaps -- or some other way of distinguishing
 unchanged from NULL.  (This could be especially important when
 the unchanged column was a 50 MB bytea.)
 
 How about two bitmaps: one telling which columns are actually
 there, the other with NULLs?
 
There are quite a few ways that could be done, but I suspect
Álvaro's idea is best:
 
http://archives.postgresql.org/message-id/1340654533-sup-5...@alvh.no-ip.org
 
In any event, it sounds like Andres wants to keep it as simple as
possible for the moment, and just include both tuples in their
entirety.  Hopefully that is something which can be revisited before
the last CF.
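 
For concreteness, David's two-bitmap suggestion could be laid out
something like this (purely an illustrative struct, not anyone's actual
proposal in code):

#include <stdbool.h>
#include <stdint.h>

typedef struct ChangeImage
{
	int			natts;
	uint8_t    *present_bm;		/* bit set: column included in this image */
	uint8_t    *null_bm;		/* bit set: included column is NULL */
	/* ... followed by values of the present, non-NULL columns ... */
} ChangeImage;

static inline bool
bm_test(const uint8_t *bm, int i)
{
	return (bm[i / 8] >> (i % 8)) & 1;	/* unchanged columns are simply absent */
}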
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] why roll-your-own s_lock? / improving scalability

2012-06-26 Thread Tom Lane
Martijn van Oosterhout klep...@svana.org writes:
 And then you have fabulous things like:
 https://git.reviewboard.kde.org/r/102145/
 (OSX defines _POSIX_THREAD_PROCESS_SHARED but does not actually support
 it.)

 Seems not very well tested in any case.

 It might be worthwhile testing futexes on Linux though, they are
 specifically supported on any kind of shared memory (shm/mmap/fork/etc)
 and quite well tested.

Yeah, a Linux-specific replacement of spinlocks with futexes seems like
a lot safer idea than "let's rely on posix mutexes everywhere".  It's
still unproven whether it'd be an improvement, but you could expect to
prove it one way or the other with a well-defined amount of testing.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Daniel Farina
On Tue, Jun 26, 2012 at 2:53 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:

 Excerpts from Daniel Farina's message of mar jun 26 17:40:16 -0400 2012:

 On that, I used to be of the opinion that this is a good compromise (a
 small amount of interlock space, plus mostly posix shmem), but I've
 heard since then (I think via AgentM indirectly, but I'm not sure)
 that there are cases where even the small SysV segment can cause
 problems -- notably when other software tweaks shared memory settings
 on behalf of a user, but only leaves just-enough for the software
 being installed.

 This argument is what killed the original patch.  If you want to get
 anything done *at all* I think it needs to be dropped.  Changing shmem
 implementation is already difficult enough --- you don't need to add the
 requirement that the interlocking mechanism be changed simultaneously.
 You (or whoever else) can always work on that as a followup patch.

True, but then again, I did very intentionally write:

 Excerpts from Daniel Farina's message of mar jun 26 17:40:16 -0400 2012:
 *I wouldn't let perfect be the enemy of good* to make progress
 here, but it appears this was a witnessed real problem, so it may
 be worth reconsidering if there is a way we can safely remove all
 SysV by finding an alternative to the nattach mechanic.

(Emphasis mine).

I don't think that -hackers at the time gave the zero-shmem rationale
much weight (I also was not that happy about the safety mechanism of
that patch), but upon more reflection (and taking into account *other*
software that may mangle shmem settings) I think it's something at
least worth thinking about again one more time.  What killed the patch
was an attachment to the deemed-less-safe strategy for avoiding bogus
shmem attachments already in it, but I don't seem to recall anyone
putting a whole lot of thought at the time into the zero-shmem case
from what I could read on the list, because a small interlock with
nattach seemed good-enough.

I'm simply suggesting that for additional benefits it may be worth
thinking about getting around nattach and thus SysV shmem, especially
with regard to safety, in an open-ended way.  Maybe there's a solution
(like Robert's FIFO suggestion?) that is not too onerous and can
satisfy everyone.

-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread A.M.

On Jun 26, 2012, at 5:44 PM, Josh Berkus wrote:

 
 On that, I used to be of the opinion that this is a good compromise (a
 small amount of interlock space, plus mostly posix shmem), but I've
 heard since then (I think via AgentM indirectly, but I'm not sure)
 that there are cases where even the small SysV segment can cause
 problems -- notably when other software tweaks shared memory settings
 on behalf of a user, but only leaves just-enough for the software
 being installed.  This is most likely on platforms that don't have a
 high SysV shmem limit by default, so installers all feel the
 prerogative to increase the limit, but there's no great answer for how
 to compose a series of such installations.  It only takes one
 installer that says whatever, I'm just catenating stuff to
 sysctl.conf that works for me to sabotage Postgres' ability to start.
 
 Personally, I see this as rather an extreme case, and aside from AgentM
 himself, have never run into it before.  Certainly it would be useful to
 not need SysV RAM at all, but it's more important to get a working patch
 for 9.3.


This can be trivially reproduced if one runs an old (SysV shared memory-based) 
postgresql alongside a potentially newer postgresql with a smaller SysV 
segment. This can occur with applications that bundle postgresql as part of the 
app.

Cheers,
M




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Josh Berkus

 This can be trivially reproduced if one runs an old (SysV shared 
 memory-based) postgresql alongside a potentially newer postgresql with a 
 smaller SysV segment. This can occur with applications that bundle postgresql 
 as part of the app.

I'm not saying it doesn't happen at all.  I'm saying it's not the 80%
case.

So let's fix the 80% case with something we feel confident in, and then
revisit the no-sysv interlock as a separate patch.  That way if we can't
fix the interlock issues, we still have a reduced-shmem version of Postgres.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 So, what about keeping a FIFO in the data directory?

Hm, does that work if the data directory is on NFS?  Or some other weird
not-really-Unix file system?

 When the
 postmaster starts up, it tries to open the file with O_NONBLOCK |
 O_WRONLY (or O_NDELAY | O_WRONLY, if the platform has O_NDELAY rather
 than O_NONBLOCK).  If that succeeds, it bails out.  If it fails with
 anything other than ENXIO, it bails out.  If it fails with exactly
 ENXIO, then it opens the pipe with O_RDONLY

... race condition here ...

 and arranges to pass the
 file descriptor down to all of its children, so that a subsequent open
 will fail if it or any of its children are still alive.

This might be made to work, but that doesn't sound quite right in
detail.

I remember we speculated about using an fcntl lock on some file in the
data directory, but that fails because child processes don't inherit
fcntl locks.

In the modern world, it'd be really a step forward if the lock mechanism
worked on shared storage, ie a data directory on NFS or similar could be
locked against all comers not just those on the same node as the
original postmaster.  I don't know how to do that though.

In the meantime, insisting that we solve this problem before we do
anything is a good recipe for ensuring that nothing happens, just
like it hasn't happened for the last half dozen years.  (I see Alvaro
just made the same point.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread A.M.

On Jun 26, 2012, at 6:12 PM, Daniel Farina wrote:
 
 (Emphasis mine).
 
 I don't think that -hackers at the time gave the zero-shmem rationale
 much weight (I also was not that happy about the safety mechanism of
 that patch), but upon more reflection (and taking into account *other*
 software that may mangle shmem settings) I think it's something at
 least worth thinking about again one more time.  What killed the patch
  was an attachment to the deemed-less-safe strategy for avoiding bogus
 shmem attachments already in it, but I don't seem to recall anyone
 putting a whole lot of thought at the time into the zero-shmem case
 from what I could read on the list, because a small interlock with
 nattach seemed good-enough.
 
 I'm simply suggesting that for additional benefits it may be worth
 thinking about getting around nattach and thus SysV shmem, especially
 with regard to safety, in an open-ended way.  Maybe there's a solution
 (like Robert's FIFO suggestion?) that is not too onerous and can
 satisfy everyone.


I solved this via fcntl locking. I also set up gdb to break in critical regions 
to test the interlock and I found no flaw in the design. More eyes would be 
welcome, of course.
https://github.com/agentm/postgres/tree/posix_shmem

Cheers,
M




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 So let's fix the 80% case with something we feel confident in, and then
 revisit the no-sysv interlock as a separate patch.  That way if we can't
 fix the interlock issues, we still have a reduced-shmem version of Postgres.

Yes.  Insisting that we have the whole change in one patch is a good way
to prevent any forward progress from happening.  As Alvaro noted, there
are plenty of issues to resolve without trying to change the interlock
mechanism at the same time.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 
 In the meantime, insisting that we solve this problem before we do
 anything is a good recipe for ensuring that nothing happens, just
 like it hasn't happened for the last half dozen years.  (I see
 Alvaro just made the same point.)
 
And now so has Josh.
 
+1 from me, too.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Tom Lane
A.M. age...@themactionfaction.com writes:
 This can be trivially reproduced if one runs an old (SysV shared 
 memory-based) postgresql alongside a potentially newer postgresql with a 
 smaller SysV segment. This can occur with applications that bundle postgresql 
 as part of the app.

I don't believe that that case is a counterexample to what's being
proposed (namely, grabbing a minimum-size shmem segment, perhaps 1K).
It would only fail if the old postmaster ate up *exactly* SHMMAX worth
of shmem, which is not real likely.  As a data point, on my Mac laptop
with SHMMAX set to 32MB, 9.2 will by default eat up 31624KB, leaving
more than a meg available.  Sure, that isn't enough to start another
old-style postmaster, but it would be plenty of room for one that only
wants 1K.

Even if you actively try to configure the shmem settings to exactly
fill shmmax (which I concede some installation scripts might do),
it's going to be hard to do because of the 8K granularity of the main
knob, shared_buffers.  Moreover, an installation script that did that
would soon learn not to, because of the fact that we don't worry too
much about changing small details of shared memory consumption in minor
releases.
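
As a sketch of the hybrid arrangement under discussion (names, sizes,
and error handling are illustrative): keep a tiny SysV stub whose
nattach count serves as the liveness interlock, and allocate the real
segment with POSIX shared memory.

#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stddef.h>

void *
create_hybrid_shmem(key_t key, const char *posix_name, size_t size)
{
	int			shmid = shmget(key, 1024, IPC_CREAT | IPC_EXCL | 0600);

	if (shmid < 0)
		return NULL;			/* may mean an old postmaster still lives */
	(void) shmat(shmid, NULL, 0);	/* stay attached: nattach > 0 while we live */

	int			fd = shm_open(posix_name, O_CREAT | O_RDWR, 0600);

	if (fd < 0 || ftruncate(fd, (off_t) size) < 0)
		return NULL;
	return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}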

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] experimental: replace s_lock spinlock code with pthread_mutex on linux

2012-06-26 Thread Nils Goroll
 It's
 still unproven whether it'd be an improvement, but you could expect to
 prove it one way or the other with a well-defined amount of testing.

I've hacked the code to use adaptive pthread mutexes instead of spinlocks; see
the attached patch. The patch is for the git head, but it can easily be applied to
9.1.3, which is what I did for my tests.

This had disastrous effects on Solaris because it does not use anything similar
to futexes for PTHREAD_PROCESS_SHARED mutexes (only the _PRIVATE mutexes do
without syscalls for the simple case).

But I was surprised to see that it works relatively well on linux. Here's a
glimpse of my results:

hacked code 9.1.3:

-bash-4.1$ rsync -av --delete /tmp/test_template_data/ ../data/ ; /usr/bin/time
./postgres -D ../data -p 55502 & ppid=$! ; pid=$(pgrep -P $ppid ) ; sleep 15 ;
./pgbench -c 768 -t 20 -j 128 -p 55502 postgres ; kill $pid
sending incremental file list
...
transaction type: TPC-B (sort of)
scaling factor: 10
query mode: simple
number of clients: 768
number of threads: 128
number of transactions per client: 20
number of transactions actually processed: 15360/15360
tps = 476.873261 (including connections establishing)
tps = 485.964355 (excluding connections establishing)
LOG:  received smart shutdown request
LOG:  autovacuum launcher shutting down
-bash-4.1$ LOG:  shutting down
LOG:  database system is shut down
210.58user 78.88system 0:50.64elapsed 571%CPU (0avgtext+0avgdata
1995968maxresident)k
0inputs+1153872outputs (0major+2464649minor)pagefaults 0swaps

original code (vanilla build on amd64) 9.1.3:

-bash-4.1$ rsync -av --delete /tmp/test_template_data/ ../data/ ; /usr/bin/time
./postgres -D ../data -p 55502 & ppid=$! ; pid=$(pgrep -P $ppid ) ; sleep 15 ;
./pgbench -c 768 -t 20 -j 128 -p 55502 postgres ; kill $pid
sending incremental file list
...
transaction type: TPC-B (sort of)
scaling factor: 10
query mode: simple
number of clients: 768
number of threads: 128
number of transactions per client: 20
number of transactions actually processed: 15360/15360
tps = 499.993685 (including connections establishing)
tps = 510.410883 (excluding connections establishing)
LOG:  received smart shutdown request
-bash-4.1$ LOG:  autovacuum launcher shutting down
LOG:  shutting down
LOG:  database system is shut down
196.21user 71.38system 0:47.99elapsed 557%CPU (0avgtext+0avgdata
1360800maxresident)k
0inputs+1147904outputs (0major+2375965minor)pagefaults 0swaps


config:

-bash-4.1$ egrep '^[a-z]' /tmp/test_template_data/postgresql.conf
max_connections = 1800  # (change requires restart)
shared_buffers = 10GB   # min 128kB
temp_buffers = 64MB # min 800kB
work_mem = 256MB# min 64kB, default 1MB
maintenance_work_mem = 2GB  # min 1MB, default 16MB
bgwriter_delay = 10ms   # 10-10000ms between rounds
bgwriter_lru_maxpages = 1000# 0-1000 max buffers written/round
bgwriter_lru_multiplier = 10.0  # 0-10.0 multiplier on buffers scanned/round
wal_level = hot_standby # minimal, archive, or hot_standby
wal_buffers = 64MB  # min 32kB, -1 sets based on 
shared_buffers
commit_delay = 1# range 0-100000, in microseconds
datestyle = 'iso, mdy'
lc_messages = 'en_US.UTF-8' # locale for system error 
message
lc_monetary = 'en_US.UTF-8' # locale for monetary formatting
lc_numeric = 'en_US.UTF-8'  # locale for number formatting
lc_time = 'en_US.UTF-8' # locale for time formatting
default_text_search_config = 'pg_catalog.english'
seq_page_cost = 1.0 # measured on an arbitrary scale
random_page_cost = 1.5  # same scale as above (default: 4.0)
cpu_tuple_cost = 0.005
cpu_index_tuple_cost = 0.0025
cpu_operator_cost = 0.0001
effective_cache_size = 192GB



So it looks like using pthread_mutexes could at least be an option on Linux.

Using futexes directly could be even cheaper.


As a side note, it looks like I have not expressed myself clearly:

I did not intend to suggest to replace proven, working code (which probably is
the best you can get for some platforms) with posix calls. I apologize for the
provocative question.


Regarding the actual production issue, I did not manage to synthetically provoke
the saturation we are seeing in production using pgbench - I could not even get
anywhere near the production load. So I cannot currently test if reducing the
amount of spinning and waking up exactly one waiter (which is what linux/nptl
pthread_mutex_unlock does) would solve/mitigate the production issue I am
working on, and I'd highly appreciate any pointers in this direction.

Cheers, Nils
diff --git a/src/backend/storage/lmgr/s_lock.c 
b/src/backend/storage/lmgr/s_lock.c
index bc8d89f..a45fdf6 100644
--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -20,6 +20,8 @@
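
Conceptually, the approach is to initialize each lock as a
process-shared, adaptive mutex; a minimal sketch of that setup
(illustrative names, error handling omitted, and
PTHREAD_MUTEX_ADAPTIVE_NP is glibc-specific) looks like:

#define _GNU_SOURCE
#include <pthread.h>

static void
shared_adaptive_mutex_init(pthread_mutex_t *lock)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	/* usable from shared memory across processes */
	pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
	/* spin briefly in userspace before sleeping in the kernel */
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
	pthread_mutex_init(lock, &attr);
	pthread_mutexattr_destroy(&attr);
}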
 
 

Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Alvaro Herrera

Excerpts from Tom Lane's message of mar jun 26 18:58:45 -0400 2012:

 Even if you actively try to configure the shmem settings to exactly
 fill shmmax (which I concede some installation scripts might do),
 it's going to be hard to do because of the 8K granularity of the main
 knob, shared_buffers.

Actually it's very easy -- just try to start postmaster on a system with
not enough shmmax and it will tell you how much shmem it wants.  Then
copy that number verbatim in the config file.  This might fail on picky
systems such as MacOSX that require some exact multiple or power of some
other parameter, but it works fine on Linux.

I think the minimum you can request, at least on Linux, is 1 byte.

 Moreover, an installation script that did that
 would soon learn not to, because of the fact that we don't worry too
 much about changing small details of shared memory consumption in minor
 releases.

+1

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] embedded list v2

2012-06-26 Thread Andres Freund
Hi,

To recapitulate why I think this sort of embedded list is worthwhile:
* minimal memory overhead (16 bytes for double linked list heads/nodes on 
64bit systems)
* no additional memory allocation overhead
* no additional dereference to access the contents of a list element
* most modifications are completely branchless
* the current dllist.h interface has double the memory overhead and much more 
complex manipulation operators
* Multiple places in postgres have grown local single or double linked list 
implementations
* I need it ;)

Attached are three patches:
1. embedded list implementation
2. make the list implementation usable without USE_INLINE
3. convert all callers to ilist.h away from dllist.h

For 1 I:
a. added more comments and some introduction, some more functions
b. moved the file from utils/ilist.h to lib/ilist.h
c. actually included the c file with the check functions
d. did *not* split it up into single/double linked list files, doesn't seem to 
be worth the trouble given how small ilist.(c|h) are
e. did *not* try to get an interface similar to dllist.h. I don't think the 
old one is better and it makes the breakage more obvious should somebody else 
use the old implementation although I doubt it.

I can be convinced to do d. and e. but I don't think they are an improvement.

For 2 I used ugly macro hackery to avoid declaring every function twice, in a 
c file and in a header.

Opinions on the state of the above patches?

I did not expect any performance difference in the current usage, but just to 
be sure I ran the following tests:

connection heavy:
pgbench -n -S  -p 5501 -h /tmp -U andres postgres -c 16 -j 16 -T 10 -C
master: 3109 3024 3012
ilist:  3097 3033 3024

somewhat SearchCatCache heavy:
pgbench -n -S  -p 5501 -h /tmp -U andres postgres -T 100 -c 16 -j 1
master: 98979.453879 99554.485631 99393.587880
ilist:  98960.545559 99583.319870 99498.923273

As expected the differences are on the level of noise...

Greetings,

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 2e9d955fbb625004061509a62ecca83fde68d027 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Sun, 6 May 2012 00:26:35 +0200
Subject: [PATCH 1/3] Add embedded list interface (header only)

Adds a single and a double linked list which can easily be embedded into other
data structures and can be used without any additional allocations.

Problematic: It requires USE_INLINE to be used. It could be remade to fall back
to externally defined functions if that is not available, but that hardly
seems sensible at this day and age. Besides, the speed hit would be noticeable,
and it's only used in new code, which could be disabled on machines - given they
still exist - without proper support for inline functions.

 3 files changed, 509 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 2e1061e..c94297a 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/lib
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = dllist.o stringinfo.o
+OBJS = dllist.o stringinfo.o ilist.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/ilist.c b/src/backend/lib/ilist.c
new file mode 100644
index 000..72de7a3
--- /dev/null
+++ b/src/backend/lib/ilist.c
@@ -0,0 +1,79 @@
+/*-
+ *
+ * ilist.c
+ *	  support for integrated/inline double and single linked lists
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/lib/ilist.c
+ *
+ * NOTES
+ *
+ *	  This function only contains testing code for inline single/double linked
+ *	  lists.
+ *
+ *-
+ */
+
+#include "postgres.h"
+
+#include "lib/ilist.h"
+
+#ifdef ILIST_DEBUG
+void ilist_d_check(ilist_d_head* head)
+{
+ilist_d_node *cur;
+
+if(!head ||
+   !(&head->head))
+	elog(ERROR, "double linked list head is not properly initialized");
+
+for(cur = head->head.next;
+cur != &head->head;
+cur = cur->next){
+	if(!(cur) ||
+	   !(cur->next) ||
+	   !(cur->prev) ||
+	   !(cur->prev->next == cur) ||
+	   !(cur->next->prev == cur))
+	{
+		elog(ERROR, "double linked list is corrupted");
+	}
+}
+
+for(cur = head->head.prev;
+cur != &head->head;
+cur = cur->prev){
+	if(!(cur) ||
+	   !(cur->next) ||
+	   !(cur->prev) ||
+	   !(cur->prev->next == cur) ||
+	   !(cur->next->prev == cur))
+	{
+		elog(ERROR, "double linked list is corrupted");
+	}
+}
+}
+
+void ilist_s_check(ilist_s_head* head)
+{
+ilist_s_node *cur;
+
+if(!head ||
+   !(&head->head))
+	elog(ERROR, "single 

Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Tom Lane
A.M. age...@themactionfaction.com writes:
 On Jun 26, 2012, at 6:12 PM, Daniel Farina wrote:
 I'm simply suggesting that for additional benefits it may be worth
 thinking about getting around nattach and thus SysV shmem, especially
 with regard to safety, in an open-ended way.

 I solved this via fcntl locking.

No, you didn't, because fcntl locks aren't inherited by child processes.
Too bad, because they'd be a great solution otherwise.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH 08/16] Introduce the ApplyCache module which can reassemble transactions from a stream of interspersed changes

2012-06-26 Thread Andres Freund
Hi Steve,

On Tuesday, June 26, 2012 02:14:22 AM Steve Singer wrote:
 I planned to have some cutoff 'max_changes_in_memory_per_txn' value.
  If it has
  been reached for one transaction all existing changes are spilled to
  disk. New changes again can be kept in memory till its reached again.
 Do you want max_changes_per_in_memory_txn or do you want to put a limit
 on the total amount of memory that the cache is able to use? How are you
 going to tell a DBA to tune max_changes_in_memory_per_txn? They know how
 much memory their system has and that they can devote to the apply cache
 versus other things. Giving them guidance on estimating how many
 open transactions they might have at a point in time and how many
 WAL change records each transaction generates seems like a step
 backwards from the progress we've been making in getting PostgreSQL to
 be easier to tune.  The maximum number of transactions that could be
 open at a time is governed by max_connections on the master at the
 time the WAL was generated, so I don't even see how the machine
 processing the WAL records could autotune/guess that.
It can even be significantly higher than max_connections because 
subtransactions are only recognizable as part of their parent transaction 
upon commit.

I think max_changes_in_memory_per_txn will be the number of changes for now. 
Making memory-based accounting across multiple concurrent transactions work 
efficiently and correctly isn't easy.
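
A rough sketch of the count-based spilling being described; every name here
is an illustrative assumption, not the patch's actual code:

#include <stddef.h>

typedef struct Change { struct Change *next; } Change;

typedef struct TXN
{
	Change	   *changes;		/* in-memory change list */
	size_t		nentries_mem;	/* number of changes currently in memory */
} TXN;

/* illustrative stand-in for the discussed max_changes_in_memory_per_txn */
static const size_t max_changes_in_memory_per_txn = 4096;

static void spill_to_disk(TXN *txn) { txn->changes = NULL; }	/* stub */

static void
apply_cache_add_change(TXN *txn, Change *change)
{
	change->next = txn->changes;
	txn->changes = change;

	/* once the per-transaction cutoff is hit, spill and start afresh */
	if (++txn->nentries_mem >= max_changes_in_memory_per_txn)
	{
		spill_to_disk(txn);
		txn->nentries_mem = 0;
	}
}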


  We need to support serializing the cache for crash recovery + shutdown of
  the receiving side as well. Depending on how we do the wal decoding we
  will need it more frequently...
 Have you described your thoughts on crash recovery on another thread?
I think I have somewhere, but given how much in flux our thoughts on decoding 
are, I think it's not that important yet.

 I am thinking that this module would have to serialize some state
 every time it calls cache->commit() to ensure that consumers don't get
 invoked twice on the same transaction.
In one of the other patches I implemented it by adding the (origin_id, 
origin_lsn) pair to replicated commits. During recovery the startup process 
sets up in shared memory the point up to which we applied.
If you then every now and then perform a 'logical checkpoint', writing down 
what the beginning LSN of the longest in-progress transaction is, you can 
fully recover from that point on.

 If the apply module is making changes to the same backend that the apply
 cache serializes to then both the state for the apply cache and the
 changes that committed changes/transactions make will be persisted (or
 not persisted) together.   What if I am replicating from x86 to x86_64
 via a apply module that does textout conversions?
 
 x86                Proxy                    x86_64
 
 WAL  ----------->  apply
                    cache
                      |   (proxy catalog)
                    apply module
                    textout  ------------->  SQL statements
 
 
 How do we ensure that the commits are all visible (or not visible) on
 the catalog on the proxy instance used for decoding WAL, the destination
 database, and the state + spill files of the apply cache stay consistent
 in the event of a crash of either the proxy or the target?
 I don't think you can (unless we consider two-phase commit, and I'd
 rather we didn't).  Can we come up with a way of avoiding the need for
 them to be consistent with each other?
That's discussed in the "Catalog/Metadata consistency during changeset 
extraction from wal" thread and we haven't yet determined which solution is 
the best ;)

 Code Review
 =
 
 applycache.h
 ---
 +typedef struct ApplyCacheTupleBuf
 +{
 +/* position in preallocated list */
 +ilist_s_node node;
 +
 +HeapTupleData tuple;
 +HeapTupleHeaderData header;
 +char data[MaxHeapTupleSize];
 +} ApplyCacheTupleBuf;
 
 Each ApplyCacheTupleBuf will be about 8k (BLKSZ) big no matter how big
 the data in the transaction is? Wouldn't workloads with inserts of lots
 of small rows in a transaction eat up lots of memory that is allocated
 but storing nothing?  The only alternative I can think of is dynamically
 allocating these and I don't know what the cost/benefit of that overhead
 will be versus spilling to disk sooner.
Dynamically allocating them totally destroys performance, I tried that. I 
think at some point we should have 4 or so lists of preallocated tuple buffers 
of different sizes and then use the smallest possible one. But I think this 
solution is ok in the very first version.

If you allocate dynamically you also get a noticeable performance drop when 
you let the decoding run for a while because of fragmentation inside the 
memory allocator.
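
A rough sketch of that size-class idea, with hypothetical names and sizes
rather than anything from the patch:

#include <stdlib.h>

#define N_SIZE_CLASSES 4

static const size_t class_size[N_SIZE_CLASSES] = {64, 512, 2048, 8192};
static void *freelist[N_SIZE_CLASSES];	/* singly linked free list per class */

static void *
tuplebuf_alloc(size_t needed)
{
	int			i;

	for (i = 0; i < N_SIZE_CLASSES; i++)
	{
		if (needed <= class_size[i])
		{
			if (freelist[i] != NULL)
			{
				void	   *buf = freelist[i];

				freelist[i] = *(void **) buf;	/* pop a preallocated buffer */
				return buf;
			}
			return malloc(class_size[i]);		/* grow this class lazily */
		}
	}
	return NULL;				/* bigger than the largest class: caller spills */
}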

 +* FIXME: better name
 + */
 +ApplyCacheChange*
 +ApplyCacheGetChange(ApplyCache*);
 
 How about:
 
 ApplyCacheReserveChangeStruct(..)
 

Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread A.M.

On 06/26/2012 07:30 PM, Tom Lane wrote:

A.M. age...@themactionfaction.com writes:

On Jun 26, 2012, at 6:12 PM, Daniel Farina wrote:

I'm simply suggesting that for additional benefits it may be worth
thinking about getting around nattach and thus SysV shmem, especially
with regard to safety, in an open-ended way.



I solved this via fcntl locking.


No, you didn't, because fcntl locks aren't inherited by child processes.
Too bad, because they'd be a great solution otherwise.



You claimed this last time and I replied:
http://archives.postgresql.org/pgsql-hackers/2011-04/msg00656.php

I address this race condition by ensuring that a lock-holding violator 
is the postmaster or a postmaster child. If such a condition is 
detected, the child exits immediately without touching the shared 
memory. POSIX shmem is inherited via file descriptors.


This is possible because the locking API allows one to request which PID 
violates the lock. The child expects the lock to be held and checks that 
the PID is the parent. If the lock is not held, that means that the 
postmaster is dead, so the child exits immediately.
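
A minimal sketch of that check, using only the standard fcntl interface;
error handling and the surrounding startup logic are elided:

#include <fcntl.h>
#include <unistd.h>
#include <stdbool.h>
#include <string.h>

static bool
parent_holds_lock(int fd)
{
	struct flock fl;

	memset(&fl, 0, sizeof(fl));
	fl.l_type = F_WRLCK;		/* probe: would an exclusive lock succeed? */
	fl.l_whence = SEEK_SET;		/* start 0, len 0: the whole file */

	if (fcntl(fd, F_GETLK, &fl) < 0)
		return false;
	if (fl.l_type == F_UNLCK)
		return false;			/* nobody holds it: the postmaster is gone */
	return fl.l_pid == getppid();	/* l_pid reports the lock holder */
}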


Cheers,
M

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread A.M.

On 06/26/2012 07:15 PM, Alvaro Herrera wrote:


Excerpts from Tom Lane's message of mar jun 26 18:58:45 -0400 2012:


Even if you actively try to configure the shmem settings to exactly
fill shmmax (which I concede some installation scripts might do),
it's going to be hard to do because of the 8K granularity of the main
knob, shared_buffers.


Actually it's very easy -- just try to start postmaster on a system with
not enough shmmax and it will tell you how much shmem it wants.  Then
copy that number verbatim in the config file.  This might fail on picky
systems such as MacOSX that require some exact multiple or power of some
other parameter, but it works fine on Linux.



Except that we have to account for other installers. A user can install 
an application in the future which clobbers the value and then the 
original application will fail to run. The options to get the first app 
working are:


a) to re-install the first app (potentially preventing the second app 
from running)
b) to have the first app detect the failure and readjust the value 
(guessing what it should be) and potentially forcing a reboot
c) to have the user manually adjust the value and potentially force 
a reboot


The failure usually gets blamed on the first application.

That's why we had to nuke SysV shmem.

Cheers,
M



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Robert Haas
On Tue, Jun 26, 2012 at 6:20 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 So, what about keeping a FIFO in the data directory?

 Hm, does that work if the data directory is on NFS?  Or some other weird
 not-really-Unix file system?

I would expect NFS to work in general.  We could test that.  Of
course, it's more than possible that there's some bizarre device out
there that purports to be NFS but doesn't actually support mkfifo.
It's difficult to prove a negative.

 When the
 postmaster starts up, it tries to open the file with O_NONBLOCK |
 O_WRONLY (or O_NDELAY | O_WRONLY, if the platform has O_NDELAY rather
 than O_NONBLOCK).  If that succeeds, it bails out.  If it fails with
 anything other than ENXIO, it bails out.  If it fails with exactly
 ENXIO, then it opens the pipe with O_RDONLY

 ... race condition here ...

Oh, if someone tries to start two postmasters at the same time?  Hmm.

 and arranges to pass the
 file descriptor down to all of its children, so that a subsequent open
 will fail if it or any of its children are still alive.

 This might be made to work, but that doesn't sound quite right in
 detail.
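
For concreteness, a sketch of the FIFO probe being discussed, assuming the
FIFO already exists in the data directory (the path and policy are
illustrative, and Tom's race-condition caveat applies between the two opens):

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

static int
acquire_fifo_interlock(const char *path)
{
	int			fd = open(path, O_WRONLY | O_NONBLOCK);

	if (fd >= 0 || errno != ENXIO)
	{
		/* a reader exists (or unexpected error): a postmaster is running */
		if (fd >= 0)
			close(fd);
		return -1;
	}

	/* ENXIO: no readers, hence no live postmaster; become the reader */
	fd = open(path, O_RDONLY | O_NONBLOCK);
	return fd;					/* held open; inherited by children on fork() */
}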

 I remember we speculated about using an fcntl lock on some file in the
 data directory, but that fails because child processes don't inherit
 fcntl locks.

 In the modern world, it'd be really a step forward if the lock mechanism
 worked on shared storage, ie a data directory on NFS or similar could be
 locked against all comers not just those on the same node as the
 original postmaster.  I don't know how to do that though.

Well, I think that in theory that DOES work.  But I also think it's
often misconfigured.  Which could also be said of NFS in general.

 In the meantime, insisting that we solve this problem before we do
 anything is a good recipe for ensuring that nothing happens, just
 like it hasn't happened for the last half dozen years.  (I see Alvaro
 just made the same point.)

Agreed all around.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Tom Lane
A.M. age...@themactionfaction.com writes:
 On 06/26/2012 07:30 PM, Tom Lane wrote:
 I solved this via fcntl locking.

 No, you didn't, because fcntl locks aren't inherited by child processes.
 Too bad, because they'd be a great solution otherwise.

 You claimed this last time and I replied:
 http://archives.postgresql.org/pgsql-hackers/2011-04/msg00656.php

 I address this race condition by ensuring that a lock-holding violator 
 is the postmaster or a postmaster child. If such a condition is 
 detected, the child exits immediately without touching the shared 
 memory. POSIX shmem is inherited via file descriptors.

 This is possible because the locking API allows one to request which PID 
 violates the lock. The child expects the lock to be held and checks that 
 the PID is the parent. If the lock is not held, that means that the 
 postmaster is dead, so the child exits immediately.

OK, I went back and re-read the original patch, and I now agree that
something like this is possible --- but I don't like the way you did
it. The dependence on particular PIDs seems both unnecessary and risky.

The key concept here seems to be that the postmaster first stakes a
claim on the data directory by exclusive-locking a lock file.  If
successful, it reduces that lock to shared mode (which can be done
atomically, according to the SUS fcntl specification), and then holds
the shared lock until it exits.  Spawned children will not initially
have a lock, but what they can do is attempt to acquire shared lock on
the lock file.  If fail, exit.  If successful, *check to see that the
parent postmaster is still alive* (ie, getppid() != 1).  If so, the
parent must have been continuously holding the lock, and the child has
successfully joined the pool of shared lock holders.  Otherwise bail
out without having changed anything.  It is the "parent is still alive"
check, not any test on individual PIDs, that makes this work.
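
A minimal sketch of that protocol, using only standard fcntl locking; the
names and error handling are illustrative:

#include <fcntl.h>
#include <unistd.h>
#include <stdbool.h>
#include <string.h>

static bool
set_lock(int fd, short type)	/* F_WRLCK or F_RDLCK */
{
	struct flock fl;

	memset(&fl, 0, sizeof(fl));
	fl.l_type = type;
	fl.l_whence = SEEK_SET;		/* start 0, len 0: lock the whole file */
	return fcntl(fd, F_SETLK, &fl) == 0;
}

/* postmaster: stake an exclusive claim, then atomically reduce to shared */
static bool
postmaster_stake_claim(int fd)
{
	return set_lock(fd, F_WRLCK) && set_lock(fd, F_RDLCK);
}

/* spawned child: join the pool of shared lockers, then verify the parent */
static bool
child_join(int fd)
{
	if (!set_lock(fd, F_RDLCK))
		return false;			/* exclusively locked: a new postmaster owns it */
	return getppid() != 1;		/* parent alive, so it held the lock throughout */
}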

There are two concrete reasons why I don't care for the
GetPIDHoldingLock() way.  Firstly, the fact that you can get a blocking
PID from F_GETLK isn't an essential part of the concept of file locking
IMO --- it's just an incidental part of this particular API.  May I
remind you that the reason we're stuck on SysV shmem in the first place
is that we decided to depend on an incidental part of that API, namely
nattch?  I would like to not require file locking to have any semantics
more specific than a process can hold an exclusive or a shared lock on
a file, which is auto-released at process exit.  Secondly, in an NFS
world I don't believe that the returned l_pid value can be trusted for
anything.  If it's a PID from a different machine then it might
accidentally conflict with one on our machine, or not.

Reflecting on this further, it seems to me that the main remaining
failure modes are (1) file locking doesn't work, or (2) idiot DBA
manually removes the lock file.  Both of these could be ameliorated
with some refinements to the basic idea.  For (1), I suggest that
we tweak the startup process (only) to attempt to acquire exclusive lock
on the lock file.  If it succeeds, we know that file locking is broken,
and we can complain.  (This wouldn't help for cases where cross-machine
locking is broken, but I see no practical way to detect that.)
For (2), the problem really is that the proposed patch conflates the PID
file with the lock file, but people are conditioned to think that PID
files are removable.  I suggest that we create a separate, permanently
present file that serves only as the lock file and doesn't ever get
modified (it need have no content other than the string "Don't remove
this!").  It'd be created by initdb, not by individual postmaster runs;
indeed the postmaster should fail if it doesn't find the lock file
already present.  The postmaster PID file should still exist with its
current contents, but it would serve mostly as documentation and as
server-contact information for pg_ctl; it would not be part of the data
directory locking mechanism.

I wonder whether this design can be adapted to Windows?  IIRC we do
not have a bulletproof data directory lock scheme for Windows.
It seems like this makes few enough demands on the lock mechanism
that there ought to be suitable primitives available there too.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Catalog/Metadata consistency during changeset extraction from wal

2012-06-26 Thread David Fetter
On Tue, Jun 26, 2012 at 05:05:27PM -0500, Kevin Grittner wrote:
 David Fetter da...@fetter.org wrote:
  On Mon, Jun 25, 2012 at 01:50:54PM -0500, Kevin Grittner wrote:
  
  One fine point regarding before and after images -- if a value
  doesn't change in an UPDATE, there's no reason to include it in
  both the BEFORE and AFTER tuple images, as long as we have the
  null column bitmaps -- or some other way of distinguishing
  unchanged from NULL.  (This could be especially important when
  the unchanged column was a 50 MB bytea.)
  
  How about two bitmaps: one telling which columns are actually
  there, the other with NULLs?
  
 There are quite a few ways that could be done, but I suspect
 Álvaro's idea is best:
  
 http://archives.postgresql.org/message-id/1340654533-sup-5...@alvh.no-ip.org

Looks great (or at least way better than mine) to me :)

 In any event, it sounds like Andres wants to keep it as simple as
 possible for the moment, and just include both tuples in their
 entirety.  Hopefully that is something which can be revisited before
 the last CF.

I hope so, too...

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Tom Lane
I wrote:
 Reflecting on this further, it seems to me that the main remaining
 failure modes are (1) file locking doesn't work, or (2) idiot DBA
 manually removes the lock file.

Oh, wait, I just remembered the really fatal problem here: to quote from
the SUS fcntl spec,

All locks associated with a file for a given process are removed
when a file descriptor for that file is closed by that process
or the process holding that file descriptor terminates.

That carefully says "a file descriptor", not "the file descriptor"
through which the lock was acquired.  Any close() referencing the lock
file will do.  That means that it is possible for perfectly innocent
code --- for example, something that scans all files in the data
directory, as say pg_basebackup might do --- to cause a backend process
to lose its lock.  When we looked at this before, it seemed like a
showstopper.  Even if we carefully taught every directory-scanning loop
in postgres not to touch the lock file, we cannot expect that for
instance a pl/perl function wouldn't accidentally break things.  And
99.999% of the time nobody would notice ... it would just be that last
0.001% of people that would be screwed.
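
A tiny illustration of the hazard (the path is illustrative; any descriptor
referencing the file triggers it):

#include <fcntl.h>
#include <unistd.h>

static void
innocent_directory_scan(const char *lockfile)
{
	int			fd2 = open(lockfile, O_RDONLY);

	/* ... look at the file, decide it is uninteresting ... */
	close(fd2);		/* drops ALL fcntl locks this process held on the file */
}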

Still, this discussion has yielded a useful advance, which is that we
now see how we might safely make use of lock mechanisms that don't
inherit across fork().  We just need something less broken than fcntl().

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP Patch: Selective binary conversion of CSV file foreign tables

2012-06-26 Thread Etsuro Fujita
Hi Kaigai-san,

 -Original Message-
 From: Kohei KaiGai [mailto:kai...@kaigai.gr.jp]
 Sent: Tuesday, June 26, 2012 11:05 PM
 To: Etsuro Fujita
 Cc: Robert Haas; pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] WIP Patch: Selective binary conversion of CSV file
 foreign tables
 
 2012/6/26 Etsuro Fujita fujita.ets...@lab.ntt.co.jp:
  Hi Kaigai-san,
 
  -Original Message-
  From: Kohei KaiGai [mailto:kai...@kaigai.gr.jp]
  Sent: Monday, June 25, 2012 9:49 PM
  To: Etsuro Fujita
  Cc: Robert Haas; pgsql-hackers@postgresql.org
  Subject: Re: [HACKERS] WIP Patch: Selective binary conversion of CSV file
  foreign tables
 
  Fujita-san,
 
  The revised patch is almost good for me.
  Only point I noticed is the check for redundant or duplicated options I
  pointed
  out on the previous post.
  So, if you agree with the attached patch, I'd like to hand over this patch
 for
  committers.
 
  OK I agree with you.  Thanks for the revision!
 
  Besides the revision, I modified check_selective_binary_conversion() to run
  heap_close() in the whole-row-reference case.  Attached is an updated
version
 of
  the patch.
 
 Sorry, I overlooked this code path.

No, it's my fault.

 So, I'd like to take over this patch for committers.

Thanks,

Best regards,
Etsuro Fujita

 
 Thanks,
 
  Thanks.
 
  Best regards,
  Etsuro Fujita
 
  All I changed is:
   index 98bcb2f..0143d81 100644
   --- a/src/backend/commands/copy.c
   +++ b/src/backend/commands/copy.c
   @@ -122,6 +122,11 @@ typedef struct CopyStateData
   @@ -217,7 +221,7 @@
           }
    +      else if (strcmp(defel->defname, "convert_binary") == 0)
    +      {
   -+          if (cstate->convert_binary)
   ++          if (cstate->convert_selectively)
    +              ereport(ERROR,
    +                      (errcode(ERRCODE_SYNTAX_ERROR),
    +                       errmsg("conflicting or redundant options")));
 
  Thanks,
 
  2012/6/20 Etsuro Fujita fujita.ets...@lab.ntt.co.jp:
   Hi KaiGai-san,
  
   Thank you for the review.
  
   -Original Message-
   From: pgsql-hackers-ow...@postgresql.org
   [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Kohei KaiGai
   Sent: Wednesday, June 20, 2012 1:26 AM
   To: Etsuro Fujita
   Cc: Robert Haas; pgsql-hackers@postgresql.org
   Subject: Re: [HACKERS] WIP Patch: Selective binary conversion of CSV
   file foreign tables
  
   Hi Fujita-san,
  
   Could you rebase this patch towards the latest tree?
   It was unavailable to apply the patch cleanly.
  
   Sorry, I updated the patch.  Please find attached an updated version
   of the patch.
  
   I looked over the patch, then noticed a few points.
  
   At ProcessCopyOptions(), defel->arg can be NIL, isn't it?
   If so, cstate->convert_binary is not a suitable flag to check a
   redundant option. It seems to me cstate->convert_selectively is a more
   suitable flag to check it.
  
   +       else if (strcmp(defel->defname, "convert_binary") == 0)
   +       {
   +           if (cstate->convert_binary)
   +               ereport(ERROR,
   +                       (errcode(ERRCODE_SYNTAX_ERROR),
   +                        errmsg("conflicting or redundant options")));
   +           cstate->convert_selectively = true;
   +           if (defel->arg == NULL || IsA(defel->arg, List))
   +               cstate->convert_binary = (List *) defel->arg;
   +           else
   +               ereport(ERROR,
   +                       (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
   +                        errmsg("argument to option \"%s\" must be a list of column names",
   +                               defel->defname)));
   +       }
  
   Yes, defel->arg can be NIL.  defel->arg is a List structure listing
   all the columns needed to be converted to binary representation, which
   is NIL for the case where no columns are needed to be converted.  For
   example, defel->arg is NIL for SELECT COUNT(*).  In this case, while
   cstate->convert_selectively is set to true, no columns are converted at
   NextCopyFrom().  Most efficient case!  In short,
   cstate->convert_selectively represents whether to do selective binary
   conversion at NextCopyFrom(), and cstate->convert_binary represents all
   the columns to be converted at NextCopyFrom(), which can be NIL.
  
   At NextCopyFrom(), this routine computes default values if configured.
   In case when these values are not referenced, it might be possible to
   skip unnecessary calculations.
   Would it be possible to add logic to avoid constructing cstate->defmap
   for unreferenced columns at BeginCopyFrom()?
  
   I think that we don't need to add the above logic because file_fdw
   does
   BeginCopyFrom() with attnamelist = NIL, in which case, BeginCopyFrom()
   doesn't construct cstate->defmap at all.
  
   I fixed a bug plus some minor optimization in
   check_binary_conversion() that is renamed to
   check_selective_binary_conversion() in the updated version, and now
   file_fdw gives up selective binary conversion for the 

Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Robert Haas
On Tue, Jun 26, 2012 at 6:25 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Josh Berkus j...@agliodbs.com writes:
 So let's fix the 80% case with something we feel confident in, and then
 revisit the no-sysv interlock as a separate patch.  That way if we can't
 fix the interlock issues, we still have a reduced-shmem version of Postgres.

 Yes.  Insisting that we have the whole change in one patch is a good way
 to prevent any forward progress from happening.  As Alvaro noted, there
 are plenty of issues to resolve without trying to change the interlock
 mechanism at the same time.

So, here's a patch.  Instead of using POSIX shmem, I just took the
expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS
memory.  The sysv shm is still allocated, but it's just a copy of
PGShmemHeader; the real shared memory is the anonymous block.  This
won't work if EXEC_BACKEND is defined so it just falls back on
straight sysv shm in that case.

There are obviously some portability issues here - this is documented
not to work on Linux <= 2.4, but it's not clear whether it fails with
some suitable error code or just pretends to work and does the wrong
thing.  I tested that it does compile and work on both Linux 3.2.6 and
MacOS X 10.6.8.  And the comments probably need work and... who knows
what else is wrong.  But, thoughts?
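
For reference, the core of that approach reduced to a sketch (MAP_ANONYMOUS
is spelled MAP_ANON on some older systems):

#include <stddef.h>
#include <sys/mman.h>

static void *
create_anon_shmem(size_t size)
{
	void	   *ptr = mmap(NULL, size,
						   PROT_READ | PROT_WRITE,
						   MAP_SHARED | MAP_ANONYMOUS,
						   -1, 0);

	if (ptr == MAP_FAILED)
		return NULL;
	return ptr;					/* shared across fork(), lost across exec() */
}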

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


anonymous-shmem.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Optimizer Path Candidates difference in 9.1.3 and 9.2 beta1

2012-06-26 Thread Qi Huang

Hi, hackers

I modified the code in add_path() a bit so that all the query path 
candidates inside pathlist will not be removed and all new paths will be 
added into the pathlist, thus all path candidates are kept in pathlist. I 
then tested a four-relation query. In 9.1.3, I can see thousands of 
candidates in the final RelOptInfo, and some of them are even bushy trees. 
But in 9.2 beta1, which I forked from github, there are no such bushy trees 
and only about 50 join paths in total, which should match the requirement of 
the System R algo. Is there any modification regarding the System R algo in 
the new release? And is something wrong in the algo in 9.1.3?

Thanks

Best Regards
Huang Qi Victor
Computer Science of National University of Singapore

Re: [HACKERS] [PATCH] Lazy hashaggregate when no aggregation is needed

2012-06-26 Thread Etsuro Fujita
Hi,

 -Original Message-
 From: Robert Haas [mailto:robertmh...@gmail.com]
 Sent: Wednesday, June 27, 2012 5:09 AM
 To: Etsuro Fujita
 Cc: Ants Aasma; Jay Levitt; Tom Lane; PostgreSQL-development; Francois Deliege
 Subject: Re: [HACKERS] [PATCH] Lazy hashaggregate when no aggregation is
needed
 
 On Fri, Jun 22, 2012 at 10:12 AM, Robert Haas robertmh...@gmail.com wrote:
  On Tue, Jun 19, 2012 at 5:41 AM, Etsuro Fujita
  fujita.ets...@lab.ntt.co.jp wrote:
  I'm confused by this remark, because surely the query planner does it this
  way only if there's no LIMIT.  When there is a LIMIT, we choose based on
  the startup cost plus the estimated fraction of the total cost we expect
  to pay based on dividing the LIMIT by the overall row count estimate.  Or
  is this not what you're talking about?
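
A sketch of that fractional costing, with assumed field names rather than
the planner's exact code:

typedef double Cost;

static Cost
fractional_cost(Cost startup_cost, Cost total_cost,
				double limit_rows, double est_total_rows)
{
	double		fraction = limit_rows / est_total_rows;

	if (fraction > 1.0)
		fraction = 1.0;
	/* pay all of startup, plus the expected fraction of the run cost */
	return startup_cost + fraction * (total_cost - startup_cost);
}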
 
  I think that Ants is pointing the way of estimating costs in
  choose_hashed_grouping()/choose_hashed_distinct(), ie cost_agg() for
  cheapest_path + hashagg, where the costs are calculated based on the total
  cost only of cheapest_path.  I think that it might be good to do cost_agg()
  for the discussed case with the AGG_SORTED strategy, not the AGG_HASHED
  strategy.
 
  Well, Ants already made some adjustments to those functions; not sure
  if this means they need some more adjustment, but I don't see that
  there's a general problem with the costing algorithm around LIMIT.
 
 Ants, do you intend to update this patch for this CommitFest?  Or at
 all?  It seems nobody's too excited about this, so I'm not sure
 whether it makes sense for you to put more work on it.  But please
 advise as to your plans.

Please excuse my slow response; I would also like to know your plan.

Best regards,
Etsuro Fujita

 Thanks,
 
 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company
 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Posix Shared Mem patch

2012-06-26 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 So, here's a patch.  Instead of using POSIX shmem, I just took the
 expedient of using mmap() to map a block of MAP_SHARED|MAP_ANONYMOUS
 memory.  The sysv shm is still allocated, but it's just a copy of
 PGShmemHeader; the real shared memory is the anonymous block.  This
 won't work if EXEC_BACKEND is defined so it just falls back on
 straight sysv shm in that case.

Um.  I hadn't thought about the EXEC_BACKEND interaction, but that seems
like a bit of a showstopper.  I would not like to give up the ability
to debug EXEC_BACKEND mode on Unixen.

Would Posix shmem help with that at all?  Why did you choose not to
use the Posix API, anyway?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimizer Path Candidates difference in 9.1.3 and 9.2 beta1

2012-06-26 Thread Tom Lane
Qi Huang huangq...@hotmail.com writes:
 Hi, hackers
 
 I modified the code in add_path() a bit so that all the query path 
 candidates inside pathlist will not be removed and all new paths will be 
 added into the pathlist, thus all path candidates are kept in pathlist. I 
 then tested a four-relation query. In 9.1.3, I can see thousands of 
 candidates in the final RelOptInfo, and some of them are even bushy trees. 
 But in 9.2 beta1, which I forked from github, there are no such bushy trees 
 and only about 50 join paths in total, which should match the requirement 
 of the System R algo. Is there any modification regarding the System R 
 algo in the new release? And is something wrong in the algo in 9.1.3?

[ shrug... ]  When you're not showing us exactly what you did, it's hard
to answer that for sure.  But there is some prefiltering logic in
joinpath.c that you might have to lobotomize too if you want to keep
known-inferior join paths.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimizer Path Candidates difference in 9.1.3 and 9.2 beta1

2012-06-26 Thread Qi Huang


 [ shrug... ] When you're not showing us exactly what you did, it's hard
 to answer that for sure. But there is some prefiltering logic in
 joinpath.c that you might have to lobotomize too if you want to keep
 known-inferior join paths.
 
 regards, tom lane

Thanks, Tom.
Below is what I did to the code in add_path(). I commented out the section of 
remove_old, and also made sure it always accepts the new path. Then at 
make_one_rel(), after calling "rel = make_rel_from_joinlist(root, joinlist);", 
I added "pprint(rel)" on the next line. Checking the RelOptInfo printed out in 
the logfile, I can see some bushy trees in 9.1.3.

	/*
	 * Loop to check proposed new path against old paths.  Note it is possible
	 * for more than one old path to be tossed out because new_path dominates
	 * it.
	 *
	 * We can't use foreach here because the loop body may delete the current
	 * list cell.
	 */
	p1_prev = NULL;
	for (p1 = list_head(parent_rel->pathlist); p1 != NULL; p1 = p1_next)
	{
		...

		/*
		 * Remove current element from pathlist if dominated by new.
		 */
//		if (remove_old)
//		{
//			parent_rel->pathlist = list_delete_cell(parent_rel->pathlist,
//													p1, p1_prev);

			/*
			 * Delete the data pointed-to by the deleted cell, if possible
			 */
//			if (!IsA(old_path, IndexPath))
//				pfree(old_path);
			/* p1_prev does not advance */
//		}
//		else
//		{
			/* new belongs after this old path if it has cost >= old's */
			if (costcmp >= 0)
				insert_after = p1;
			/* p1_prev advances */
			p1_prev = p1;
//		}

		/*
		 * If we found an old path that dominates new_path, we can quit
		 * scanning the pathlist; we will not add new_path, and we assume
		 * new_path cannot dominate any other elements of the pathlist.
		 */
		if (!accept_new)
			break;
	}

//	if (accept_new)
//	{
		/* Accept the new path: insert it at proper place in pathlist */
		if (insert_after)
			lappend_cell(parent_rel->pathlist, insert_after, new_path);
		else
			parent_rel->pathlist = lcons(new_path, parent_rel->pathlist);
//	}
//	else
//	{
		/* Reject and recycle the new path */
//		if (!IsA(new_path, IndexPath))
//			pfree(new_path);
//	}


Best Regards
Huang Qi Victor
Computer Science of National University of Singapore
