Re: [HACKERS] Function and view to retrieve WAL receiver status

2015-12-14 Thread Gurjeet Singh
On Sun, Dec 13, 2015 at 10:15 PM, Michael Paquier <michael.paqu...@gmail.com
> wrote:

> On Mon, Dec 14, 2015 at 3:09 PM, Gurjeet Singh <singh.gurj...@gmail.com>
> wrote:
> > On Dec 13, 2015 9:56 PM, "Michael Paquier" <michael.paqu...@gmail.com>
> > wrote:
> >> If the node has no WAL receiver active, a tuple with NULL values is
> >> returned instead.
> >
> > IMO, in the absence of a WAL receiver the SRF (and the view) should not
> > return any rows.
>
> The whole point is to not use a SRF in this case: there is always at
> most one WAL receiver.
>

The function, maybe. But emitting an all-nulls row from a view seems
counter-intuitive, at least when looking at it in the context of a
relational database.
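
For illustration, the kind of definition I have in mind (a sketch only; the
SRF name and the pid column are assumptions on my part):

CREATE VIEW pg_stat_wal_receiver AS
    SELECT s.*
      FROM pg_stat_get_wal_receiver() s
     WHERE s.pid IS NOT NULL;

The function could keep returning its single all-NULLs tuple, while the view
would show zero rows when no WAL receiver is active.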

-- 
Gurjeet Singh http://gurjeet.singh.im/


Re: [HACKERS] Function and view to retrieve WAL receiver status

2015-12-13 Thread Gurjeet Singh
On Dec 13, 2015 9:56 PM, "Michael Paquier" 
wrote:

>
> If the node has no WAL receiver active, a tuple with NULL values is
> returned instead.

IMO, in the absence of a WAL receiver the SRF (and the view) should not
return any rows.


Re: [HACKERS] Limit GIST_MAX_SPLIT_PAGES to XLR_MAX_BLOCK_ID

2015-10-27 Thread Gurjeet Singh
(adding Heikki, since that macro was touched by his commit 04e298b8)

Does my previous description of the problem make sense, or am I fretting
over something that's a non-issue?

Best regards,


On Sun, Sep 20, 2015 at 7:03 PM, Gurjeet Singh <gurj...@singh.im> wrote:

> GIN code respects XLR_MAX_BLOCK_ID when
> calling XLogEnsureRecordSpace(), but it appears that the GiST code does not
> try to limit its block-id consumption to XLR_MAX_BLOCK_ID.
>
> GIST_MAX_SPLIT_PAGES is hard-coded at 75, but XLogEnsureRecordSpace()
> would reject a request for more than 32 block IDs (XLR_MAX_BLOCK_ID).
>
> The attached patch redefines GIST_MAX_SPLIT_PAGES so that, in case of a
> split, gistplacetopage() throws an error when the number of block IDs
> needed exceeds 32.
>
> I have used Min(75, XLR_MAX_BLOCK_ID) as the macro expansion, but I
> believe it can be set to plain XLR_MAX_BLOCK_ID.
>
>
> --
> Gurjeet Singh http://gurjeet.singh.im/
>



-- 
Gurjeet Singh http://gurjeet.singh.im/


[HACKERS] Limit GIST_MAX_SPLIT_PAGES to XLR_MAX_BLOCK_ID

2015-09-20 Thread Gurjeet Singh
GIN code respects XLR_MAX_BLOCK_ID when
calling XLogEnsureRecordSpace(), but it appears that the GiST code does not
try to limit its block-id consumption to XLR_MAX_BLOCK_ID.

GIST_MAX_SPLIT_PAGES is hard-coded at 75, but XLogEnsureRecordSpace()
would reject a request for more than 32 block IDs (XLR_MAX_BLOCK_ID).

The attached patch redefines GIST_MAX_SPLIT_PAGES so that, in case of a
split, gistplacetopage() throws an error when the number of block IDs
needed exceeds 32.

I have used Min(75, XLR_MAX_BLOCK_ID) as the macro expansion, but I believe
it can be set to plain XLR_MAX_BLOCK_ID.
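
To make that concrete, the redefinition amounts to something like the
following sketch (the real change is in the attached patch; treat this as
illustrative):

/*
 * Cap the number of pages a GiST split may produce at the number of block
 * references a single WAL record can carry; with this cap, the check in
 * gistplacetopage() raises an error instead of overrunning XLR_MAX_BLOCK_ID.
 */
#define GIST_MAX_SPLIT_PAGES	Min(75, XLR_MAX_BLOCK_ID)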


-- 
Gurjeet Singh http://gurjeet.singh.im/


gist_limit_blockid_to_XLR_MAX_BLOCK_ID.patch
Description: Binary data



Re: [HACKERS] replication slot restart_lsn initialization

2015-08-11 Thread Gurjeet Singh
On Mon, Aug 10, 2015 at 7:12 AM, Andres Freund and...@anarazel.de wrote:

 On 2015-07-07 09:42:54 -0700, Gurjeet Singh wrote:
   /*
  + * Grab and save an LSN value to prevent WAL recycling past that point.
  + */
  +void
  +ReplicationSlotRegisterRestartLSN()
  +{

 Didn't like that description and function name very much. What does
 'grabbing' mean here? Should probably mention that it works on the
 currently active slot and modifies it.


In your version, I don't see a comment that refers to the fact that it
works on the currently active (global variable) slot.



 It's now ReplicationSlotReserveWal()


Okay.



  + ReplicationSlot *slot = MyReplicationSlot;
  +
  + Assert(slot != NULL);
  + Assert(slot->data.restart_lsn == InvalidXLogRecPtr);
  +
  + /*
  +  * The replication slot mechanism is used to prevent removal of
 required
  +  * WAL. As there is no interlock between this and checkpoints
 required, WAL
  +  * segment could be removed before
 ReplicationSlotsComputeRequiredLSN() has
  +  * been called to prevent that. In the very unlikely case that
 this happens
  +  * we'll just retry.
  +  */

 You removed some punctuation in that sentence converting a sentence in
 bad english into one without the original meaning ;). See the attached
 for a new version.


Your version looks better.



  +/*
* Flush all replication slots to disk.
*
* This needn't actually be part of a checkpoint, but it's a convenient
  @@ -876,7 +942,7 @@ StartupReplicationSlots(void)
   }
 
   /* 
  - * Manipulation of ondisk state of replication slots
  + * Manipulation of on-disk state of replication slots
*
* NB: none of the routines below should take any notice whether a slot
 is the
* current one or not, that's all handled a layer above.
  diff --git a/src/backend/replication/slotfuncs.c
 b/src/backend/replication/slotfuncs.c
  index 9a2793f..01b376a 100644
  --- a/src/backend/replication/slotfuncs.c
  +++ b/src/backend/replication/slotfuncs.c
  @@ -40,6 +40,7 @@ Datum
   pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
   {
    Name	name = PG_GETARG_NAME(0);
  + bool	immediately_reserve = PG_GETARG_BOOL(1);
    Datum   values[2];
    bool	nulls[2];
TupleDesc   tupdesc;
  @@ -58,10 +59,28 @@ pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
/* acquire replication slot, this will check for conflicting names
 */
ReplicationSlotCreate(NameStr(*name), false, RS_PERSISTENT);
 
  - values[0] = NameGetDatum(MyReplicationSlot->data.name);
  + if (immediately_reserve)
  + {
  + /* Allocate restart-LSN, if the user asked for it */
  + ReplicationSlotRegisterRestartLSN();
  +
  + /* Write this slot to disk */
  + ReplicationSlotMarkDirty();
  + ReplicationSlotSave();
 
  - nulls[0] = false;
  - nulls[1] = true;
  + values[0] = NameGetDatum(MyReplicationSlot->data.name);
  + values[1] =
 LSNGetDatum(MyReplicationSlot->data.restart_lsn);
  +
  + nulls[0] = false;
  + nulls[1] = false;
  + }
  + else
  + {
  + values[0] = NameGetDatum(MyReplicationSlot->data.name);
  +
  + nulls[0] = false;
  + nulls[1] = true;
  + }

 I moved
 values[0] = NameGetDatum(MyReplicationSlot->data.name);
 nulls[0] = false;
 to outside the conditional block, there seems no reason to have it in
 there?


The assignment to values[0] is being done twice. We can do away with the
one in the else part of the if condition.
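
For reference, the new two-argument form would be used roughly like this
(slot names and output are illustrative only):

-- reserve WAL immediately; restart_lsn comes back non-NULL
SELECT * FROM pg_create_physical_replication_slot('my_slot', true);

-- previous behaviour: nothing reserved yet, restart_lsn is NULL
SELECT * FROM pg_create_physical_replication_slot('my_other_slot', false);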

Also, there was a typo in my patch [1] and it seems to have made it into
the commit: acronymLSN/. Tom seems to have just fixed it in
commit 750fc78b.

Best regards,

[1]: I alerted you to it in a personal email, but it probably fell through
the cracks.
-- 
Gurjeet Singh http://gurjeet.singh.im/


Re: [HACKERS] 64-bit XIDs again

2015-07-31 Thread Gurjeet Singh
On Jul 30, 2015 2:23 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Gavin Flower gavinflo...@archidevsys.co.nz writes:
  On 31/07/15 02:24, Heikki Linnakangas wrote:
  There is a big downside to expanding xmin/xmax to 64 bits: it takes
  space. More space means more memory needed for caching, more memory
  bandwidth, more I/O, etc.

  I think having a special case to save 32 bits per tuple would cause
  unnecessary complications, and the savings are minimal compared to the
  size of current modern storage devices and the typical memory used in
  serious database servers.

 I think the argument that the savings are minimal is pretty thin.
 It all depends on how wide your tables are --- but on a narrow table, say
 half a dozen ints, the current tuple size is 24 bytes header plus the same
 number of bytes of data.  We'd be going up to 32 bytes header which makes
 for a 16% increase in physical table size.  If your table is large,
 claiming that 16% doesn't hurt is just silly.

 But the elephant in the room is on-disk compatibility.  There is
 absolutely no way that we can just change xmin/xmax to 64 bits without a
 disk format break.  However, if we do something like what Heikki is
 suggesting, it's at least conceivable that we could convert incrementally
 (ie, if you find a page with the old header format, assume all tuples in
 it are part of epoch 0; and do not insert new tuples into it unless there
 is room to convert the header to new format ... but I'm not sure what we
 do about tuple deletion if the old page is totally full and we need to
 write an xmax that's past 4G).

Can we safely relegate the responsibility of tracking the per-block epoch
to a relation fork?


Re: [HACKERS] A huge debt of gratitude - Michael Stonebraker

2015-07-23 Thread Gurjeet Singh
On Jul 22, 2015 12:07 PM, Jolly Chen jo...@chenfamily.com wrote:

 Hey everyone,

 You have probably heard that Mike Stonebraker recently won the Turing
award.  A recording of his award lecture is available at:
 https://www.youtube.com/watch?v=BbGeKi6T6QI

 It is an entertaining talk overall. If you fast forward to about the 1:07
mark, he makes some comments about postgres.

 Here’s my rough transcription:

 The abstract data type system in postgres has been added to a lot of
relational database systems. It's kind of de facto table stakes for
relational databases these days, essentially intact.  That idea was really
a good one. It was mentioned in the citation for my Turing award winning.
However, serendipity played a huge role, which is, the biggest impact of
postgres by far came from two Berkeley students that I'll affectionately
call Grumpy and Sleepy.  They converted the academic postgres prototype
from QUEL to SQL in 1995. This was in parallel to the commercial activity.
And then a pick-up team of volunteers, none of whom have anything to do
with me or Berkeley, have been shepherding that open source system ever
since 1995. The system that you get off the web for postgres comes from
this pick-up team.  It is open source at its best and I want to just
mention that I have nothing to do with that and that collection of folks we
all owe a huge debt of gratitude to, because they have robustize that code
line and made it so it really works.”

 Thank you all so much for your hard work over the last twenty years!!

 Affectionately,

 Grumpy

Thank you! And a big thanks to the stewards of the project.

Sincerely,


Re: [HACKERS] ALTER TABLE .. ADD PRIMARY KEY .. USING INDEX has dump-restore hazard

2015-07-22 Thread Gurjeet Singh
On Wed, Jul 22, 2015 at 7:34 AM, Robert Haas robertmh...@gmail.com wrote:

 On Tue, Jul 21, 2015 at 8:34 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
  On Wed, Jul 22, 2015 at 1:23 AM, Robert Haas robertmh...@gmail.com
 wrote:
  Notice that the collation specifier is gone.  Oops.
 
  As it is not possible to specify directly a constraint for a PRIMARY
  KEY expression, what about switching dumpConstraint to have it use
  first a CREATE INDEX query with the collation and then use ALTER TABLE
  to attach the constraint to it? I am noticing that we already fetch
  the index definition in indxinfo via pg_get_indexdef. Thoughts?

 I guess the question on my mind is: why is it that you can't do this
 directly when creating a primary key, but you can do it when turning
 an existing index into a primary key?

 If there's a good reason for prohibiting doing this when creating a
 primary key, then presumably it shouldn't be allowed when turning an
 index into a primary key either.  If there's not, then why not extend
 the PRIMARY KEY syntax to allow it?


+1. I think in the short term we can treat this as a bug, and fix it by
disallowing an index with attributes that cannot be present in an index
created by a PRIMARY KEY constraint. The collation attribute on one of the
keys may be just one of many such attributes.

In the long term, we may want to allow collation in primary key, but that
will be a feature ideally suited for a major version release.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/


Re: [HACKERS] ALTER TABLE .. ADD PRIMARY KEY .. USING INDEX has dump-restore hazard

2015-07-22 Thread Gurjeet Singh
On Tue, Jul 21, 2015 at 9:23 AM, Robert Haas robertmh...@gmail.com wrote:

 rhaas=# create unique index on foo (a collate C);
 CREATE INDEX
 rhaas=# alter table foo add primary key using index foo_a_idx;
 ALTER TABLE



 Now dump and restore this database.  Then:



 Notice that the collation specifier is gone.  Oops.


The intent of the 'USING INDEX' feature was to work around the problem that
PRIMARY KEY indexes cannot be reindexed concurrently to reduce bloat.

Considering that the feature doesn't work [1] (at least as shown in example
in the docs [2]) if there are any foreign keys referencing the table,
there's scope for improvement.

There was a proposal to support reindexing the primary key index, but I am
not sure where that stands. If that's already in, or soon to be in core, we
may put a deprecation notice around USING INDEX. OTOH, if reindexing primary
keys is too difficult, we may want to support index replacement via a
top-level ALTER TABLE subcommand that works in CONCURRENT mode, and not
recommend the method shown in the docs (i.e. deprecate the USING INDEX
syntax).

[1]: The DROP CONSTRAINT part of the command fails if there are any FKeys
pointing to this table.
[2]: http://www.postgresql.org/docs/9.4/static/sql-altertable.html
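
For context, the docs [2] recommend roughly the following sequence to swap in
a freshly built index (sketch from memory; table/index names are just
placeholders):

CREATE UNIQUE INDEX CONCURRENTLY dist_id_temp_idx ON distributors (dist_id);
ALTER TABLE distributors
    DROP CONSTRAINT distributors_pkey,
    ADD CONSTRAINT distributors_pkey PRIMARY KEY USING INDEX dist_id_temp_idx;

It is the DROP CONSTRAINT step that fails when foreign keys reference the
table.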

-- 
Gurjeet Singh http://gurjeet.singh.im/


Re: [HACKERS] [PATCH] Function to get size of asynchronous notification queue

2015-07-15 Thread Gurjeet Singh
On Thu, Jun 25, 2015 at 8:43 PM, Brendan Jurd dire...@gmail.com wrote:

 On Fri, 26 Jun 2015 at 06:03 Gurjeet Singh gurj...@singh.im wrote:


 s/proportion/fraction/


 I think of these as synonymous -- do you have any particular reason to
 prefer fraction?  I don't feel strongly about it either way, so I'm quite
 happy to go with fraction if folks find that more expressive.


It just feels better to me in this context.

If the number of times used in Postgres code is any measure, 'fraction'
wins hands down: proportion : 33, fraction: 620.

I don't feel strongly about it, either. I can leave it up to the committer
to decide.





 + * The caller must hold (at least) shared AysncQueueLock.

 A possibly better wording: The caller must hold AysncQueueLock in (at
 least) shared mode.


 Yes, that is more accurate.


OK.





 Unnecessary whitespace changes in pg_proc.h for existing functions.


 I did group the asynchronous notification functions together, which seemed
 reasonable as there are now three of them, and changed the tabbing between
 the function name and namespace ID to match, as is done elsewhere in
 pg_proc.h.  I think those changes improve readability, but again I don't
 feel strongly about it.


Fair enough.



 +DESCR(get the current usage of the asynchronous notification queue);

 A possibly better wording: get the fraction of the asynchronous
 notification queue currently in use


 I have no objections to your wording.


OK. Please send a new patch with the changes you agree to, and I can mark
it ready for committer.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/


Re: [HACKERS] replication slot restart_lsn initialization

2015-07-07 Thread Gurjeet Singh
On Tue, Jul 7, 2015 at 6:49 AM, Andres Freund and...@anarazel.de wrote:

 On 2015-07-07 06:41:55 -0700, Gurjeet Singh wrote:
  There seems to be a misplaced not operator '!' in that if statement, as
  well. That sucks :( The MacOS gcc binary is actually clang, and its output
  is too noisy [1], which is probably why I might have missed the warning,
  if any.

 Which version of clang is it? With newer ones you can individually
 disable nearly all of the warnings. I regularly use clang, and most of
 the time it compiles master without warnings.



 $ gcc --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr
--with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn)
Target: x86_64-apple-darwin14.4.0
Thread model: posix

I see that -Wno-parentheses-equality might help, but I am not looking
forward to maintaining OS-specific flags for clang-that-pretends-to-be-gcc
in my shell scripts [2].

[2] https://github.com/gurjeet/pgd
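
For illustration, the kind of special-casing I'd rather avoid (a hypothetical
snippet, not something I'm proposing):

# pass the extra flag only when the 'gcc' on PATH is really clang
if gcc --version 2>/dev/null | grep -qi clang; then
    EXTRA_CFLAGS="-Wno-parentheses-equality"
fi
./configure CFLAGS="-O0 $EXTRA_CFLAGS"

Not hard, but it's one more compiler quirk to carry around in the scripts.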

Attached is a patch that introduces SlotIsPhysical/SlotIsLogical macros.
This patch, if accepted, goes on top of the v4 patch.
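
For the record, the macros are essentially one-liners keyed off the slot's
database field (a sketch; the actual definitions are in the attached patch):

#define SlotIsPhysical(slot)	((slot)->data.database == InvalidOid)
#define SlotIsLogical(slot)	((slot)->data.database != InvalidOid)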


  pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
 {
   Name	name = PG_GETARG_NAME(0);
+ bool	activate = PG_GETARG_BOOL(1);
  
   Don't like 'activate' much as a name. How about 'immediately_reserve'?
  
 
  I still like 'activate', but okay. How about 'reserve_immediately'
 instead?

 Activate is just utterly wrong. A slot isn't inactive before. And
 'active' already is used for slots that are currently in use
 (i.e. connected to).

  Also, do you want this name change just in the C code, or in the pg_proc
  and docs as well?

 All.


Patch v4 attached.

On a side note, I see that the pg_create_*_replication_slot() functions do
not behave transactionally; that is, rolling back a transaction does not
undo the slot creation. Should we prevent these, and corresponding drop
functions, from being called inside an explicit transaction?
PreventTransactionChain() is geared towards serving just the utility
commands. Do we protect against calling such functions in a transaction
block, or from user functions? How?

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/


physical_repl_slot_activate_restart_lsn.v4.patch
Description: Binary data


slot_is_physical_or_logical_macros.patch
Description: Binary data



Re: [HACKERS] replication slot restart_lsn initialization

2015-07-07 Thread Gurjeet Singh
On Tue, Jul 7, 2015 at 4:59 AM, Andres Freund and...@anarazel.de wrote:

 On 2015-06-10 13:13:41 -0700, Gurjeet Singh wrote:
  + /*
  +  * Log an xid snapshot for logical replication.
 It's not needed for
  +  * physical slots as it is done in BGWriter on a
 regular basis.
  +  */
  + if (!slot->data.persistency == RS_PERSISTENT)
  + {
  + /* make sure we have enough information to
 start */
  + flushptr = LogStandbySnapshot();
  +
  + /* and make sure it's fsynced to disk */
  + XLogFlush(flushptr);
  + }

 Huh? The slot->data.persistency == RS_PERSISTENT check seems pretty much
 entirely random to me.


There seems to be a misplaced not operator '!' in that if statement, as
well. That sucks :( The MacOS gcc binary is actually clang, and its output
is too noisy [1], which is probably why I might have missed the warning, if any.

[1]: I am particularly annoyed by these:

note: remove extraneous parentheses around the comparison to silence this
warning
note: use '=' to turn this equality comparison into an assignment

E.g.: if (((opaque)->btpo_next == 0))

I'll see what I can do about these.


 I mean physical slots can (and normally are)
 persistent as well?  Instead check whether it's a database-specific slot.


Agreed, the slot being database-specific is the right indicator.



 The reasoning why it's not helpful for physical slots is wrong. The
 point is that a standby snapshot at that location will simply not help
 physical replication, it's only going to read ones after the location at
 which the base backup starts (i.e. the location from the backup label).

   pg_create_physical_replication_slot(PG_FUNCTION_ARGS)
   {
    Name	name = PG_GETARG_NAME(0);
  + bool	activate = PG_GETARG_BOOL(1);

 Don't like 'activate' much as a name. How about 'immediately_reserve'?


I still like 'activate', but okay. How about 'reserve_immediately' instead?

Also, do you want this name change just in the C code, or in the pg_proc
and docs as well?



 Other than that it's looking good to me.


Will send a new patch after your feedback on the 'activate' renaming.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/


Re: [HACKERS] More logging for autovacuum

2015-07-07 Thread Gurjeet Singh
On Tue, Jul 7, 2015 at 11:20 AM, Sawada Masahiko sawada.m...@gmail.com
wrote:

 On Sat, Jul 4, 2015 at 2:30 PM, Amit Kapila amit.kapil...@gmail.com
 wrote:
  On Thu, Jul 2, 2015 at 4:41 AM, Gurjeet Singh gurj...@singh.im wrote:
  log_min_messages acts as a single gate for everything headed for the
  server logs; controls for per-background process logging do not exist.
 If
  one wants to see DEBUG/INFO messages for just the Autovacuum (or
  checkpointer, bgwriter, etc.), they have to set log_min_messages to that
  level, but the result would be a lot of clutter from other processes to
  grovel through, to see the messages of interest.
 
 
  I think that will be quite helpful.  During the patch development of
  parallel sequential scan, it was quite helpful to see the LOG messages
  of bgworkers, however one of the recent commits (91118f1a) have
  changed those to DEBUG1, now if I have to enable all DEBUG1
  messages, then what I need will be difficult to find in all the log
  messages.
  Having control of separate logging for background tasks will serve such
  a purpose.
 
  The facilities don't yet exist, but it'd be nice if such parameters when
  unset (ie NULL) pick up the value of log_min_messages. So by default,
 the
  users will get the same behaviour as today, but can choose to tweak per
  background-process logging when needed.
 
 
  Will this proposal allow user to see all the messages by all the
 background
  workers or will it be per background task.  Example in some cases user
  might want to see the messages by all bgworkers and in some cases one
  might want to see just messages by AutoVacuum and it's workers.


I want the logging to be controlled per background process, so that user
can pick and choose how much detail they want from each process. In absence
of such a setting for a background process, the global log_min_messages
should control the logging.



  I think here designing User Interface is an important work if others also
  agree
  that such an option is useful, could you please elaborate your proposal?

 +1


I would like this feature to be as simple as the current log_min_messages;
currently a superuser can do ALTER USER X SET log_min_messages = Y, and
control logging for specific users or databases or a combination of those.

In the same vein, setting autovacuum.log_min_messages in postgresql.conf,
or a corresponding ALTER SYSTEM should be enough to use a different setting
of log_min_messages inside autovacuum launcher and its worker processes.

I am using autovacuum process as an example, but the same is applicable to
other background processes as well. E.g. checkpointer.log_min_messages.
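
A postgresql.conf sketch of what I have in mind (the qualified parameter
names are tentative, of course):

log_min_messages = warning                # global default, same as today
autovacuum.log_min_messages = debug1      # proposed: autovacuum launcher/workers only
checkpointer.log_min_messages = info      # proposed: checkpointer only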

As for the implementation details: upon startup, and after every reload, the
autovacuum launcher process would look for autovacuum.log_min_messages. If it
is defined, the launcher sets the in-code global variable log_min_messages to
that value; if it is not defined in any of the conf files, the global variable
is assigned whatever the GUC parameter log_min_messages is set to.
DefineCustomXXVariable() seems incapable of supporting this behaviour, so we
might have to invent a new one.

This will allow us to keep the current behaviour as default when
autovacuum.log_min_messages is not defined, and provide finer control over
autovacuum's logging by defining this variable, when needed.

Same goes for Checkpointer, BGWriter, WALWriter, and BGWorkers. BTW, the
facility we invent needn't be limited to log_min_messages, but it's just
what we need now.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/


Re: [HACKERS] More logging for autovacuum

2015-07-01 Thread Gurjeet Singh
On Fri, Aug 17, 2007 at 3:14 PM, Alvaro Herrera alvhe...@commandprompt.com
wrote:

 Gregory Stark wrote:
 
  I'm having trouble following what's going on with autovacuum and I'm
 finding
  the existing logging insufficient. In particular that it's only logging
 vacuum
  runs *after* the vacuum finishes makes it hard to see what vacuums are
 running
  at any given time. Also, I want to see what is making autovacuum decide
 to
  forgo vacuuming a table and the log with that information is at DEBUG2.

 So did this idea go anywhere?


Assuming the thread stopped here, I'd like to rekindle the proposal.

log_min_messages acts as a single gate for everything headed for the server
logs; controls for per-background process logging do not exist. If one
wants to see DEBUG/INFO messages for just the Autovacuum (or checkpointer,
bgwriter, etc.), they have to set log_min_messages to that level, but the
result would be a lot of clutter from other processes to grovel through, to
see the messages of interest.

The facilities don't yet exist, but it'd be nice if such parameters when
unset (ie NULL) pick up the value of log_min_messages. So by default, the
users will get the same behaviour as today, but can choose to tweak per
background-process logging when needed.

Absent such a feature, one hack is to set the desired log_min_messages
value in the conf file, send SIGHUP to just the process of interest, and
then revert the conf file; but it's a hack.
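
Roughly, the hack looks like this (a sketch only; the way to find the right
pid will vary):

cp postgresql.conf postgresql.conf.orig
sed -i 's/^#*log_min_messages.*/log_min_messages = debug2/' postgresql.conf
kill -HUP $(pgrep -f 'autovacuum launcher')    # reload only this process
mv postgresql.conf.orig postgresql.conf        # revert for everyone else

It can work, but it's clearly not something to recommend.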

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/


Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-06-27 Thread Gurjeet Singh
On Fri, Jun 26, 2015 at 9:45 PM, Amit Kapila amit.kapil...@gmail.com
wrote:

 Sometime back on one of the PostgreSQL blog [1], there was
 discussion about the performance of drop/truncate table for
 large values of shared_buffers and it seems that as the value
 of shared_buffers increase the performance of drop/truncate
 table becomes worse.  I think those are not often used operations,
 so it never became priority to look into improving them if possible.

 I have looked into it and found that the main reason for such
 a behaviour is that for those operations it traverses whole
 shared_buffers and it seems to me that we don't need that
 especially for not-so-big tables.  We can optimize that path
 by looking into buff mapping table for the pages that exist in
 shared_buffers for the case when table size is less than some
 threshold (say 25%) of shared buffers.

 Attached patch implements the above idea and I found that
 performance doesn't dip much with patch even with large value
 of shared_buffers.  I have also attached script and sql file used
 to take performance data.


+1 for the effort to improve this.

With your technique added, there are 3 possible ways the search can happen:
a) scan NBuffers and scan the list of relations, b) scan NBuffers and bsearch
the list of relations, and c) scan the list of relations and then invalidate
the blocks of each fork from shared buffers. Would it be worth finding one
technique that can serve decently from the low end of shared_buffers to the
high end?

On patch:

There are multiple naming styles being used in DropForkSpecificBuffers();
my_name and myName. Given this is a new function, it'd help to be
consistent.

s/blk_count/blockNum/

s/new//, for eg. newTag, because there's no corresponding tag/oldTag
variable in the function.

s/blocksToDel/blocksToDrop/. BTW, we never pass anything other than the
total number of blocks in the fork, so we may as well call it just
numBlocks.

s/traverse_buf_freelist/scan_shared_buffers/, because when it is true, we
scan the whole shared_buffers.

s/rel_count/rel_num/

Reduce indentation/tab in header-comments of DropForkSpecificBuffers(). But
I see there's precedent in neighboring functions, so this may be okay.

Doing pfree() of num_blocks, num_fsm_blocks and num_vm_blocks in one place
(instead of two, at different indentation levels) would help readability.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/


Re: [HACKERS] [PATCH] Function to get size of asynchronous notification queue

2015-06-25 Thread Gurjeet Singh
Patch reviewed following the instructions on
https://wiki.postgresql.org/wiki/Reviewing_a_Patch

# Submission review
- Is the patch in a patch format which has context?
Yes. Note to other reviewers: `git diff --patience` yields a patch better
suited for readability.

- Does it apply cleanly to the current git master?
Yes.

- Does it include reasonable tests, necessary doc patches, etc?
Doc patch - Yes.
Tests - Yes.

# Usability review
- Does the patch actually implement the feature?
Yes.

- Do we want that?
Yes; see the discussion in mailing list.

- Do we already have it?
No.

- Does it follow SQL spec, or the community-agreed behavior?
Yes. It seems to implement the behavior agreed upon in the mailing list.

- Does it include pg_dump support (if applicable)?
N/A

- Are there dangers?
None that I could spot.

- Have all the bases been covered?
There’s room for an additional test which tests for non-zero return value.

# Feature test
- Does the feature work as advertised?
Yes. Build configured with '--enable-debug --enable-cassert CFLAGS=-O0'.
With a slightly aggressive notifications-in-a-loop script I was able to see
a non-zero return value:

Session 1:
listen ggg;
begin;

Session 2:
while sleep 0.1; do echo 'notify ggg; select
pg_notification_queue_usage();' ; done | psql

- Are there corner cases the author has failed to consider?
No.

- Are there any assertion failures or crashes?
The patch exposes an already available function to the SQL interface, and
rearranges code, so it doesn’t look like the patch can induce an assertion
or a crash.

- Performance review
Not evaluated, since it’s not a performance patch, and it doesn’t seem to
impact any performance-critical code path, either.


Additional notes:

Patch updates the docs of another function (pg_listening_channels), but the
update is an improvement so we can let it slide :)

+   proportion of the queue that is currently occupied by pending
notifications.

s/proportion/fraction/

+ * The caller must hold (at least) shared AysncQueueLock.

A possibly better wording: The caller must hold AysncQueueLock in (at
least) shared mode.

Unnecessary whitespace changes in pg_proc.h for existing functions.

+DESCR(get the current usage of the asynchronous notification queue);

A possibly better wording: get the fraction of the asynchronous
notification queue currently in use


On Thu, Jun 18, 2015 at 10:47 AM, Merlin Moncure mmonc...@gmail.com wrote:

 On Wed, Jun 17, 2015 at 6:15 PM, Brendan Jurd dire...@gmail.com wrote:
  Posting v2 of the patch, incorporating some helpful suggestions from
 Merlin.

 Based on perfunctory scan of the code, I think this is gonna make it,
 unless you draw some objections based on lack of necessity.

 merlin






-- 
Gurjeet Singh http://gurjeet.singh.im/


Re: [HACKERS] [PATCH] Function to get size of asynchronous notification queue

2015-06-17 Thread Gurjeet Singh
I don't see this in the CF app; can you please add it there?

Best regards,

On Wed, Jun 17, 2015 at 3:31 AM, Brendan Jurd dire...@gmail.com wrote:

 Hello hackers,

 I present a patch to add a new built-in function
 pg_notify_queue_saturation().

 The purpose of the function is to allow users to monitor the health of
 their notification queue.  In certain cases, a client connection listening
 for notifications might get stuck inside a transaction, and this would
 cause the queue to keep filling up, until finally it reaches capacity and
 further attempts to NOTIFY error out.

 The current documentation under LISTEN explains this possible gotcha, but
 doesn't really suggest a useful way to address it, except to mention that
 warnings will show up in the log once you get to 50% saturation of the
 queue.  Unless you happen to be eyeballing the logs when it happens, that's
 not a huge help.  The choice of 50% as a threshold is also very much
 arbitrary, and by the time you hit 50% the problem has likely been going on
 for quite a while.  If you want your nagios (or whatever) to, say, alert you
 when the queue goes over 5% or 1%, your options are limited and awkward.

 The patch has almost no new code.  It makes use of the existing logic for
 the 50% warning.  I simply refactored that logic into a separate function
 asyncQueueSaturation, and then added pg_notify_queue_saturation to make
 that available in SQL.

 I am not convinced that pg_notify_queue_saturation is the best possible
 name for this function, and am very much open to other suggestions.

 The patch includes documentation, a regression test and an isolation test.

 Cheers,
 BJ






-- 
Gurjeet Singh http://gurjeet.singh.im/


Re: [HACKERS] replication slot restart_lsn initialization

2015-06-10 Thread Gurjeet Singh
On Wed, Jun 10, 2015 at 8:36 AM, Andres Freund and...@anarazel.de wrote:

 On 2015-06-10 08:24:23 -0700, Gurjeet Singh wrote:
  On Wed, Jun 10, 2015 at 8:07 AM, Andres Freund and...@anarazel.de
 wrote:
   That doesn't look right to me. Why is this code logging a standby
   snapshot for physical slots?
  
 
  This is the new function I referred to above. The logging of the snapshot
  is in 'not RecoveryInProgress()' case, meaning it's running in primary
 and
  not in a standby.

 It's still done uselessly for physical slots?


I missed the comments on LogStandbySnapshot(). Attached is the new patch
which does the snapshot logging only for a logical replication slot.

I am in the process of writing up a doc patch, and will submit that as well
in a short while.

I have removed a few #include statements which seemed unnecessary. These
changes are not relevant to the patch, so I can readily revert those parts
if deemed necessary.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/


physical_repl_slot_activate_restart_lsn.v2.patch
Description: Binary data



[HACKERS] Interval arithmetic should emit interval in canonical format

2014-07-15 Thread Gurjeet Singh
It's hard to argue that the current behaviour is wrong, but it's worth a try.

First, I'd appreciate the official reasons why Postgres prefers to
keep interval values in non-canonical form, like '1 day -23:37:00'
instead of '00:23:00'. I understand it has something to do with a
year/month/day not being exactly 365 days/30 days/24 hours, and/or
operations involving interval and 'timestamp with time zone'. But
since it's not explicitly spelled out in the docs or in the code (at
least I didn't find it in the obvious places), I'm seeking an
explanation here. I understand that the answers may obviate any change
in behaviour I am requesting below.

The interval arithmetic operations may also yield non-canonical
values, and IMHO the 'interval op interval' or 'interval op scalar'
expressions should yield an interval in canonical form. For example:

postgres=# select '6 days 00:16:00'::interval - '5 days
23:53:00'::interval as result;
 result
-----------------
 1 day -23:37:00

postgres=# select '6 days 00:16:00'::interval + '5 days
23:53:00'::interval as result;
  result
------------------
 11 days 24:09:00

I cannot think of a use case where the above results are any better
than emitting '00:23:00' and '12 days 00:09:00', respectively.

We may not be able to turn every interval datum into canonical form,
but at least the intervals produced as a result of interval operators
can be converted to canonical form to reduce surprises for users. I
may even go as far as proposing rounding up 24 hours into a day, but
not rounding up days into months or months into years.
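
For what it's worth, justify_interval() appears to produce the form I'd
expect (from memory, so treat this as a sketch):

postgres=# select justify_interval('1 day -23:37:00'::interval);
 justify_interval
------------------
 00:23:00

But asking every application or ORM to wrap interval expressions in
justify_interval() seems like the wrong place to fix this.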

I was surprised by the presence of non-canonical form of interval in a
sorted-by-interval result set. The intervals were computed within the
query, using 'timestamp without time zone' values in a table.

# select ...

 result

...
00:23:00
00:23:00
1 day -23:37:00
00:23:00
00:22:00
...

The ordering above demonstrates that Postgres _does_ consider '1 day
-23:37:00' == '00:23:00', so it seems pointless to confuse the user
by showing two different representations of the same datum. This also
increases the code complexity required in applications/ORMs to parse
interval data's text representation.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB : www.EnterpriseDB.com : The Enterprise PostgreSQL Company




[HACKERS] Re: Patch to send transaction commit/rollback stats to the stats collector unconditionally.

2014-07-02 Thread Gurjeet Singh
On Wed, Jul 2, 2014 at 3:49 PM, Kevin Grittner kgri...@ymail.com wrote:
 In preparing to push the patch, I noticed I hadn't responded to this:

 Gurjeet Singh gurj...@singh.im wrote:
 Kevin Grittner kgri...@ymail.com wrote:
 I have reviewed this patch, and think we should do what the patch
 is trying to do, but I don't think the submitted patch would
 actually work.

 Just curious: why do you think it won't work?

 Because you didn't touch this part of the function:

 /*
  * Send partial messages.  If force is true, make sure that any pending
  * xact commit/abort gets counted, even if no table stats to send.
  */
 if (regular_msg.m_nentries > 0 ||
 (force && (pgStatXactCommit > 0 || pgStatXactRollback > 0)))
 pgstat_send_tabstat(regular_msg);

 The statistics would not actually be sent unless a table had been
 accessed or it was forced because the connection was closing.

I sure did! In fact that was the one and only line of code that was
changed. It effectively bypassed the 'force' check if there were any
transaction stats to report.

 /*
- * Send partial messages.  If force is true, make sure that any pending
- * xact commit/abort gets counted, even if no table stats to send.
+ * Send partial messages.  Make sure that any pending xact commit/abort gets
+ * counted, even if no table stats to send.
  */
 if (regular_msg.m_nentries > 0 ||
-    (force && (pgStatXactCommit > 0 || pgStatXactRollback > 0)))
+    force || (pgStatXactCommit > 0 || pgStatXactRollback > 0))
     pgstat_send_tabstat(regular_msg);
 if (shared_msg.m_nentries > 0)
     pgstat_send_tabstat(shared_msg);

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB : www.EnterpriseDB.com : The Enterprise PostgreSQL Company




[HACKERS] Re: Patch to send transaction commit/rollback stats to the stats collector unconditionally.

2014-07-01 Thread Gurjeet Singh
On Sun, Jun 29, 2014 at 11:58 AM, Kevin Grittner kgri...@ymail.com wrote:
 I have reviewed this patch, and think we should do what the patch
 is trying to do, but I don't think the submitted patch would
 actually work.

Just curious: why do you think it won't work? Although the discussion
is a bit old, I'm sure I would've tested the patch before
submitting.

 I have attached a suggested patch which I think
 would work.  Gurjeet, could you take a look at it?

The patch, when considered together with Tom's suggestion upthread,
looks good to me.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB : www.EnterpriseDB.com : The Enterprise PostgreSQL Company




[HACKERS] Re: Patch to send transaction commit/rollback stats to the stats collector unconditionally.

2014-07-01 Thread Gurjeet Singh
On Tue, Jul 1, 2014 at 10:05 AM, Kevin Grittner kgri...@ymail.com wrote:
 Tom Lane t...@sss.pgh.pa.us wrote:

 If we're going to do it like this, then I think the force flag
 should be considered to do nothing except override the clock
 check, which probably means it shouldn't be tested in the initial
 if() at all.

 That makes sense, and is easily done.

Attached is the patch to save you a few key strokes :)

 The only question left is
 how far back to take the patch.  I'm inclined to only apply it to
 master and 9.4.  Does anyone think otherwise?

Considering this a bug-fix, I'd vote for it to be applied to all
supported releases. But since it may cause an unforeseen performance
penalty, I think it should be applied only as far back as the
introduction of the PGSTAT_STAT_INTERVAL throttle.

The throttle was implemented in 641912b, which AFAICT was part of 8.3.
So I guess all the supported releases it is.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB : www.EnterpriseDB.com :  The Enterprise PostgreSQL Company
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3ab1428..c7f41a5 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -753,7 +753,8 @@ pgstat_report_stat(bool force)
 
/* Don't expend a clock check if nothing to do */
	if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
-   !have_function_stats && !force)
+   pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+   !have_function_stats)
return;
 
/*
@@ -817,11 +818,11 @@ pgstat_report_stat(bool force)
}
 
/*
-* Send partial messages.  If force is true, make sure that any pending
-* xact commit/abort gets counted, even if no table stats to send.
+* Send partial messages.  Make sure that any pending xact commit/abort
+* gets counted, even if there are no table stats to send.
 */
	if (regular_msg.m_nentries > 0 ||
-   (force && (pgStatXactCommit > 0 || pgStatXactRollback > 0)))
+   pgStatXactCommit > 0 || pgStatXactRollback > 0)
	pgstat_send_tabstat(regular_msg);
	if (shared_msg.m_nentries > 0)
pgstat_send_tabstat(shared_msg);



Re: [HACKERS] Proposing pg_hibernate

2014-07-01 Thread Gurjeet Singh
On Sun, Jun 15, 2014 at 2:51 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Thu, Jun 12, 2014 at 9:31 AM, Gurjeet Singh gurj...@singh.im wrote:

 I don't have intimate knowledge of recovery but I think the above
 assessment of recovery's operations holds true. If you still think
 this is a concern, can you please provide a bit firmer example, using
 which I can visualize the problem you're talking about?

 Okay, let me try with an example:

 Assume No. of shared buffers = 5
 Before Crash:

 1. Pages in shared buffers numbered 3, 4, 5 have write operations
 performed on them, followed by lot of reads which makes their
 usage_count as 5.
 2. Write operation happened on pages in shared buffers numbered
 1, 2. Usage_count of these buffers is 1.
 3. Now one read operation needs some different pages and evict
 pages in shared buffers numbered 1 and 2 and read the required
 pages, so buffers 1 and 2 will have usage count as 1.
 4. At this moment shutdown initiated.
 5. Bgwriter saved just buffers 1 and 2 and crashed.

 After Crash:

 1. Recovery will read in pages on which operations happened in
 step-1 and 2 above.
 2. Buffer loader (pg_hibernator) will load buffers on which operations
 happened in step-3, so here it might needs to evict buffers which are
 corresponding to buffers of step-1 before crash.  So what this
 essentially means is that pg_hibernator can lead to eviction of more
 useful pages.

Granted, you have demonstrated that the blocks restored by
pg_hibernator can cause eviction of loaded-by-recovery blocks. But,
one can argue that pg_hibernator brought the shared-buffer contents
to a state that is much closer to the pre-shutdown state than the
recovery would have restored them to. I think this supports the case
for pg_hibernator, that is, it is doing what it is supposed to do:
restore shared-buffers to pre-shutdown state.

I agree that there's uncertainty as to which buffers will be cleared,
and hence which blocks will be evicted. So pg_hibernator may cause
eviction of blocks that had higher usage count before the shutdown,
because they may have a lower/same usage count as other blocks'
buffers after recovery. There's not much that can be done for that,
because usage count information is not saved anywhere on disk, and I
don't think it's worth saving just for pg_hibernator's sake.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB : www.EnterpriseDB.com : The Enterprise PostgreSQL Company




Re: [HACKERS] Proposing pg_hibernate

2014-07-01 Thread Gurjeet Singh
On Sat, Jun 7, 2014 at 6:48 AM, Cédric Villemain ced...@2ndquadrant.com wrote:
 On Monday, 3 February 2014 at 19:18:54, Gurjeet Singh wrote:

 Possible enhancements:
 - Ability to save/restore only specific databases.
 - Control how many BlockReaders are active at a time; to avoid I/O
 storms.

FWIW, this has been implemented. By default, only one database is
restored at a time.

- Be smart about lowered shared_buffers across the restart.
 - Different modes of reading like pg_prewarm does.
 - Include PgFincore functionality, at least for Linux platforms.

 Please note that pgfincore is working on any system where PostgreSQL
 prefetch is working, exactly like pg_prewarm. This includes linux, BSD and
 many unix-like. It *is not* limited to linux.

 I never had a single request for windows, but windows does provides an API
 for that too (however I have no windows offhand to test).

 Another side note is that currently BSD (at least freeBSD) have a more
 advanced mincore() syscall than linux and offers a better analysis (dirty
 status is known) and they implemented posix_fadvise...

I have never used pgfincore, and have analyzed it solely based on the
examples provided, second-hand info, and some code reading, so the
following may be wrong; feel free to correct.

The UI of pgfincore suggests that to save a snapshot of an object,
pgfincore reads all the segments of the object and queries the OS
cache. This may take a lot of time on big databases. If this is done
at shutdown time, the time to finish shutdown will be proportional to
the size of the database, rather than being proportional to the amount
of data files in OS cache.


 There is a previous thread about that hibernation feature. Mitsuru IWASAKI
 did a patch, and it triggers some interesting discussions.

 Some notes in this thread are outdated now, but it's worth having a look at
 it:

 http://www.postgresql.org/message-id/20110504.231048.113741617.iwas...@jp.freebsd.org

 https://commitfest.postgresql.org/action/patch_view?id=549

Thanks for sharing these. I agree with Greg's observations there that
the shared-buffers are becoming an increasingly smaller subset of the RAM
available on modern machines. But until it can be done in a
platform-independent way I doubt it will ever be accepted in Postgres.
Even when it's accepted, it would have to be off by default because of
the slow shutdown mentioned above.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB : www.EnterpriseDB.com : The Enterprise PostgreSQL Company




Re: [HACKERS] review: Non-recursive processing of AND/OR lists

2014-06-23 Thread Gurjeet Singh
Thanks!

On Mon, Jun 16, 2014 at 3:58 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I wrote:
 Gurjeet Singh gurj...@singh.im writes:
 I tried to eliminate the 'pending' list, but I don't see a way around it.
 We need temporary storage somewhere to store the branches encountered on
 the right; in recursion case the call stack was serving that purpose.

 I still think we should fix this in the grammar, rather than introducing
 complicated logic to try to get rid of the recursion later.  For example,
 as attached.

 I went looking for (and found) some additional obsoleted comments, and
 convinced myself that ruleutils.c is okay as-is, and pushed this.

 regards, tom lane



-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com




Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-23 Thread Gurjeet Singh
On Wed, Jun 18, 2014 at 8:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Gurjeet Singh gurj...@singh.im writes:

 Please find attached the patch. It includes the doc changes as well.

 Applied with some editorialization.

Thanks!

Would it be possible to include this in 9.4 as well?

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com




Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-23 Thread Gurjeet Singh
On Mon, Jun 23, 2014 at 11:52 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Gurjeet Singh gurj...@singh.im writes:
 would it be possible to include this in 9.4 as well?

 While this is clearly an improvement over what we had before, it's
 impossible to argue that it's a bug fix, and we are way past the 9.4
 feature freeze deadline.  In particular, packagers who've already done
 their 9.4 development work might be blindsided by us slipping this into
 9.4 release.  So while I wouldn't have a problem with putting this change
 into 9.4 from a technical standpoint, it's hard to argue that it'd meet
 project norms from a development-process standpoint.

While I'd love to reduce the number of future installations without
this fix in place, I respect the decision to honor project policy. At
the same time, this change does not break anything. It introduces new
environment variables which change the behaviour, but behaves the old
way in the absence of those variables. So I guess it's not changing
the default behaviour in an incompatible way.

BTW, does the project publish the feature-freeze deadlines and other
dates somewhere (apart from on this list)? I was looking the other day
and didn't find any reference. Either the commitfest app or the
'Developer' section of the wiki [1] seem to be good candidates for
this kind of information.

[1]: https://wiki.postgresql.org/wiki/Development_information

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com




Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-23 Thread Gurjeet Singh
On Mon, Jun 23, 2014 at 12:49 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Gurjeet Singh gurj...@singh.im writes:
 While I'd love to reduce the number of future installations without
 this fix in place, I respect the decision to honor project policy. At
 the same time, this change does not break anything. It introduces new
 environment variables which change the behaviour, but behaves the old
 way in the absence of those variables.

 Uh, no, it doesn't.  We removed the dependence on -DLINUX_OOM_SCORE_ADJ.
 If a packager is expecting that to still work in 9.4, he's going to be
 unpleasantly surprised, because the system will silently fail to do what
 he's expecting: it will run all the backend processes at no-OOM-kill
 priority, which is likely to be bad.

True. I didn't think from a packager's perspective.

 It is possible to make packages that will work either way, along the lines
 of the advice I sent to the Red Hat guys:
 https://bugzilla.redhat.com/show_bug.cgi?id=1110969

 but I think it's a bit late in the cycle to be telling packagers to do
 that for 9.4.

Understandable.

 BTW, does the project publish the feature-freeze deadlines and other
 dates somewhere (apart from on this list).

 It's usually in the dev meeting minutes
 https://wiki.postgresql.org/wiki/PgCon_2014_Developer_Meeting#9.5_Schedule

Thanks. But it doesn't spell out the specific dates, or even the month
of feature-freeze.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com




Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-12 Thread Gurjeet Singh
On Tue, Jun 10, 2014 at 12:05 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 If we're going to do this, I would say that we should also take the
 opportunity to get out from under the question of which kernel API
 we're dealing with.  So perhaps a design like this:

 1. If the environment variable PG_OOM_ADJUST_FILE is defined, it's
 the name of a file to write something into after forking.  The typical
 value would be /proc/self/oom_score_adj or /proc/self/oom_adj.
 If not set, nothing happens.

 2. If the environment variable PG_OOM_ADJUST_VALUE is defined, that's
 the string we write, otherwise we write 0.

Please find attached the patch. It includes the doc changes as well.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com
diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index 9fadef5..4492a1d 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -1284,8 +1284,15 @@ echo -1000 > /proc/self/oom_score_adj
 in the postmaster's startup script just before invoking the postmaster.
 Note that this action must be done as root, or it will have no effect;
 so a root-owned startup script is the easiest place to do it.  If you
-do this, you may also wish to build <productname>PostgreSQL</>
-with <literal>-DLINUX_OOM_SCORE_ADJ=0</> added to <varname>CPPFLAGS</>.
+do this, you may also wish to export these environment variables
+<programlisting>
+PG_OOM_ADJUST_FILE=oom_score_adj
+PG_OOM_ADJUST_VALUE=0
+
+export PG_OOM_ADJUST_FILE
+export PG_OOM_ADJUST_VALUE
+</programlisting>
+in the startup script, before invoking the postmaster.
 That will cause postmaster child processes to run with the normal
 <varname>oom_score_adj</> value of zero, so that the OOM killer can still
 target them at need.
@@ -1296,8 +1303,7 @@ echo -1000 > /proc/self/oom_score_adj
 but may have a previous version of the same functionality called
 <filename>/proc/self/oom_adj</>.  This works the same except the disable
 value is <literal>-17</> not <literal>-1000</>.  The corresponding
-build flag for <productname>PostgreSQL</> is
-<literal>-DLINUX_OOM_ADJ=0</>.
+value for <envar>PG_OOM_ADJUST_FILE</> should be <varname>oom_adj</>.
 </para>

 <note>
diff --git a/src/backend/postmaster/fork_process.c b/src/backend/postmaster/fork_process.c
index f6df2de..21559af 100644
--- a/src/backend/postmaster/fork_process.c
+++ b/src/backend/postmaster/fork_process.c
@@ -22,6 +22,13 @@
 #endif
 
 #ifndef WIN32
+
+#ifdef __linux__
+static bool	oom_env_checked = false;
+static char	oom_adj_file[MAXPGPATH];
+static int	oom_adj_value = 0;
+#endif	/* __linux__ */
+
 /*
  * Wrapper for fork(). Return values are the same as those for fork():
  * -1 if the fork failed, 0 in the child process, and the PID of the
@@ -78,13 +85,38 @@ fork_process(void)
 		 * LINUX_OOM_SCORE_ADJ #defined to 0, or to some other value that you
 		 * want child processes to adopt here.
 		 */
-#ifdef LINUX_OOM_SCORE_ADJ
+#ifdef __linux__
+		if (!oom_env_checked)
+		{
+			char *env;
+
+			oom_env_checked = true;
+
+			env = getenv("PG_OOM_ADJUST_FILE");
+
+			/* Don't allow a path separator in file name */
+			if (env && !strchr(env, '/') && !strchr(env, '\\'))
+			{
+				snprintf(oom_adj_file, MAXPGPATH, "/proc/self/%s", env);
+
+				env = getenv("PG_OOM_ADJUST_VALUE");
+				if (env)
+				{
+					oom_adj_value = atoi(env);
+				}
+				else
+					oom_adj_value = 0;
+			}
+			else
+				oom_adj_file[0] = '\0';
+		}
+
 		{
 			/*
 			 * Use open() not stdio, to ensure we control the open flags. Some
 			 * Linux security environments reject anything but O_WRONLY.
 			 */
-			int			fd = open("/proc/self/oom_score_adj", O_WRONLY, 0);
+			int			fd = open(oom_adj_file, O_WRONLY, 0);
 
 			/* We ignore all errors */
 			if (fd >= 0)
 			{
 				char		buf[16];
 				int			rc;
 
-				snprintf(buf, sizeof(buf), "%d\n", LINUX_OOM_SCORE_ADJ);
+				snprintf(buf, sizeof(buf), "%d\n", oom_adj_value);
 				rc = write(fd, buf, strlen(buf));
 				(void) rc;
 				close(fd);
 			}
 		}
-#endif   /* LINUX_OOM_SCORE_ADJ */
 
-		/*
-		 * Older Linux kernels have oom_adj not oom_score_adj.  This works
-		 * similarly except with a different scale of adjustment values. If
-		 * it's necessary to build Postgres to work with either API, you can
-		 * define both LINUX_OOM_SCORE_ADJ and LINUX_OOM_ADJ.
-		 */
-#ifdef LINUX_OOM_ADJ
-		{
-			/*
-			 * Use open() not stdio, to ensure we control the open flags. Some
-			 * Linux security environments reject anything but O_WRONLY.
-			 */
-			int			fd = open("/proc/self/oom_adj", O_WRONLY, 0);
-
-			/* We ignore all errors */
-			if (fd >= 0)
-			{
-char		buf[16];
-int			rc;
-
-snprintf(buf, sizeof(buf), "%d\n", LINUX_OOM_ADJ);
-rc = write(fd, buf, strlen(buf));
-(void) rc;
-close(fd);
-			}
-		}
-#endif   /* LINUX_OOM_ADJ */
+#endif   /* __linux__ */
 
 		/*
 		 * Make sure processes do not share OpenSSL randomness state.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Proposing pg_hibernate

2014-06-11 Thread Gurjeet Singh
On Wed, Jun 11, 2014 at 12:25 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Wed, Jun 11, 2014 at 7:59 AM, Gurjeet Singh gurj...@singh.im wrote:
 On Sun, Jun 8, 2014 at 3:24 AM, Amit Kapila amit.kapil...@gmail.com
 wrote:
  Yeap, but if it crashes before writing checkpoint record, it will lead
  to
  recovery which is what we are considering.

 Good point.

 In case of such recovery, the recovery process will read in the blocks
 that were recently modified, and were possibly still in shared-buffers
 when Checkpointer crashed. So after recovery finishes, the
 BlockReaders will be invoked (because save-files were successfully
 written before the crash), and they would request the same blocks to
 be restored. Most likely, those blocks would already be in
 shared-buffers, hence no cause of concern regarding BlockReaders
 evicting buffers populated by recovery.

 Not necessarily because after crash, recovery has to start from
 previous checkpoint, so it might not perform operations on same
 pages as are saved by buffer saver.

Granted, the recovery may not start that way (that is, reading in
blocks that were in shared-buffers when shutdown was initiated), but
it sure would end that way. Towards the end of recovery, the blocks
it'd read back in are highly likely to be the ones that were present
in shared-buffers at the time of shutdown. By the end of recovery,
either (a) blocks read in at the beginning of recovery are evicted by
later operations of recovery, or (b) they are still present in
shared-buffers. So the blocks requested by the BlockReaders are highly
likely to be already in shared-buffers at the end of recovery, because
these are the same blocks that were dirty (and hence recorded in WAL)
just before shutdown time.

I guess what I am trying to say is that the blocks read in by the
BlockReaders will be a superset of those read in by the recovery
process. At the time of shutdown/saving-buffers, the shared-buffers
may have contained dirty and clean buffers. WAL contains the info of
which blocks were dirtied. Recovery will read back the blocks that
were dirty, to replay the WAL, and since the BlockReaders are started
_after_ recovery finishes, the BlockReaders will effectively read in
only those blocks that are not already read-in by the recovery.

I am not yet convinced, at least in this case, that Postgres
Hibernator would restore blocks that can cause eviction of buffers
restored by recovery.

I don't have intimate knowledge of recovery but I think the above
assessment of recovery's operations holds true. If you still think
this is a concern, can you please provide a bit firm example using
which I can visualize the problem you're talking about.

 Also as the file saved by
 buffer saver can be a file which contains only partial list of
 buffers which were in shared buffer's, it becomes more likely that
 in such cases it can override the buffers populated by recovery.

I beg to differ. As described above, the blocks read-in by the
BlockReader will not evict the recovery-restored blocks. The
save-files being written partially does not change that.

 Now as pg_hibernator doesn't give any preference to usage_count while
 saving buffer's, it can also evict the buffers populated by recovery
 with some lower used pages of previous run.

The case we're discussing (checkpointer/BufferSaver/some-other-process
crash during a smart/fast shutdown) should occur rarely in practice.
Although Postgres Hibernator is not yet proven to do the wrong thing
in this case, I hope you'd agree that BlockReaders evicting buffers
populated by the recovery process is not catastrophic at all, merely
inconvenient from a performance perspective. Also, the impact is only on
the initial performance immediately after startup, since application
queries will re-prime the shared-buffers with whatever buffers they
need.

 I think Block Readers will remove file only after reading and populating
 buffers from it

Correct.

 and that's the reason I mentioned that it can lead to doing
 both recovery as well as load buffers based on file saved by buffer
 saver.

I am not sure I completely understand the implication here, but I
think the above description of why recovery followed by the BlockReaders
is not a concern may cover it.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposing pg_hibernate

2014-06-11 Thread Gurjeet Singh
On Wed, Jun 11, 2014 at 10:56 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jun 10, 2014 at 10:03 PM, Gurjeet Singh gurj...@singh.im wrote:
 And it's probably accepted by now that such a behaviour is not
 catastrophic, merely inconvenient.

 I think the whole argument for having pg_hibernator is that getting
 the block cache properly initialized is important.  If it's not
 important, then we don't need pg_hibernator in the first place.  But
 if it is important, then I think not loading unrelated blocks into
 shared_buffers is also important.

I was constructing a contrived scenario, something that would rarely
happen in reality. I feel that the benefits of this feature greatly
outweigh the minor performance loss caused in such an unlikely scenario.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Gurjeet Singh
On Mon, Sep 19, 2011 at 10:18 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Peter Eisentraut pete...@gmx.net writes:
 On sön, 2011-09-18 at 12:21 -0400, Tom Lane wrote:
 But having said that, it wouldn't be very hard to arrange things so that
 if you did have both symbols defined, the code would only attempt to
 write oom_adj if it had failed to write oom_score_adj; which is about as
 close as you're likely to get to a kernel version test for this.

 Why is this feature not a run-time configuration variable or at least a
 configure option?  It's awfully well hidden now.  I doubt a lot of
 people are using this even though they might wish to.

 See the thread in which the feature was designed originally:
 http://archives.postgresql.org/pgsql-hackers/2010-01/msg00170.php

 The key point is that to get useful behavior, you need cooperation
 between both a root-privileged startup script and the PG executable.
 That tends to throw the problem into the domain of packagers, more
 than end users, and definitely puts a big crimp in the idea that
 run-time configuration of just half of the behavior could be helpful.
 So far, no Linux packagers have complained that the design is inadequate
 (a position that I also hold when wearing my red fedora) so I do not
 feel a need to complicate it further.

Startup scripts are not solely in the domain of packagers. End users
can also be expected to develop/edit their own startup scripts.

Providing it as a GUC would have given end users both the pieces, but
with a compile-time option they have only one half of the solution,
unless they compile their own binaries, which forces them into
being packagers.

I am not alone in feeling that if Postgres wishes to provide control
over the child backends' oom_score_adj, it should be a GUC parameter
rather than a compile-time option. Yesterday a customer wanted to
leverage this and couldn't, because they refuse to maintain their own
fork of the Postgres code.

Please find attached a patch to turn it into a GUC, #ifdef'd by the __linux__ macro.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com
diff --git a/src/backend/postmaster/fork_process.c b/src/backend/postmaster/fork_process.c
index f6df2de..7b9bc38 100644
--- a/src/backend/postmaster/fork_process.c
+++ b/src/backend/postmaster/fork_process.c
@@ -22,6 +22,11 @@
 #endif
 
 #ifndef WIN32
+
+#ifdef __linux__
+int linux_oom_score_adj = 0;
+#endif
+
 /*
  * Wrapper for fork(). Return values are the same as those for fork():
  * -1 if the fork failed, 0 in the child process, and the PID of the
@@ -78,7 +83,7 @@ fork_process(void)
 		 * LINUX_OOM_SCORE_ADJ #defined to 0, or to some other value that you
 		 * want child processes to adopt here.
 		 */
-#ifdef LINUX_OOM_SCORE_ADJ
+#ifdef __linux__
 		{
 			/*
 			 * Use open() not stdio, to ensure we control the open flags. Some
@@ -92,13 +97,12 @@ fork_process(void)
 char		buf[16];
 int			rc;
 
-snprintf(buf, sizeof(buf), "%d\n", LINUX_OOM_SCORE_ADJ);
+snprintf(buf, sizeof(buf), "%d\n", linux_oom_score_adj);
 rc = write(fd, buf, strlen(buf));
 (void) rc;
 close(fd);
 			}
 		}
-#endif   /* LINUX_OOM_SCORE_ADJ */
 
 		/*
 		 * Older Linux kernels have oom_adj not oom_score_adj.  This works
@@ -106,7 +110,6 @@ fork_process(void)
 		 * it's necessary to build Postgres to work with either API, you can
 		 * define both LINUX_OOM_SCORE_ADJ and LINUX_OOM_ADJ.
 		 */
-#ifdef LINUX_OOM_ADJ
 		{
 			/*
 			 * Use open() not stdio, to ensure we control the open flags. Some
@@ -120,13 +123,13 @@ fork_process(void)
 char		buf[16];
 int			rc;
 
-snprintf(buf, sizeof(buf), "%d\n", LINUX_OOM_ADJ);
+snprintf(buf, sizeof(buf), "%d\n", linux_oom_score_adj);
 rc = write(fd, buf, strlen(buf));
 (void) rc;
 close(fd);
 			}
 		}
-#endif   /* LINUX_OOM_ADJ */
+#endif   /* __linux__ */
 
 		/*
 		 * Make sure processes do not share OpenSSL randomness state.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1d094f0..8713abb 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -128,6 +128,7 @@ extern bool synchronize_seqscans;
 extern char *SSLCipherSuites;
 extern char *SSLECDHCurve;
 extern bool SSLPreferServerCiphers;
+extern int	linux_oom_score_adj;
 
 #ifdef TRACE_SORT
 extern bool trace_sort;
@@ -2554,6 +2555,18 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+#ifdef __linux__
+	{
+		{"linux_oom_score_adj", PGC_POSTMASTER, RESOURCES_KERNEL,
+			gettext_noop("Sets the oom_score_adj parameter of postmaster's child processes."),
+			NULL,
+		},
+		linux_oom_score_adj,
+		0, -1000, 1000,
+		NULL, NULL, NULL
+	},
+#endif /* __linux__ */
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Gurjeet Singh
On Tue, Jun 10, 2014 at 10:02 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Gurjeet Singh gurj...@singh.im writes:
 Startup scripts are not solely in the domain of packagers. End users
 can also be expected to develop/edit their own startup scripts.

 Providing it as a GUC would have given end users both the pieces, but
 with a compile-time option they have only one half of the solution;
 except if they go compile their own binaries, which forces them into
 being packagers.

 I don't find that this argument holds any water at all.  Anyone who's
 developing their own start script can be expected to manage recompiling
 Postgres.

I respectfully disagree. Writing and managing init/start scripts
requires a much different set of expertise than compiling and managing
builds of critical software like a database product.

I would consider adding `echo -1000 > /proc/self/oom_score_adj` to a
start script as a deployment specific tweak, and not 'developing own
start script'.

 Extra GUCs do not have zero cost, especially not ones that are as
 complicated-to-explain as this would be.

 I would also argue that there's a security issue with making it a GUC.
 A non-root DBA should not have the authority to decide whether or not
 postmaster child processes run with nonzero OOM adjustment; that decision
 properly belongs to whoever has authorized the root-owned startup script
 to change the adjustment in the first place.  So seeing this as two
 independent pieces is not only wrong but dangerous.

From experiments last night, I see that a child process' oom_score_adj
is capped by the setting of the parent process that forks it. So I don't
think it's a security risk from that perspective.

I'll retest and provide the exact findings.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Gurjeet Singh
On Tue, Jun 10, 2014 at 10:32 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-06-10 07:56:01 -0400, Gurjeet Singh wrote:
 Providing it as a GUC would have given end users both the pieces, but
 with a compile-time option they have only one half of the solution;
 except if they go compile their own binaries, which forces them into
 being packagers.

 I am not alone in feeling that if Postgres wishes to provide a control
 over child backend's oom_score_adj, it should be a GUC parameter
 rather than a compile-time option. Yesterday a customer wanted to
 leverage this and couldn't because they refuse to maintain their own
 fork of Postgres code.

 Independent of the rest of the discussion, I think there's one more
 point: Trying to keep your system stable by *increasing* the priority of
 normal backends is a bad idea. If you system gets into OOM land you need
 to fix that, not whack who gets killed around.
 The reason it makes sense to increase the priority of the postmaster is
 that that *does* increase the stability by cleaning up resources and
 restarting everything.

As it stands right now, a user can decrease the likelihood of the
Postmaster being killed by adjusting the start script, but that
decreases the likelihood of all the child processes too, making the
Postmaster just as likely a kill-candidate as any backend, maybe even
more likely because it's the parent.

This patch gives the user a control to let the backends' likelihood of
being killed be different/higher than that of the postmaster.

The users were already capable of doing that, but were required to
custom-compile Postgres to get the benefits. This patch just removes
that requirement.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Gurjeet Singh
 , ...
1835 -100 
/home/gurjeet/dev/pgdbuilds/oom_guc/db/bin/postgres-D/home/gurjeet/dev/pgdbuilds/oom_guc/db/data
1837 -100 postgres: checkpointer process
...

# Lower checkpointer's badness, and try to lower it below postmaster's
$ echo -101 > /proc/1837/oom_score_adj
bash: echo: write error: Permission denied

$ for p in $(echo $(pgserverPIDList)| cut -d , ...
1835 -100 
/home/gurjeet/dev/pgdbuilds/oom_guc/db/bin/postgres-D/home/gurjeet/dev/pgdbuilds/oom_guc/db/data
1837 -100 postgres: checkpointer process
...

# As root, force child process' badness to be lower than postmaster's
$ sudo echo -101 > /proc/1837/oom_score_adj
[sudo] password for gurjeet:

$ for p in $(echo $(pgserverPIDList)| cut -d , ...
1835 -100 
/home/gurjeet/dev/pgdbuilds/oom_guc/db/bin/postgres-D/home/gurjeet/dev/pgdbuilds/oom_guc/db/data
1837 -101 postgres: checkpointer process
...

-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Gurjeet Singh
On Tue, Jun 10, 2014 at 11:24 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-06-10 11:14:43 -0400, Tom Lane wrote:
 Sure, but what's that have to do with this?  Any Red Hat or PGDG RPM will
 come with this code already enabled in the build, so there is no need for
 anyone to have a GUC to play around with the behavior.

 That's imo a fair point. Unless I misunderstand things Gurjeet picked
 the topic up again because he wants to increase the priority of the
 children. Is that correct Gurjeet?

Yes. A DBA would like to prevent the postmaster from being killed by
the OOM killer, but let the child processes still be subject to the OOM
killer's whim.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Gurjeet Singh
On Tue, Jun 10, 2014 at 11:52 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 and...@anarazel.de (Andres Freund) writes:
 On 2014-06-10 11:20:28 -0400, Tom Lane wrote:
 Maybe I'm mistaken, but I thought once the fork_process code has reset our
 process's setting to zero it's not possible to lower it again (without
 privileges we'd not have).

 No, doesn't look that way. It's possible to reset it to the value set at
 process start. So unless we introduce double forks for every backend
 start it can be reset by ordinary processes.

 That's kind of annoying --- I wonder why they went to the trouble of doing
 that?  But anyway, it's probably not worth the cost of double-forking to
 prevent it.  I still say though that this is not a reason to make it as
 easy as change-a-GUC to break the intended behavior.

 Robert's idea of having the start script set an environment variable to
 control the OOM adjustment reset seems like it would satisfy my concern.

I'm fine with this solution. Should this be a constant 0, or be
configurable based on the environment variable's value?

-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] /proc/self/oom_adj is deprecated in newer Linux kernels

2014-06-10 Thread Gurjeet Singh
On Tue, Jun 10, 2014 at 1:13 PM, David G Johnston
david.g.johns...@gmail.com wrote:
 Gurjeet Singh-4 wrote
 So the argument that this GUC is a security concern, can be ignored.
 Root user (one with control of start script) still controls the lowest
 badness setting of all Postgres processes. If done at fork_process
 time, the child process simply inherits parent's badness setting.

 The counter point here is that the postmaster can be set to no kill and

Only the root user can do that, since he/she has control over the
start script. All that the DBA could do with a GUC is to set backends'
badness worse than postmaster's, but not better.

 the >= condition allows for children to achieve the same while it is our
 explicit intent that the children be strictly > parent.

I don't think anyone argued for that behaviour.

 To that end, should the adjustment value be provided as an offset to the
 postmasters instead of an absolute value - and disallow = zero offset
 values in the process?

Seems unnecessary, given current knowledge.

 I get and generally agree with the environment variable proposal and it's
 stated goal to restrict whom can makes changes. But how much less cost does
 an environment variable have than a GUC if one GUC argument is still its
 maintenance overhead?

Having it as a GUC would have meant that two entities are required to
get the configuration right: one who controls start scripts, and the
other who controls GUC settings.

With the environment variable approach, a root user alone can control
the behaviour like so in start script:

echo -200 > /proc/self/oom_score_adj
export PG_OOM_ADJUST_FILE=oom_score_adj
export PG_OOM_ADJUST_VALUE=-100

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposing pg_hibernate

2014-06-10 Thread Gurjeet Singh
On Tue, Jun 10, 2014 at 12:02 PM, Robert Haas robertmh...@gmail.com wrote:
 If recovery has been running for a long time, then restoring
 buffers from some save file created before that is probably a bad
 idea, regardless of whether the buffers already loaded were read in by
 recovery itself or by queries running on the system.  But if you're
 saying that doesn't happen, then there's no problem there.

Normally, it won't happen. There's one case I can think of, and it has
to coincide with a small window of time for such a thing to happen.

Consider this:
.) A database is shut down, which creates the save-files in
$PGDATA/pg_hibernator/.
.) The database is restarted.
.) BlockReaders begin to read and restore the disk blocks into buffers.
.) Before the BlockReaders could finish*, a copy of the database is
taken (rsync/cp/FS-snapshot/etc.)
    This causes the save-files to be present in the copy, because
the BlockReaders haven't deleted them yet.
* (The BlockReaders ideally finish their task in first few minutes
after first of them is started.)
.) The copy of the database is used to restore and erect a warm-standby.
.) The warm-standby starts replaying logs from WAL archive/stream.
.) Some time (hours/weeks/months) later, the warm-standby is promoted
to be a master.
.) It starts the Postgres Hibernator, which sees save-files in
$PGDATA/pg_hibernator/ and launches BlockReaders.

At this point, the BlockReaders will restore the blocks that were
present in the original DB's shared-buffers at the time of shutdown. So,
this would fetch blocks into shared-buffers that may be completely
unrelated to the blocks recently operated on by the recovery process.

And it's probably accepted by now that such a behaviour is not
catastrophic, merely inconvenient.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposing pg_hibernate

2014-06-10 Thread Gurjeet Singh
On Sun, Jun 8, 2014 at 3:24 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Fri, Jun 6, 2014 at 5:31 PM, Gurjeet Singh gurj...@singh.im wrote:
 On Thu, Jun 5, 2014 at 11:32 PM, Amit Kapila amit.kapil...@gmail.com
 wrote:

  Buffer saver process itself can crash while saving or restoring
  buffers.

 True. That may lead to partial list of buffers being saved. And the
 code in Reader process tries hard to read only valid data, and punts
 at the first sight of data that doesn't make sense or on ERROR raised
 from Postgres API call.

 Inspite of Reader Process trying hard, I think we should ensure by
 some other means that file saved by buffer saver is valid (may be
 first write in tmp file and then rename it or something else).

I see no harm in the current approach, since even if the file is partially
written on shutdown, or if it is corrupted due to hardware corruption,
the worst that can happen is that the BlockReaders will try to restore,
and possibly succeed in restoring, a wrong block to shared-buffers.

I am okay with your approach of first writing to a temp file, if
others see an advantage of doing this and insist on it.

  IIUC on shutdown request, postmaster will send signal to BG Saver
  and BG Saver will save the buffers and then postmaster will send
  signal to checkpointer to shutdown.  So before writing Checkpoint
  record, BG Saver can crash (it might have saved half the buffers)

 Case handled as described above.

  or may BG saver saves buffers, but checkpointer crashes (due to
  power outage or any such thing).

 Checkpointer process' crash seems to be irrelevant to Postgres
 Hibernator's  workings.

 Yeap, but if it crashes before writing checkpoint record, it will lead to
 recovery which is what we are considering.

Good point.

In case of such recovery, the recovery process will read in the blocks
that were recently modified, and were possibly still in shared-buffers
when Checkpointer crashed. So after recovery finishes, the
BlockReaders will be invoked (because save-files were successfully
written before the crash), and they would request the same blocks to
be restored. Most likely, those blocks would already be in
shared-buffers, hence no cause of concern regarding BlockReaders
evicting buffers populated by recovery.

 I think you are trying to argue the wording in my claim save-files
 are created only on clean shutdowns; not on a crash or immediate
 shutdown, by implying that a crash may occur at any time during and
 after the BufferSaver processing. I agree the wording can be improved.

 Not only wording, but in your above mail Case 2 and 1b would need to
 load buffer's and perform recovery as well, so we need to decide which
 one to give preference.

In the cases you mention, 1b and 2, ideally there will be no
save-files because the server either (1b) was still running, or (2)
crashed.

If there were any save-files present during the previous startup (the
one that happened before (1b) hot-backup or (2) crash) of the server,
they would have been removed by the BlockReaders soon after the
startup.

 So If you agree that we should have consideration for recovery data
 along with saved files data, then I think we have below options to
 consider:

I don't think any of the options you mention need any consideration,
because recovery and the buffer-restore process don't seem to be in
conflict with each other; not enough to be a concern, IMHO.

Thanks and best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Using Index-only scans to speed up count(*)

2014-06-07 Thread Gurjeet Singh
While reading [1] in the context of Postgres Hibernator, I see that
Mitsuru mentioned one of the ways another RDBMS allows count(*) to be
driven by an index.

 'select /*+ INDEX(emp emp_pk) */ count(*) from emp;' to load index blocks

I am not sure if the Postgres planner already allows this, but it would
be great if the planner considered driving a count(*) query using a
non-partial index, in the hope that it turns into an index-only scan,
and hence returns the count(*) result faster. The non-partial index may
not necessarily be the primary key index; it can be chosen purely based
on size, favouring smaller indexes.

This may alleviate some of the concerns of people migrating
applications from other DBMSs that perform count(*) in the blink of an
eye.

[1]: 
http://www.postgresql.org/message-id/20110507.08.83883502.iwas...@jp.freebsd.org

Best regards,

PS: Please note that I am not proposing to add support for the
optimizer hint embedded in Mitsuru's query.
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using Index-only scans to speed up count(*)

2014-06-07 Thread Gurjeet Singh
On Sat, Jun 7, 2014 at 8:56 AM, Cédric Villemain ced...@2ndquadrant.com wrote:
 Le samedi 7 juin 2014 08:35:27 Gurjeet Singh a écrit :

 PS: Please note that I am not proposing to add support for the
 optimizer hint embedded in Mitsuru's query.

 :-)

Even though I (sometimes) favor hints, and developed the optimizer
hints feature in EDB (PPAS), I know how much Postgres **hates** [1]
optimizer hints :) So just trying to ward off potential flamewar-ish
comments.

[1]: http://wiki.postgresql.org/wiki/Todo#Features_We_Do_Not_Want

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposing pg_hibernate

2014-06-06 Thread Gurjeet Singh
On Thu, Jun 5, 2014 at 11:32 PM, Amit Kapila amit.kapil...@gmail.com wrote:

 Another thing is don't you want to handle SIGQUIT signal in bg saver?

I think bgworker_quickdie registered in StartBackgroundWorker() serves
the purpose just fine.
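
To spell that out, a background worker normally only installs its own
SIGTERM (and perhaps SIGHUP) handler before unblocking signals, and leaves
SIGQUIT to the quickdie handler that the bgworker machinery has already
installed. A rough, illustrative sketch (not the actual pg_hibernator
source; the names below are made up):

/* Illustrative sketch of typical bgworker signal setup. */
static volatile sig_atomic_t got_sigterm = false;

static void
worker_sigterm_handler(SIGNAL_ARGS)
{
	got_sigterm = true;			/* main loop notices this and winds down */
}

static void
buffer_saver_main(Datum main_arg)
{
	/* SIGQUIT is deliberately left alone: bgworker_quickdie handles it. */
	pqsignal(SIGTERM, worker_sigterm_handler);
	BackgroundWorkerUnblockSignals();

	while (!got_sigterm)
	{
		/* wait on a latch / do periodic work here */
	}

	/* SIGTERM means a smart/fast shutdown: save the buffer list now. */
}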

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposing pg_hibernate

2014-06-06 Thread Gurjeet Singh
On Thu, Jun 5, 2014 at 11:32 PM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Thu, Jun 5, 2014 at 5:39 PM, Gurjeet Singh gurj...@singh.im wrote:

  On Tue, Jun 3, 2014 at 5:43 PM, Gurjeet Singh gurj...@singh.im wrote:
 Case 2 also won't cause any buffer restores because the save-files are
 created only on clean shutdowons; not on a crash or immediate
 shutdown.

 How do you ensure that buffers are saved only on clean shutdown?

The postmaster sends SIGTERM only on smart or fast shutdown requests.

 Buffer saver process itself can crash while saving or restoring
 buffers.

True. That may lead to a partial list of buffers being saved. And the
code in the Reader process tries hard to read only valid data, and punts
at the first sight of data that doesn't make sense, or on an ERROR
raised from a Postgres API call.
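
To make that concrete, the defensive loop is shaped roughly like the
sketch below. This is a simplified illustration with made-up helper names
(read_next_entry, restore_block), not the code from the attached tarball:

/* Simplified sketch of the "read only valid data, punt on error" idea. */
static void
restore_from_savefile(FILE *fp)
{
	Oid			filenode;
	ForkNumber	forknum;
	BlockNumber	blocknum;

	PG_TRY();
	{
		while (read_next_entry(fp, &filenode, &forknum, &blocknum))
		{
			/* Punt at the first entry that doesn't make sense. */
			if (forknum > MAX_FORKNUM || blocknum == InvalidBlockNumber)
				break;

			restore_block(filenode, forknum, blocknum); /* may elog(ERROR) */
		}
	}
	PG_CATCH();
	{
		/* Swallow the error and stop; the rest of the file is suspect. */
		FlushErrorState();
	}
	PG_END_TRY();
}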

 IIUC on shutdown request, postmaster will send signal to BG Saver
 and BG Saver will save the buffers and then postmaster will send
 signal to checkpointer to shutdown.  So before writing Checkpoint
 record, BG Saver can crash (it might have saved half the buffers)

Case handled as described above.

 or may BG saver saves buffers, but checkpointer crashes (due to
 power outage or any such thing).

The checkpointer process' crash seems to be irrelevant to Postgres
Hibernator's workings.

I think you are trying to argue about the wording of my claim "save-files
are created only on clean shutdowns; not on a crash or immediate
shutdown", by implying that a crash may occur at any time during and
after the BufferSaver processing. I agree the wording can be improved.
How about

"... save-files are created only when Postgres is requested to shut down
in normal (smart or fast) modes."

Note that I am leaving out the mention of crash.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposing pg_hibernate

2014-06-05 Thread Gurjeet Singh
On Wed, Jun 4, 2014 at 12:54 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Tue, Jun 3, 2014 at 5:43 PM, Gurjeet Singh gurj...@singh.im wrote:

 For sizeable shared_buffers size, the restoration of the shared
 buffers can take several seconds.

 Incase of recovery, the shared buffers saved by this utility are
 from previous shutdown which doesn't seem to be of more use
 than buffers loaded by recovery.

I feel the need to enumerate the recovery scenarios we're talking
about so that we're all on the same page.

1) Hot backup (cp/rsync/pg_basebackup/.. while the master was running)
followed by
  1a) recovery using archives or streaming replication.
1a.i) database in hot-standby mode
1a.ii) database not in hot-standby mode, i.e. it's in warm-standby mode.
  1b) minimal recovery, that is, recover only the WAL available in
pg_xlog, then come online.

2) Cold backup of a crashed master, followed by startup of the copy
(causing crash recovery; IMHO same as case 1b above.).

3) Cold backup of clean-shutdown master, followed by startup of the
copy (no recovery).

In cases 1.x there won't be any save-files (*), because the
BlockReader processes remove their respective save-file when they are
done restoring the buffers. So the hot/warm-standby created thus will
not inherit the save-files, and hence post-recovery will not cause any
buffer restores.

Case 2 also won't cause any buffer restores because the save-files are
created only on clean shutdowns; not on a crash or immediate
shutdown.

Case 3 is the sweet spot of pg_hibernator. It will save the buffer list
on shutdown, and restore those buffers when the backup-copy is started
(provided pg_hibernator is installed there).

(*) If a hot-backup is taken immediately after the database comes online,
since the BlockReaders may still be running and may not have deleted the
save-files, the save-files may end up in the backup, and hence cause the
recovery-time conflicts we're talking about. This should be rare in
practice, and even when it does happen, at worst it will affect the
initial performance of the cluster.

 I have a feeling the users wouldn't
 like their master database take up to a few minutes to start accepting
 connections.

 I think this is fair point and to address this we can have an option to
 decide when to load buffer's and have default value as load before
 recovery.

Given the above description, I don't think crash/archive recovery is a
concern anymore. But if that corner case is still a concern, I
wouldn't favour making recovery slow by default, and making users of
pg_hibernator pay for choosing to use the extension. I'd prefer the
user explicitly ask for a behaviour that makes startups slow.

 One other point:
 Note that the BuffersSaver process exists at all times, even when this
 parameter is set to `false`. This is to allow the DBA to enable/disable
 the
 extension without having to restart the server. The BufferSaver process
 checks this parameter during server startup and right before shutdown, and
 honors this parameter's value at that time.

 Why can't it be done when user register's the extension by using dynamic
 background facility RegisterDynamicBackgroundWorker?

There's no user interface to this extension except for the 3 GUC
parameters; not even CREATE EXTENSION. The DBA is expected to append
this extension's name to shared_preload_libraries.

Since this extension declares one of its parameters as PGC_POSTMASTER, it
can't be loaded via the SQL LOAD command.

postgres=# load 'pg_hibernator';
FATAL:  cannot create PGC_POSTMASTER variables after startup
FATAL:  cannot create PGC_POSTMASTER variables after startup
The connection to the server was lost. Attempting reset: Succeeded.

Best regards,

PS: I was out sick yesterday, so couldn't respond promptly.
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposing pg_hibernate

2014-06-05 Thread Gurjeet Singh
On Wed, Jun 4, 2014 at 2:52 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-06-04 14:50:39 -0400, Robert Haas wrote:
 The thing I was concerned about is that the system might have been in
 recovery for months.  What was hot at the time the base backup was
 taken seems like a poor guide to what will be hot at the time of
 promotion. Consider a history table, for example: the pages at the
 end, which have just been written, are much more likely to be useful
 than anything earlier.

 I'd assumed that the hibernation files would simply be excluded from the
 basebackup...

Yes, they will be excluded, provided the BlockReader processes have
finished, because each BlockReader unlinks its save-file after it is
done restoring buffers listed in it.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposing pg_hibernate

2014-06-05 Thread Gurjeet Singh
On Wed, Jun 4, 2014 at 2:50 PM, Robert Haas robertmh...@gmail.com wrote:
 The thing I was concerned about is that the system might have been in
 recovery for months.  What was hot at the time the base backup was
 taken seems like a poor guide to what will be hot at the time of
 promotion. Consider a history table, for example: the pages at the
 end, which have just been written, are much more likely to be useful
 than anything earlier.

I think you are specifically talking about a warm-standby that runs
recovery for months before being brought online. As described in my
response to Amit, if the base backup used to create that standby was
taken after the BlockReaders had restored the buffers (which should
complete within few minutes of startup, even for large databases),
then there's no concern since the base backup wouldn't contain the
save-files.

If it's a hot-standby, the restore process would start as soon as the
database starts accepting connections, finish soon after, and get
completely out of the way of the normal recovery process, at which
point the buffers populated by the recovery would compete only with
the buffers being requested by backends, which is the normal
behaviour.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposing pg_hibernate

2014-06-03 Thread Gurjeet Singh
On Tue, Jun 3, 2014 at 7:57 AM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, May 29, 2014 at 12:12 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 IMHO, all of these caveats, would affect a very small fraction of
 use-cases and are eclipsed by the benefits this extension provides in
 normal cases.

 I agree with you that there are only few corner cases where evicting
 shared buffers by this utility would harm, but was wondering if we could
 even save those, say if it would only use available free buffers.  I think
 currently there is no such interface and inventing a new interface for this
 case doesn't seem to reasonable unless we see any other use case of
 such a interface.

 It seems like it would be best to try to do this at cluster startup
 time, rather than once recovery has reached consistency.  Of course,
 that might mean doing it with a single process, which could have its
 own share of problems.  But I'm somewhat inclined to think that if
 recovery has already run for a significant period of time, the blocks
 that recovery has brought into shared_buffers are more likely to be
 useful than whatever pg_hibernate would load.

I am not absolutely sure of the order of execution between the recovery
process and the BGWorker, but ...

For a sizeable shared_buffers size, the restoration of the shared
buffers can take several seconds. I have a feeling the users wouldn't
like their master database taking up to a few minutes to start accepting
connections. From my tests [1], "In the 'App after Hibernator' [case]
... This took 70 seconds for reading the ~4 GB database."

[1]: 
http://gurjeet.singh.im/blog/2014/04/30/postgres-hibernator-reduce-planned-database-down-times/

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposing pg_hibernate

2014-06-03 Thread Gurjeet Singh
On Tue, Jun 3, 2014 at 8:13 AM, Gurjeet Singh gurj...@singh.im wrote:
 On Tue, Jun 3, 2014 at 7:57 AM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, May 29, 2014 at 12:12 AM, Amit Kapila amit.kapil...@gmail.com 
 wrote:
 IMHO, all of these caveats, would affect a very small fraction of
 use-cases and are eclipsed by the benefits this extension provides in
 normal cases.

 I agree with you that there are only few corner cases where evicting
 shared buffers by this utility would harm, but was wondering if we could
 even save those, say if it would only use available free buffers.  I think
 currently there is no such interface and inventing a new interface for this
 case doesn't seem to reasonable unless we see any other use case of
 such a interface.

 It seems like it would be best to try to do this at cluster startup
 time, rather than once recovery has reached consistency.  Of course,
 that might mean doing it with a single process, which could have its
 own share of problems.  But I'm somewhat inclined to think that if

Currently pg_hibernator uses the ReadBufferExtended() API, and AIUI, that
API requires a database connection/shared-memory attachment, and that
a backend process cannot switch between databases after the initial
connection.
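
For reference, the pattern is roughly the one sketched below: the worker
calls BackgroundWorkerInitializeConnection() once for its target database,
and from then on can only pull in blocks of relations belonging to that
database. This is a simplified illustration only (mapping a saved
relfilenode back to a relation OID, and all error handling, are left out):

/* Sketch of restoring one block via ReadBufferExtended(); assumes
 * BackgroundWorkerInitializeConnection() was already called. */
static void
restore_one_block(Oid relid, ForkNumber forknum, BlockNumber blocknum)
{
	Relation	rel;
	Buffer		buf;

	StartTransactionCommand();

	rel = relation_open(relid, AccessShareLock);

	/* Ignore block numbers past the current end of this fork. */
	if (blocknum < RelationGetNumberOfBlocksInFork(rel, forknum))
	{
		buf = ReadBufferExtended(rel, forknum, blocknum, RBM_NORMAL, NULL);
		ReleaseBuffer(buf);		/* drop the pin; the page stays cached */
	}

	relation_close(rel, AccessShareLock);

	CommitTransactionCommand();
}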

 own share of problems.  But I'm somewhat inclined to think that if
 recovery has already run for a significant period of time, the blocks
 that recovery has brought into shared_buffers are more likely to be
 useful than whatever pg_hibernate would load.

The applications that connect to a standby may have a different access
pattern than the applications that are operating on the master
database. So the buffers that are being restored by the startup process
may not be relevant to the workload on the standby.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposing pg_hibernate

2014-06-02 Thread Gurjeet Singh
On Fri, May 30, 2014 at 5:33 PM, Josh Kupershmidt schmi...@gmail.com wrote:
 On Tue, May 27, 2014 at 10:01 PM, Gurjeet Singh gurj...@singh.im wrote:

 When the Postgres server is being stopped/shut down, the `Buffer
 Saver` scans the
 shared-buffers of Postgres, and stores the unique block identifiers of
 each cached
 block to the disk. This information is saved under the 
 `$PGDATA/pg_hibernator/`
 directory. For each of the database whose blocks are resident in shared 
 buffers,
 one file is created; for eg.: `$PGDATA/pg_hibernator/2.postgres.save`.

 This file-naming convention seems a bit fragile. For example, on my
  filesystem (HFS) if I create a database named "foo / bar", I'll get a
 complaint like:

  ERROR:  could not open "pg_hibernator/5.foo / bar.save": No such file
 or directory

 during shutdown.

Thanks for the report. I have reworked the file naming, and now the
save-file name is simply '<integer>.save', so the name of a database
does not affect the file name on disk. Instead, the null-terminated
database name is now written in the file, and read back for use when
restoring the buffers.
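
In other words, each save-file now begins with a small self-describing
header. A simplified sketch of the two sides (not the exact code in the
attached tarball):

#include <stdio.h>
#include <string.h>
#include <stdbool.h>

/* BufferSaver side: write the database name, including its terminating
 * NUL, as the first bytes of the save-file; the block list follows. */
static void
write_dbname(FILE *fp, const char *dbname)
{
	fwrite(dbname, strlen(dbname) + 1, 1, fp);
}

/* BlockReader side: read bytes up to the NUL, punting on anything odd. */
static bool
read_dbname(FILE *fp, char *dbname, size_t buflen)
{
	size_t		i;

	for (i = 0; i < buflen; i++)
	{
		int			c = fgetc(fp);

		if (c == EOF)
			return false;		/* truncated file: give up */
		dbname[i] = (char) c;
		if (c == '\0')
			return true;		/* complete, NUL-terminated name */
	}
	return false;				/* implausibly long name: give up */
}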

Attached is the new version of pg_hibernator, with updated code and README.

Just a heads-up for anyone who might have read/reviewed the previous
version's code: there are some unrelated trivial code and Makefile
changes in this version as well, which can be easily spotted by a
`diff -r`.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com


pg_hibernator.v2.tgz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposing pg_hibernate

2014-05-29 Thread Gurjeet Singh
On May 29, 2014 12:12 AM, Amit Kapila amit.kapil...@gmail.com wrote:

 I agree with you that there are only few corner cases where evicting
 shared buffers by this utility would harm, but was wondering if we could
 even save those, say if it would only use available free buffers.  I think
 currently there is no such interface and inventing a new interface for
this
 case doesn't seem to reasonable unless we see any other use case of
 such a interface.

+1


Re: [HACKERS] Proposing pg_hibernate

2014-05-28 Thread Gurjeet Singh
On Wed, May 28, 2014 at 2:15 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Wed, May 28, 2014 at 7:31 AM, Gurjeet Singh gurj...@singh.im wrote:
   Caveats
 --

 - Buffer list is saved only when Postgres is shutdown in smart and
 fast modes.

 That is, buffer list is not saved when database crashes, or on
 immediate
 shutdown.

 - A reduction in `shared_buffers` is not detected.

 If the `shared_buffers` is reduced across a restart, and if the
 combined
 saved buffer list is larger than the new shared_buffers, Postgres
 Hibernator continues to read and restore blocks even after
 `shared_buffers`
 worth of buffers have been restored.

 How about the cases when shared buffers already contain some
 data:
 a. Before Readers start filling shared buffers, if this cluster wishes
 to join replication as a slave and receive the data from master, then
 this utility might need to evict some buffers filled during startup
 phase.

A cluster that wishes to be a replication standby would do so while
it's in the startup phase. The BlockReaders are launched immediately
upon the cluster reaching a consistent state, at which point, I presume,
in most of the cases, most of the buffers would be unoccupied. Hence
the BlockReaders might evict the occupied buffers, which may be a small
fraction of the total shared_buffers.

 b. As soon as the server completes startup (reached consistent
 point), it allows new connections which can also use some shared
 buffers before Reader process could use shared buffers or are you
 planing to change the time when users can connect to database.

The BlockReaders are launched immediately after the cluster reaches
a consistent state, that is, just about when it is ready to accept
connections. So yes, there is a possibility that the I/O caused by the
BlockReaders may affect the performance of queries executed right at
cluster startup. But given that the performance of those queries was
anyway going to be low (because of empty shared buffers), and that
BlockReaders tend to cause sequential reads, and that by default
there's only one BlockReader active at a time, I think this won't be a
concern in most of the cases. By the time the shared buffers start
getting filled up, the buffer replacement strategy will evict any
buffers populated by BlockReaders if they are not used by the normal
queries.

In the 'Sample Runs' section of my blog [1], I compared the cases
'Hibernator w/ App' and 'Hibernator then App', which demonstrate that
launching application load while the BlockReaders are active does
cause performance of both to be impacted by each other. But overall
it's a net win for application performance.

 I am not sure if replacing shared buffer contents in such cases can
 always be considered useful.

IMHO, all of these caveats would affect a very small fraction of
use-cases and are eclipsed by the benefits this extension provides in
normal cases.

[1]: 
http://gurjeet.singh.im/blog/2014/04/30/postgres-hibernator-reduce-planned-database-down-times/

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposing pg_hibernate

2014-05-27 Thread Gurjeet Singh
The BufferSaver process checks this parameter during server startup and right before shutdown, and
honors this parameter's value at that time.

To enable/disable Postgres Hibernator at runtime, change the value in
`postgresql.conf` and use `pg_ctl reload` to make Postgres re-read the new
parameter values from `postgresql.conf`.

Default value: `true`.

- `pg_hibernator.parallel`

This parameter controls whether Postgres Hibernator launches the BlockReader
processes in parallel, or sequentially, waiting for the current BlockReader to
exit before launching the next one.

When enabled, all the BlockReaders, one for each database, will be launched
simultaneously, and this may cause a huge random-read flood on the disks if there
are many databases in the cluster. This may also cause some BlockReaders to fail
to launch successfully because of the `max_worker_processes` limit.

Default value: `false`.

- `pg_hibernator.default_database`

The BufferSaver process needs to connect to a database in order to perform
the database-name lookups etc. This parameter controls which database the
BufferSaver process connects to for performing these operations.

Default value: `postgres`.

Caveats
--

- Buffer list is saved only when Postgres is shut down in smart and
fast modes.

That is, the buffer list is not saved when the database crashes, or on immediate
shutdown.

- A reduction in `shared_buffers` is not detected.

If `shared_buffers` is reduced across a restart, and if the combined
saved buffer list is larger than the new shared_buffers, Postgres
Hibernator continues to read and restore blocks even after `shared_buffers`
worth of buffers have been restored.

FAQ
--

- What is the relationship between `pg_buffercache`, `pg_prewarm`, and
`pg_hibernator`?

They all allow you to do different things with Postgres' shared buffers.

+ pg_buffercache:

Inspect and show contents of shared buffers

+ pg_prewarm:

Load some table/index/fork blocks into shared buffers. The user needs
to tell it which blocks to load.

+ pg_hibernator:

Upon shutdown, saves the list of blocks stored in shared buffers. Upon
startup, loads those blocks back into shared buffers.

The goal of Postgres Hibernator is to be invisible to the user/DBA.
Whereas with `pg_prewarm` the user needs to know a lot of stuff about
what they really want to do, most likely information gathered via
`pg_buffercache`.

- Does `pg_hibernate` use either `pg_buffercache` or `pg_prewarm`?

No, Postgres Hibernator works all on its own.

If the concern is, "Do I have to install pg_buffercache and pg_prewarm
to use pg_hibernator?", the answer is no. pg_hibernator is a stand-alone
extension, although influenced by pg_buffercache and pg_prewarm.

With `pg_prewarm` you can load blocks of **only** the database
you're connected
to. So if you have `N` databases in your cluster, to restore blocks of all
databases, the DBA will have to connect to each database and invoke
`pg_prewarm` functions.

With `pg_hibernator`, the DBA isn't required to do anything, let alone
connect to the database!

- Where can I learn more about it?

There are a couple of blog posts and initial proposal to Postgres
hackers' mailing list. They may provide a better understanding of
Postgres Hibernator.


[Proposal](http://www.postgresql.org/message-id/cabwtf4ui_anag+ybsefunah5z6de9aw2npdy4hryk+m5odx...@mail.gmail.com)

[Introducing Postgres
Hibernator](http://gurjeet.singh.im/blog/2014/02/03/introducing-postgres-hibernator/)

[Demonstrating Performance
Benefits](http://gurjeet.singh.im/blog/2014/04/30/postgres-hibernator-reduce-planned-database-down-times/)






On Mon, Feb 3, 2014 at 7:18 PM, Gurjeet Singh gurj...@singh.im wrote:
 Please find attached the pg_hibernate extension. It is a
 set-it-and-forget-it solution to enable hibernation of Postgres
 shared-buffers. It can be thought of as an amalgam of pg_buffercache and
 pg_prewarm.

 It uses the background worker infrastructure. It registers one worker
 process (BufferSaver) to save the shared-buffer metadata when server is
 shutting down, and one worker per database (BlockReader) when restoring the
 shared buffers.

 It stores the buffer metadata under $PGDATA/pg_database/, one file per
 database, and one separate file for global objects. It sorts the list of
 buffers before storage, so that when it encounters a range of consecutive
 blocks of a relation's fork, it stores that range as just one entry, hence
 reducing the storage and I/O overhead.

 On-disk binary format, which is used to create the per-database save-files,
 is defined as:
 1. One or more relation filenodes; stored as r<relfilenode>.
 2. Each relation is followed by one or more fork numbers; stored as
 f<forknumber>
 3. Each fork number is followed by one or more block numbers; stored as
 b<blocknumber>
 4. Each block number is followed by zero or more range numbers; stored

Re: [HACKERS] review: Non-recursive processing of AND/OR lists

2014-04-24 Thread Gurjeet Singh
On Thu, Jul 18, 2013 at 1:54 PM, Gurjeet Singh gurj...@singh.im wrote:

 On Thu, Jul 18, 2013 at 10:19 AM, Tom Lane t...@sss.pgh.pa.us wrote:


  Because simpler code is less likely to have bugs and is easier to
  maintain.

 I agree with that point, but one should also remember Polya's Inventor's
 Paradox: the more general problem may be easier to solve.  That is, if
 done right, code that fully flattens an AND tree might actually be
 simpler than code that does just a subset of the transformation.  The
 current patch fails to meet this expectation,


 The current patch does completely flatten any type of tree
 (left/right-deep or bushy) without recursing, and right-deep and bushy tree
 processing is what Robert is recommending to defer to recursive processing.
 Maybe I haven't considered a case where it doesn't flatten the tree; do you
 have an example in mind.


 but maybe you just haven't
 thought about it the right way.


I tried to eliminate the 'pending' list, but I don't see a way around it.
We need temporary storage somewhere to store the branches encountered on
the right; in the recursive case, the call stack was serving that purpose.
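
For clarity, the shape of the iteration is roughly the following. This is
only a hedged sketch of the idea, not the patch itself (which builds the
BoolExpr as it goes):

/* Flatten an AND/OR tree iteratively, using an explicit 'pending' list
 * where the recursive version used the call stack. */
static List *
flatten_bool_args(ParseState *pstate, A_Expr *root)
{
	A_Expr_Kind kind = root->kind;		/* AEXPR_AND or AEXPR_OR */
	List	   *args = NIL;				/* transformed, flattened arguments */
	List	   *pending = list_make1(root);

	while (pending != NIL)
	{
		Node	   *node = (Node *) linitial(pending);

		pending = list_delete_first(pending);

		if (IsA(node, A_Expr) && ((A_Expr *) node)->kind == kind)
		{
			A_Expr	   *a = (A_Expr *) node;

			/* Same operator: stack both children instead of recursing. */
			pending = lcons(a->rexpr, pending);
			pending = lcons(a->lexpr, pending);
		}
		else
			args = lappend(args, transformExprRecurse(pstate, node));
	}

	return args;
}

The 'pending' list here is exactly the temporary storage mentioned above.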



 My concerns about this patch have little to do with that, though, and
 much more to do with the likelihood that it breaks some other piece of
 code that is expecting AND/OR to be strictly binary operators, which
 is what they've always been in parsetrees that haven't reached the
 planner.  It doesn't appear to me that you've done any research on that
 point whatsoever


 No, I haven't, and I might not be able to research it for a few more weeks.


There are about 30 files (including the optimizer and executor) that match a
case-insensitive search for BoolExpr, and I scanned those for the usage of
the member 'args'. In all the instances where BoolExpr.args is accessed,
it's treated as a null-terminated list. There's one exception that I
could find, and it was in the context of a NOT expression: not_clause() in
clauses.c.




 you have not even updated the comment for BoolExpr
 (in primnodes.h) that this patch falsifies.


 I will fix that.


I think this line in that comment already covers the fact that in some
special cases a BoolExpr may have more than 2 arguments.

There are also a few special cases where more arguments can appear before
optimization.

I have updated the comment nevertheless, and removed another comment in
parse_expr.c that claimed to be the only place where a BoolExpr with more
than 2 args is generated.

I have isolated the code for right-deep and bushy tree processing via the
macro PROCESS_BUSHY_TREES. Also, I have shortened some variable names while
retaining their meaning.

Please find the updated patch attached (based on master).

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com http://www.enterprisedb.com
diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 81c9338..eb35d70 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -41,8 +41,7 @@ bool		Transform_null_equals = false;
 static Node *transformExprRecurse(ParseState *pstate, Node *expr);
 static Node *transformParamRef(ParseState *pstate, ParamRef *pref);
 static Node *transformAExprOp(ParseState *pstate, A_Expr *a);
-static Node *transformAExprAnd(ParseState *pstate, A_Expr *a);
-static Node *transformAExprOr(ParseState *pstate, A_Expr *a);
+static Node *transformAExprAndOr(ParseState *pstate, A_Expr *a);
 static Node *transformAExprNot(ParseState *pstate, A_Expr *a);
 static Node *transformAExprOpAny(ParseState *pstate, A_Expr *a);
 static Node *transformAExprOpAll(ParseState *pstate, A_Expr *a);
@@ -224,10 +223,10 @@ transformExprRecurse(ParseState *pstate, Node *expr)
 		result = transformAExprOp(pstate, a);
 		break;
 	case AEXPR_AND:
-		result = transformAExprAnd(pstate, a);
+		result = transformAExprAndOr(pstate, a);
 		break;
 	case AEXPR_OR:
-		result = transformAExprOr(pstate, a);
+		result = transformAExprAndOr(pstate, a);
 		break;
 	case AEXPR_NOT:
 		result = transformAExprNot(pstate, a);
@@ -918,32 +917,102 @@ transformAExprOp(ParseState *pstate, A_Expr *a)
 	return result;
 }
 
+/*
+ * Transform the AND/OR trees non-recursively.
+ *
+ * The parser turns a list of consecutive AND expressions into a left-deep tree.
+ *
+ * a AND b AND c
+ *
+ *          AND
+ *         /   \
+ *       AND    c
+ *      /   \
+ *     a     b
+ *
+ * For very long lists, it gets deep enough that processing it recursively causes
+ * check_stack_depth() to raise error and abort the query. Hence, it is necessary
+ * that we process these trees iteratively.
+ */
 static Node *
-transformAExprAnd(ParseState *pstate, A_Expr *a)
+transformAExprAndOr(ParseState *pstate, A_Expr *a)
 {
-	Node	   *lexpr = transformExprRecurse(pstate, a->lexpr);
-	Node	   *rexpr = transformExprRecurse(pstate, a->rexpr);
+#define PROCESS_BUSHY_TREES 1
+	List		   *exprs = NIL;
+#if PROCESS_BUSHY_TREES

[HACKERS] Patch to send transaction commit/rollback stats to the stats collector unconditionally.

2014-03-19 Thread Gurjeet Singh
Please find attached the patch to send transaction commit/rollback stats to
the stats collector unconditionally.

Without this patch, the transactions that do not read from/write to a
database table do not get reported to the stats collector until the client
disconnects. Hence the transaction counts in pg_stat_database do not
increment gradually as one would expect them to. But when such a backend
disconnects, the counts jump dramatically, giving the impression that the
database processed a lot of transactions (potentially thousands) in an
instant.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com http://www.enterprisedb.com
commit 500d11f96b21552872bad4e9655bd45db28efabd
Author: Gurjeet Singh gurj...@singh.im
Date:   Wed Mar 19 13:41:55 2014 -0400

Send transaction stats to the collector even if no table stats to
report.

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3dc280a..47008ed 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -825,11 +825,11 @@ pgstat_report_stat(bool force)
 	}
 
 	/*
-	 * Send partial messages.  If force is true, make sure that any pending
-	 * xact commit/abort gets counted, even if no table stats to send.
+	 * Send partial messages.  Make sure that any pending xact commit/abort gets
+	 * counted, even if no table stats to send.
 	 */
 	if (regular_msg.m_nentries > 0 ||
-		(force && (pgStatXactCommit > 0 || pgStatXactRollback > 0)))
+		force || (pgStatXactCommit > 0 || pgStatXactRollback > 0))
 		pgstat_send_tabstat(&regular_msg);
 	if (shared_msg.m_nentries > 0)
 		pgstat_send_tabstat(&shared_msg);



Re: [HACKERS] Patch to send transaction commit/rollback stats to the stats collector unconditionally.

2014-03-19 Thread Gurjeet Singh
On Wed, Mar 19, 2014 at 4:22 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Gurjeet Singh gurj...@singh.im writes:
  Please find attached the patch to send transaction commit/rollback stats
 to
  stats collector unconditionally.

 That's intentional to reduce stats traffic.  What kind of performance
 penalty does this patch impose?  If the number of such transactions is
 large enough to create a noticeable jump in the counters, I would think
 that this would be a pretty darn expensive fix.


I can't speak to the performance impact of this patch, except that it would
depend on the fraction of transactions that behave this way. Perhaps the
people who develop and/or aggressively use monitoring can pitch in.

Presumably, on heavily used systems these transactions would form a small
fraction. On relatively idle systems these transactions may be a larger
fraction but that wouldn't affect the users since the database is not under
stress anyway.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com http://www.enterprisedb.com


Re: [HACKERS] Patch to send transaction commit/rollback stats to the stats collector unconditionally.

2014-03-19 Thread Gurjeet Singh
On Wed, Mar 19, 2014 at 7:27 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Alvaro Herrera alvhe...@2ndquadrant.com writes:
  I'm not sure I understand the point of this whole thing.  Realistically,
  how many transactions are there that do not access any database tables?

 I think that something like select * from pg_stat_activity might not
 bump any table-access counters, once the relevant syscache entries had
 gotten loaded.  You could imagine that a monitoring app would do a long
 series of those and nothing else (whether any actually do or not is a
 different question).


+1. This is exactly what I was doing; querying pg_stat_database from a psql
session, to track transaction counters.



 But still, it's a bit hard to credit that this patch is solving any real
 problem.  Where's the user complaints about the existing behavior?


Consider my original email a user complaint, albeit with a patch attached.
I was trying to ascertain TPS on a customer's instance when I noticed this
behaviour; it violated POLA. Had I not decided to dig through the code to
find the source of this behaviour, as a regular user I would've reported
this anomaly as a bug, or maybe turned a blind eye to it labeling it as a
quirky behaviour.


 That is, even granting that anybody has a workload that acts like this,
 why would they care ...


To avoid surprises!

This behaviour, at least in my case, causes Postgres to misreport the very
thing I am trying to calculate.


 and are they prepared to take a performance hit
 to avoid the counter jump after the monitoring app exits?


It doesn't look like we know how much of a performance hit this will
cause. I don't see any reasons cited in the code, either. Could this be a
case of premature optimization? The commit log shows that, during the
overhaul (commit 77947c5), this check was put in place to emulate what the
previous code did; code before that commit reported stats, including
transaction stats, only if there were any regular or shared table stats to
report.

Besides, there's already a throttle built in using the PGSTAT_STAT_INTERVAL
limit.
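
For reference, that throttle amounts to the check sketched below, a
paraphrase of the logic at the top of pgstat_report_stat() rather than a
verbatim excerpt:

#include "postgres.h"
#include "pgstat.h"
#include "utils/timestamp.h"

/*
 * Paraphrase of the throttle: unless 'force' is set, reports issued less
 * than PGSTAT_STAT_INTERVAL milliseconds after the previous one are skipped.
 */
static bool
report_is_throttled(bool force)
{
    static TimestampTz last_report = 0;
    TimestampTz now = GetCurrentTimestamp();

    if (!force &&
        !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
        return true;

    last_report = now;
    return false;
}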

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com http://www.enterprisedb.com


Re: [HACKERS] Cleaner build output when not much has changed

2014-03-11 Thread Gurjeet Singh
On Mon, Mar 10, 2014 at 8:12 PM, Alvaro Herrera alvhe...@2ndquadrant.comwrote:

 Gurjeet Singh wrote:
  On Tue, Nov 26, 2013 at 12:37 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 
   Gurjeet Singh gurj...@singh.im writes:
I was looking for ways to reduce the noise in Postgres make output,
specifically, I wanted to eliminate the Nothing to be done for
 `all' 
messages, since they don't add much value, and just add to the
 clutter.
  
   Why don't you just use make -s if you don't want to see that?
   The example output you show is not much less verbose than before.
 
  I have a shell function that now adds --no-print-directory to my make
  command. This patch combined with that switch makes the output really
 clean
  (at least from my perspective). Since the use of a command-line switch
 can
  be easily left to personal choice, I am not proposing to add that or its
  makefile-equivalent. But modifying the makefiles to suppress noise is not
  that everyone can be expected to do, and do it right.

 FWIW you can add a src/Makefile.custom file with this:

 all:
 @true

 and it will get rid of most noise.


As I noted in the first email in this chain, this causes a warning:

GNUmakefile:14: warning: overriding commands for target `all'
/home/gurjeet/dev/POSTGRES/src/Makefile.custom:2: warning: ignoring old
commands for target `all'

I have since settled for `make -s`. On slow builds it keeps me guessing for
a long time, without any indication of progress, but I've learned to live
with that.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com http://www.enterprisedb.com


[HACKERS] Proposing pg_hibernate

2014-02-03 Thread Gurjeet Singh
Please find attached the pg_hibernate extension. It is a
set-it-and-forget-it solution to enable hibernation of Postgres
shared-buffers. It can be thought of as an amalgam of pg_buffercache and
pg_prewarm.

It uses the background worker infrastructure. It registers one worker
process (BufferSaver) to save the shared-buffer metadata when the server is
shutting down, and one worker per database (BlockReader) when restoring the
shared buffers.

It stores the buffer metadata under $PGDATA/pg_database/, one file per
database, and one separate file for global objects. It sorts the list of
buffers before storage, so that when it encounters a range of consecutive
blocks of a relation's fork, it stores that range as just one entry, hence
reducing the storage and I/O overhead.

The on-disk binary format, which is used to create the per-database
save-files, is defined as:
1. One or more relation filenodes; stored as <r><relfilenode>.
2. Each relation is followed by one or more fork numbers; stored as
<f><forknumber>.
3. Each fork number is followed by one or more block numbers; stored as
<b><blocknumber>.
4. Each block number is followed by zero or more range numbers; stored as
<N><number>.

  {<r> {<f> {<b> <N>* }+ }+ }+
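
To make the grammar concrete, the sketch below shows how consecutive blocks
of one fork could be collapsed into a <b> entry plus an optional <N> range
entry. The helper names and the byte-level encoding are assumptions made
purely for illustration; the extension's own BufferSaver code defines the
actual format:

#include <stdio.h>
#include <stdint.h>

/* Hypothetical helper: a one-byte marker followed by the value in binary. */
static void
write_tagged(FILE *fp, char tag, uint32_t value)
{
    fwrite(&tag, 1, 1, fp);
    fwrite(&value, sizeof(value), 1, fp);
}

/*
 * Collapse the (sorted) block numbers of one relation fork into <b> entries,
 * each followed by an optional <N> entry giving the length of a consecutive
 * run, per the grammar above.  Sketch only.
 */
static void
emit_block_list(FILE *fp, const uint32_t *blocks, int nblocks)
{
    int         i = 0;

    while (i < nblocks)
    {
        int         run = 0;

        /* Count how many blocks follow blocks[i] consecutively. */
        while (i + run + 1 < nblocks &&
               blocks[i + run + 1] == blocks[i + run] + 1)
            run++;

        write_tagged(fp, 'b', blocks[i]);
        if (run > 0)
            write_tagged(fp, 'N', (uint32_t) run);

        i += run + 1;
    }
}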

Possible enhancements:
- Ability to save/restore only specific databases.
- Control how many BlockReaders are active at a time; to avoid I/O storms.
- Be smart about lowered shared_buffers across the restart.
- Different modes of reading like pg_prewarm does.
- Include PgFincore functionality, at least for Linux platforms.

The extension currently works with PG 9.3, and may work on 9.4 without any
changes; I haven't tested, though.  If not, I think it'd be easy to port to
HEAD/PG 9.4. I see that 9.4 has put a cap on maximum background workers via
a GUC, and since my aim is to provide a non-intrusive no-tuning-required
extension, I'd like to use the new dynamic-background-worker infrastructure
in 9.4, which doesn't seem to have any preset limits (I think it's limited
by max_connections, but I may be wrong). I can work on a 9.4 port, if there's
interest in including this as a contrib/ module.

To see the extension in action:

.) Compile it.
.) Install it.
.) Add it to shared_preload_libraries.
.) Start/restart Postgres.
.) Install pg_buffercache extension, to inspect the shared buffers.
.) Note the result of pg_buffercache view.
.) Work on your database to fill up the shared buffers.
.) Note the result of pg_buffercache view, again; there should be more
blocks than last time we checked.
.) Stop and start the Postgres server.
.) Note the output of pg_buffercache view; it should contain the blocks
seen just before the shutdown.
.) Future server restarts will automatically save and restore the blocks in
shared-buffers.

The code is also available as Git repository at
https://github.com/gurjeet/pg_hibernate/

Demo:

$ make -C contrib/pg_hibernate/
$ make -C contrib/pg_hibernate/ install
$ vi $B/db/data/postgresql.conf
$ grep shared_preload_libraries $PGDATA/postgresql.conf
shared_preload_libraries = 'pg_hibernate'   # (change requires restart)

$ pgstart
waiting for server to start done
server started

$ pgsql -c 'create extension pg_buffercache;'
CREATE EXTENSION

$ pgsql -c 'select count(*) from pg_buffercache where relblocknumber is not
null group by reldatabase;'
 count
---
   163
14
(2 rows)

$ pgsql -c 'create table test_hibernate as select s as a, s::char(1000) as
b from generate_series(1, 10) as s;'
SELECT 10

$ pgsql -c 'create index on test_hibernate(a);'
CREATE INDEX

$ pgsql -c 'select count(*) from pg_buffercache where relblocknumber is not
null group by reldatabase;'
 count
---
  2254
14
(2 rows)

$ pgstop
waiting for server to shut down... done
server stopped

$ pgstart
waiting for server to start done
server started

$ pgsql -c 'select count(*) from pg_buffercache where relblocknumber is not
null group by reldatabase;'
 count
---
  2264
17
(2 rows)

There are a few more blocks than at the time they were saved, but all the
blocks from before the restart are present in shared buffers after the
restart.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com http://www.enterprisedb.com


pg_hibernate.tgz
Description: GNU Zip compressed data



[HACKERS] Minor improvements to sslinfo contrib module

2014-01-17 Thread Gurjeet Singh
Please find attached the patch that fixes a couple of comments, and adds
'static' qualifier to functions that are not used anywhere else in the code
base.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com http://www.enterprisedb.com
diff --git a/contrib/sslinfo/sslinfo.c b/contrib/sslinfo/sslinfo.c
index 7a58470..18405f6 100644
--- a/contrib/sslinfo/sslinfo.c
+++ b/contrib/sslinfo/sslinfo.c
@@ -31,9 +31,10 @@ Datum		ssl_client_dn_field(PG_FUNCTION_ARGS);
 Datum		ssl_issuer_field(PG_FUNCTION_ARGS);
 Datum		ssl_client_dn(PG_FUNCTION_ARGS);
 Datum		ssl_issuer_dn(PG_FUNCTION_ARGS);
-Datum		X509_NAME_field_to_text(X509_NAME *name, text *fieldName);
-Datum		X509_NAME_to_text(X509_NAME *name);
-Datum		ASN1_STRING_to_text(ASN1_STRING *str);
+
+static Datum	X509_NAME_field_to_text(X509_NAME *name, text *fieldName);
+static Datum	X509_NAME_to_text(X509_NAME *name);
+static Datum	ASN1_STRING_to_text(ASN1_STRING *str);
 
 
 /*
@@ -51,7 +52,7 @@ ssl_is_used(PG_FUNCTION_ARGS)
 
 
 /*
- * Returns SSL cipher currently in use.
+ * Returns SSL version currently in use.
  */
 PG_FUNCTION_INFO_V1(ssl_version);
 Datum
@@ -77,7 +78,7 @@ ssl_cipher(PG_FUNCTION_ARGS)
 
 
 /*
- * Indicates whether current client have provided a certificate
+ * Indicates whether current client provided a certificate
  *
  * Function has no arguments.  Returns bool.  True if current session
  * is SSL session and client certificate is verified, otherwise false.



Re: [HACKERS] Shave a few instructions from child-process startup sequence

2013-12-01 Thread Gurjeet Singh
On Wed, Nov 27, 2013 at 9:12 AM, Robert Haas robertmh...@gmail.com wrote:


 This is a performance patch, so it should come with benchmark results
 demonstrating that it accomplishes its intended purpose.  I don't see
 any.


Yes, this is a performance patch, but as the subject says, it saves a few
instructions. I don't know how to write a test case that can measure
savings of skipping a few instructions in a startup sequence that
potentially takes thousands, or even millions, of instructions.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB Corp. www.EnterpriseDB.com http://www.enterprisedb.com


[HACKERS] Cleaner build output when not much has changed

2013-11-26 Thread Gurjeet Singh
I was looking for ways to reduce the noise in Postgres make output,
specifically, I wanted to eliminate the Nothing to be done for `all' 
messages, since they don't add much value, and just ad to the clutter.

Most of the solutions I have seen propose grepping out the noisy parts. But
one of them proposed adding a .PHONY rule and then adding a no-op command
to a target's recipe. Attached is a small patch to a few makefiles which
helps remove the above mentioned message. Following is a sample output:

...
make[3]: Entering directory
`/home/gurjeet/dev/pgdbuilds/quiet_make/src/bin/initdb'
make -C ../../../src/port all
make[4]: Entering directory
`/home/gurjeet/dev/pgdbuilds/quiet_make/src/port'
make -C ../backend submake-errcodes
make[5]: Entering directory
`/home/gurjeet/dev/pgdbuilds/quiet_make/src/backend'
make[5]: Leaving directory
`/home/gurjeet/dev/pgdbuilds/quiet_make/src/backend'
make[4]: Leaving directory `/home/gurjeet/dev/pgdbuilds/quiet_make/src/port'
make -C ../../../src/common all
make[4]: Entering directory
`/home/gurjeet/dev/pgdbuilds/quiet_make/src/common'
make -C ../backend submake-errcodes
make[5]: Entering directory
`/home/gurjeet/dev/pgdbuilds/quiet_make/src/backend'
make[5]: Leaving directory
`/home/gurjeet/dev/pgdbuilds/quiet_make/src/backend'
make[4]: Leaving directory
`/home/gurjeet/dev/pgdbuilds/quiet_make/src/common'
make[3]: Leaving directory
`/home/gurjeet/dev/pgdbuilds/quiet_make/src/bin/initdb'
make[2]: Leaving directory `/home/gurjeet/dev/pgdbuilds/quiet_make/src/bin'
make -C pl all
make[2]: Entering directory `/home/gurjeet/dev/pgdbuilds/quiet_make/src/pl'
...

The noise can be further reduced by adding the --no-print-directory switch,
which yeilds the following output:

...
make -C ../backend submake-errcodes
make -C psql all
make -C ../../../src/interfaces/libpq all
make -C ../../../src/port all
make -C ../backend submake-errcodes
make -C ../../../src/common all
make -C ../backend submake-errcodes
make -C scripts all
make -C ../../../src/interfaces/libpq all
make -C ../../../src/port all
...

The only downside I see to this patch is that it emits this warning in the
beginning:

GNUmakefile:14: warning: overriding commands for target `all'
src/Makefile.global:29: warning: ignoring old commands for target `all'

This is from the recipe that emits the message "All of PostgreSQL
successfully made. Ready to install."

For really quiet builds one can use the -s switch, but for someone who
wishes to see some kind of progress and also wants a cleaner terminal
output, the --no-print-directory switch alone is not enough.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com http://www.enterprisedb.com
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 8bfb77d..d0928d5 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -26,6 +26,7 @@ standard_always_targets = distprep clean distclean maintainer-clean
 
 # make `all' the default target
 all:
+	@true
 
 # Delete target files if the command fails after it has
 # started to update the file.
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 318cdbc..cbc6d55 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -124,6 +124,7 @@ submake-schemapg:
 
 # src/port needs a convenient way to force errcodes.h to get built
 submake-errcodes: $(top_builddir)/src/include/utils/errcodes.h
+	@true
 
 .PHONY: submake-schemapg submake-errcodes
 
diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile
index c4d3f3c..ee8fbcc 100644
--- a/src/backend/catalog/Makefile
+++ b/src/backend/catalog/Makefile
@@ -55,7 +55,9 @@ postgres.description: postgres.bki ;
 
 postgres.shdescription: postgres.bki ;
 
+.PHONY: schemapg.h
 schemapg.h: postgres.bki ;
+	@true
 
 # Technically, this should depend on Makefile.global, but then
 # postgres.bki would need to be rebuilt after every configure run,



Re: [HACKERS] Cleaner build output when not much has changed

2013-11-26 Thread Gurjeet Singh
On Tue, Nov 26, 2013 at 12:37 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Gurjeet Singh gurj...@singh.im writes:
  I was looking for ways to reduce the noise in Postgres make output,
  specifically, I wanted to eliminate the "Nothing to be done for `all'"
  messages, since they don't add much value, and just add to the clutter.

 Why don't you just use make -s if you don't want to see that?
 The example output you show is not much less verbose than before.


I have a shell function that now adds --no-print-directory to my make
command. This patch combined with that switch makes the output really clean
(at least from my perspective). Since the use of a command-line switch can
be easily left to personal choice, I am not proposing to add that or its
makefile-equivalent. But modifying the makefiles to suppress noise is not
something everyone can be expected to do, and do right.


 I'm pretty suspicious of cute changes like this to the makefiles.
 They too often have unexpected side-effects.  (I'm still pissed off
 about having to manually remove objfiles.txt to get it to rebuild a .o
 file, for instance.)


You mean the --enable-depend switch to ./configure is not sufficient to
force a rebuild on changed source code! With this switch, I have always
seen my builds do the right thing, whether I modify a .c file or a .h. I
personally have never been forced to remove objfiles.txt to solve a
build/make issue. (I primarily use VPATH builds, BTW).

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB www.EnterpriseDB.com http://www.enterprisedb.com


Re: [HACKERS] Shave a few instructions from child-process startup sequence

2013-11-26 Thread Gurjeet Singh
On Tue, Nov 26, 2013 at 9:42 PM, Peter Eisentraut pete...@gmx.net wrote:

 src/backend/postmaster/postmaster.c:2255: indent with spaces.
 +else
 src/backend/postmaster/postmaster.c:2267: indent with spaces.
 +break


Not sure how that happened! Attached is the updated patch.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EnterprsieDB Inc. www.enterprisedb.com
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ccb8b86..cdae6e5 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2236,6 +2236,20 @@ ClosePostmasterPorts(bool am_syslogger)
 			StreamClose(ListenSocket[i]);
 			ListenSocket[i] = PGINVALID_SOCKET;
 		}
+		else
+		{
+			/*
+			 * Do not process the rest of the array elements since we expect
+			 * the presence of an invalid socket id to mark the end of valid
+			 * elements.
+			 */
+#ifdef USE_ASSERT_CHECKING
+			int j;
+			for (j = i; j < MAXLISTEN; j++)
+				Assert(ListenSocket[j] == PGINVALID_SOCKET);
+#endif
+			break;
+		}
 	}
 
 	/* If using syslogger, close the read side of the pipe */



Re: [HACKERS] Shave a few instructions from child-process startup sequence

2013-11-22 Thread Gurjeet Singh
On Wed, Nov 20, 2013 at 10:21 AM, Peter Eisentraut pete...@gmx.net wrote:

 On 11/5/13, 2:47 AM, Gurjeet Singh wrote:
  On Mon, Nov 4, 2013 at 12:20 AM, Tom Lane t...@sss.pgh.pa.us
  mailto:t...@sss.pgh.pa.us wrote:
 
  But we're not buying much.  A few instructions during postmaster
  shutdown
  is entirely negligible.
 
 
  The patch is for ClosePostmasterPorts(), which is called from every
  child process startup sequence (as $subject also implies), not in
  postmaster shutdown. I hope that adds some weight to the argument.

 If there is a concern about future maintenance, you could add assertions
 (in appropriate compile mode) that the rest of the array is indeed
 PGINVALID_SOCKET.  I think that could be a win for both performance and
 maintainability.


Makes sense! Does the attached patch look like what you expected? I also
added a comment to explain the expectation.

Thanks and best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EDB Inc. www.EnterpriseDB.com http://www.enterprisedb.com
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ccb8b86..9efc9fa 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2236,6 +2236,20 @@ ClosePostmasterPorts(bool am_syslogger)
 			StreamClose(ListenSocket[i]);
 			ListenSocket[i] = PGINVALID_SOCKET;
 		}
+else
+		{
+			/*
+			 * Do not process the rest of the array elements since we expect
+			 * the presence of an invalid socket id to mark the end of valid
+			 * elements.
+			 */
+#ifdef USE_ASSERT_CHECKING
+			int j;
+			for (j = i; j < MAXLISTEN; j++)
+				Assert(ListenSocket[j] == PGINVALID_SOCKET);
+#endif
+break;
+		}
 	}
 
 	/* If using syslogger, close the read side of the pipe */



Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol

2013-11-20 Thread Gurjeet Singh
On Wed, Nov 20, 2013 at 3:44 PM, Tom Lane t...@sss.pgh.pa.us wrote:


 To my mind, the create a socket and hope nobody else can get to it
 approach is exactly one of the main things we're trying to avoid here.
 If you'll recall, awhile back we had a big discussion about how pg_upgrade
 could positively guarantee that nobody messed with the source database
 while it was working, and we still don't have a bulletproof guarantee
 there.  I would like to fix that by making pg_upgrade use only standalone
 backends to talk to the source database, never starting a real postmaster
 at all.  But if the standalone-pg_dump mode goes through a socket, we're
 back to square one on that concern.


(I couldn't find the pg_upgrade-related thread mentioned above).

I am not sure of the mechanics of this, but can we not launch the
postmaster with a random magic-cookie, and use that cookie while initiating
the connection from libpq? The postmaster will then reject any connections
that don't provide the cookie.

We do something similar to enable applications to send cancellation signals
(postmaster.c:Backend.cancel_key), just that it's establishing trust in the
opposite direction.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/

EnterprsieDB Inc. www.enterprisedb.com


Re: [HACKERS] Shave a few instructions from child-process startup sequence

2013-11-04 Thread Gurjeet Singh
On Mon, Nov 4, 2013 at 12:20 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 But we're not buying much.  A few instructions during postmaster shutdown
 is entirely negligible.


The patch is for ClosePostmasterPorts(), which is called from every child
process startup sequence (as $subject also implies), not in postmaster
shutdown. I hope that adds some weight to the argument.

Best regards,
-- 
Gurjeet Singh   gurjeet.singh.im
EnterpriseDB Inc. www.enterprisedb.com


Re: [HACKERS] Shave a few instructions from child-process startup sequence

2013-11-01 Thread Gurjeet Singh
On Thu, Oct 31, 2013 at 11:20 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Amit Kapila amit.kapil...@gmail.com writes:
  On Thu, Oct 31, 2013 at 2:41 AM, Gurjeet Singh gurj...@singh.im wrote:
  Just a small patch; hopefully useful.

  This is valid saving as we are filling array ListenSocket[] in
  StreamServerPort() serially, so during ClosePostmasterPorts() once if
  it encountered PGINVALID_SOCKET, it is valid to break the loop.
  Although savings are small considering this doesn't occur in any
  performance path, still I think this is right thing to do in code.

  It is better to register this patch in CF app list, unless someone
  feels this is not right.

 I think this is adding fragility for absolutely no meaningful savings.
 The existing code does not depend on the assumption that the array
 is filled consecutively and no entries are closed early.  Why should
 we add such an assumption here?


IMHO, typical configurations have one, or maybe two of these array elements
populated; one for TCP 5432 (for localhost or *), and the other for Unix
Domain Sockets. Saving 62 iterations and comparisons in the startup sequence
may not be much, but IMHO it does match logic elsewhere.

In fact, this was inspired by the following code in ServerLoop():

ServerLoop()
{
...
/*
 * New connection pending on any of our sockets? If so, fork a child
 * process to deal with it.
 */
		if (selres > 0)
		{
			int			i;

			for (i = 0; i < MAXLISTEN; i++)
			{
				if (ListenSocket[i] == PGINVALID_SOCKET)
					break;
				if (FD_ISSET(ListenSocket[i], &rmask))
				{


And looking for other precedents, I found that the initMasks() function also
relies on the array being filled consecutively, with the first
PGINVALID_SOCKET marking the end of the valid array members.
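
For reference, initMasks() is roughly the shape below (a paraphrase, assuming
it sits in postmaster.c where ListenSocket[], MAXLISTEN, and the fd_set
machinery are already in scope); it too stops at the first PGINVALID_SOCKET:

/*
 * Rough paraphrase of initMasks() in postmaster.c: build the select() mask
 * from ListenSocket[], stopping at the first PGINVALID_SOCKET.
 */
static int
init_masks_sketch(fd_set *rmask)
{
    int         maxsock = -1;
    int         i;

    FD_ZERO(rmask);

    for (i = 0; i < MAXLISTEN; i++)
    {
        int         fd = ListenSocket[i];

        if (fd == PGINVALID_SOCKET)
            break;
        FD_SET(fd, rmask);
        if (fd > maxsock)
            maxsock = fd;
    }

    return maxsock + 1;
}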

Best regards,
-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


[HACKERS] Shave a few instructions from child-process startup sequence

2013-10-30 Thread Gurjeet Singh
Just a small patch; hopefully useful.

Best regards,
-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index ccb8b86..48dc7af 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2236,6 +2236,8 @@ ClosePostmasterPorts(bool am_syslogger)
 			StreamClose(ListenSocket[i]);
 			ListenSocket[i] = PGINVALID_SOCKET;
 		}
+else
+break;
 	}
 
 	/* If using syslogger, close the read side of the pipe */



Re: [HACKERS] review: Non-recursive processing of AND/OR lists

2013-07-18 Thread Gurjeet Singh
On Thu, Jul 18, 2013 at 10:19 AM, Tom Lane t...@sss.pgh.pa.us wrote:


  Because simpler code is less likely to have bugs and is easier to
  maintain.

 I agree with that point, but one should also remember Polya's Inventor's
 Paradox: the more general problem may be easier to solve.  That is, if
 done right, code that fully flattens an AND tree might actually be
 simpler than code that does just a subset of the transformation.  The
 current patch fails to meet this expectation,


The current patch does completely flatten any type of tree (left/right-deep
or bushy) without recursing, and right-deep and bushy tree processing is
what Robert is recommending to defer to recursive processing. Maybe I
haven't considered a case where it doesn't flatten the tree; do you have an
example in mind?


 but maybe you just haven't
 thought about it the right way.

 My concerns about this patch have little to do with that, though, and
 much more to do with the likelihood that it breaks some other piece of
 code that is expecting AND/OR to be strictly binary operators, which
 is what they've always been in parsetrees that haven't reached the
 planner.  It doesn't appear to me that you've done any research on that
 point whatsoever


No, I haven't, and I might not be able to research it for a few more weeks.


 you have not even updated the comment for BoolExpr
 (in primnodes.h) that this patch falsifies.


I will fix that.

Best regards,
-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


Re: [HACKERS] review: Non-recursive processing of AND/OR lists

2013-07-17 Thread Gurjeet Singh
On Tue, Jul 16, 2013 at 4:04 PM, Pavel Stehule pavel.steh...@gmail.comwrote:

 I did a some performance tests of v5 and v6 version and there v5 is
 little bit faster than v6, and v6 has significantly higher stddev


Thanks Pavel.

The difference in average seems negligible, but stddev is interesting
because v6 does less work than v5 in common cases and in the test that I
had shared.

The current commitfest (2013-06) is marked as 'In Progress', so is it okay
to just mark the patch as 'Ready for Committer', or should I move it to the
next commitfest (2013-09)?

What's the procedure for moving a patch to the next commitfest? Do I make a
fresh submission there with a link to the current submission, or is the move
doable somehow in the application itself?

Best regards,
-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


Re: [HACKERS] review: Non-recursive processing of AND/OR lists

2013-07-17 Thread Gurjeet Singh
On Wed, Jul 17, 2013 at 8:21 AM, Gurjeet Singh gurj...@singh.im wrote:


 What's the procedure of moving a patch to the next commitfest?


Never mind, I see an email from Josh B. regarding this on my corporate
account.

Best regards,
-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


Re: [HACKERS] review: Non-recursive processing of AND/OR lists

2013-07-17 Thread Gurjeet Singh
On Wed, Jul 17, 2013 at 1:25 PM, Robert Haas robertmh...@gmail.com wrote:

 On Mon, Jul 15, 2013 at 12:45 AM, Gurjeet Singh gurj...@singh.im wrote:
  Agreed that there's overhead in allocating list items, but is it more
  overhead than pushing functions on the call stack? Not sure, so I leave
 it
  to others who understand such things better than I do.

 If you think that a palloc can ever be cheaper that pushing a frame on
 the callstack, you're wrong.  palloc is not some kind of an atomic
 primitive.  It's implemented by the AllocSetAlloc function, and you're
 going to have to push that function on the call stack, too, in order
 to run it.


Agreed. I take my objection back. Even if AllocSetAlloc() reuses memory
that was pfree'd earlier, it'll still be at least as expensive as recursing.



 My main point here is that if the user writes a = 1 and b = 1 and c =
 1 and d = 1, they're not going to end up with a bushy tree.  They're
 going to end up with a tree that's only deep in one direction (left, I
 guess) and that's the case we might want to consider optimizing.  To
 obtain a bushy tree, they're going to have to write a  = 1 and (b = 1
 and c = 1) and d = 1, or something like that, and I don't see why we
 should stress out about that case.  It will be rare in practice.


In v6 of the patch, I have deferred the 'pending' list initialization until
we actually hit a candidate right branch. So in the common case the pending
list will never be populated, and if we do find a bushy or right-deep tree
(an ORM/tool may, for some reason, build AND/OR lists that end up being
right-deep in Postgres), then the pending list will be used to process it
iteratively.

Does that alleviate your concern about 'pending' list management causing
overhead?

Agreed that bushy/right-deep trees are a remote corner case, but we are
addressing a remote corner case in the first place (insanely long AND
lists), so why not handle another remote corner case right now if it
doesn't add overhead for the common case?

Best regards,
-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


Re: [HACKERS] review: Non-recursive processing of AND/OR lists

2013-07-14 Thread Gurjeet Singh
On Sun, Jul 14, 2013 at 8:27 PM, Robert Haas robertmh...@gmail.com wrote:

 On Wed, Jul 10, 2013 at 9:02 PM, Josh Berkus j...@agliodbs.com wrote:
  I think it's a waste of code to try to handle bushy trees.  A list is
  not a particularly efficient representation of the pending list; this
  will probably be slower than recusing in the common case.  I'd suggest
  keeping the logic to handle left-deep trees, which I find rather
  elegant, but ditching the pending list.


Somehow I find it hard to believe that recursing would be more efficient
than processing the items right there. The recursion is not direct either;
transformExprRecurse() is going to call this function again, but after a
few more switch-case comparisons.

Agreed that there's overhead in allocating list items, but is it more
overhead than pushing functions on the call stack? Not sure, so I leave it
to others who understand such things better than I do.

If by common-case you mean a list of just one logical AND/OR operator, then
I agree that creating and destroying a list may incur overhead that is
relatively very expensive. To that end, I have altered the patch, attached,
to not build a pending list until we encounter a node with root_expr_kind
in a right branch.

We're getting bushy-tree processing with very little extra code, but if you
deem it not worthwhile, or think it adds complexity, please feel free to rip
it out.


  
  Is there going to be further discussion of this patch, or do I return it?

 Considering it's not been updated, nor my comments responded to, in
 almost two weeks, I think we return it at this point.


Sorry, I didn't notice that this patch was put back in 'Waiting on Author'
state.

Best regards,
-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


non_recursive_and_or_transformation_v6.patch
Description: Binary data



Re: [HACKERS] review: Non-recursive processing of AND/OR lists

2013-06-30 Thread Gurjeet Singh
On Tue, Jun 18, 2013 at 3:01 PM, Pavel Stehule pavel.steh...@gmail.comwrote:


 related to

 https://commitfest.postgresql.org/action/patch_view?id=1130

 http://www.postgresql.org/message-id/cabwtf4v9rsjibwe+87pk83mmm7acdrg7sz08rq-4qyme8jv...@mail.gmail.com


 * motivation: remove recursive procession of AND/OR list (hangs with
 10062 and more subexpressions)

 * patch is short, clean and respect postgresql source code requirements
 * patch was applied cleanly without warnings
 * all regression tests was passed
 * I successfully evaluated expression with 10 subexpressions
 * there is no significant slowdown

 possible improvements

 a = (A_Expr*) list_nth(pending, 0);

 a = (A_Expr*) linitial(pending);


I made that change, hesitantly. The comments above the definition of the
linitial() macro describe the confusion that API causes. I wanted to avoid
that confusion for new code, so I used the newer API, which makes the
intention quite clear. But looking at that code closely, list_nth() causes
at least 2 function calls, and that's pretty heavy compared to the
linitial() macro.



 not well comment

 should be -- If the right branch is also an SAME condition, append it to
 the


I moved that comment above the outer block, so that the intention of the
whole do-while code block is described in one place.

 I don't see any other issues, so after fixing comments this patch is
 ready for commit


Thanks for the review Pavel.

Attached is the updated patch, v4. It has the above edits, and a few code
improvements, like not repeating the (root_kind == AEXPR_AND ? .. : ..)
ternary expression.

Best regards,
-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


non_recursive_and_or_transformation_v4.patch
Description: Binary data



Re: [HACKERS] review: Non-recursive processing of AND/OR lists

2013-06-30 Thread Gurjeet Singh
On Sun, Jun 30, 2013 at 11:13 AM, Pavel Stehule pavel.steh...@gmail.comwrote:

 Hello

 just one small notices

 I dislike a name root_bool_expr, because, there is not a expression,
 but expression type. Can you use root_bool_expr_type instead? It is
 little bit longer, but more correct. Same not best name is
 root_char, maybe root_bool_op_name

 or root_expr_type and root_op_name ???


How about naming those 3 variables as follows:

root_expr_kind
root_expr_name
root_bool_expr_type


-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


Re: [HACKERS] review: Non-recursive processing of AND/OR lists

2013-06-30 Thread Gurjeet Singh
On Sun, Jun 30, 2013 at 11:46 AM, Pavel Stehule pavel.steh...@gmail.comwrote:

 2013/6/30 Gurjeet Singh gurj...@singh.im:
  On Sun, Jun 30, 2013 at 11:13 AM, Pavel Stehule pavel.steh...@gmail.com
 
  wrote:
 
  How about naming those 3 variables as follows:
 
  root_expr_kind
  root_expr_name
  root_bool_expr_type

 +1


Thanks. Attached is the patch with that change. I'll update the commitfest
entry with a link to this email.

-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


[HACKERS] Fwd: review: Non-recursive processing of AND/OR lists

2013-06-30 Thread Gurjeet Singh
On Sun, Jun 30, 2013 at 11:46 AM, Pavel Stehule pavel.steh...@gmail.comwrote:

 2013/6/30 Gurjeet Singh gurj...@singh.im:
  On Sun, Jun 30, 2013 at 11:13 AM, Pavel Stehule pavel.steh...@gmail.com
 
  wrote:
 
  How about naming those 3 variables as follows:
 
  root_expr_kind
  root_expr_name
  root_bool_expr_type

 +1


Thanks. Attached is the patch with that change. I'll update the commitfest
entry with a link to this email.

-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.



-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


non_recursive_and_or_transformation_v5.patch
Description: Binary data



Re: [HACKERS] Config reload/restart preview

2013-06-20 Thread Gurjeet Singh
On Thu, Jun 20, 2013 at 9:33 AM, Magnus Hagander mag...@hagander.netwrote:

 On Thu, Jun 20, 2013 at 2:54 PM, Dimitri Fontaine
 dimi...@2ndquadrant.fr wrote:
  Magnus Hagander mag...@hagander.net writes:
  Should we have a way of previewing changes that would be applied if we
  reloaded/restarted the server?
 
  Yes, we should.
 
  +1
 
  This would go well with something I started working on some time ago
  (but haven't actually gotten far on at all), which is the ability for
  pg_ctl to be able to give feedback at all. Meaning a pg_ctl reload
  should also be able to tell you which parameters were changed, without
  having to go to the log. Obviously that's almost exactly the same
  feature.
 
  It could probably connect to the server and issue the SQL command to
  reload, and that one could probably get enhanced to return modified
  variable as NOTICE, or be run with the right client_min_message:
 
  SELECT pg_reload_conf();
 
  The pg_ctl client would then have to know to display the messages sent
  back by the server.

 The problem with that is that now you must *always* have your system
 set up to allow the postgres user to log in in pg_hba.conf or it
 fails.

 But yes, one option would be to use SQL instead of opening a socket.
 Maybe that's a better idea - have pg_ctl try to use that if available,
 and if not send a regular signal and not try to collect the output.


I started working on it yesterday after Thom proposed this idea internally
at EDB. The discussion until now doesn't seem to be hostile to a SQL
interface, so attached is a hack attempt, which hopefully will serve at
least as a POC. A sample session is shown below, while changing a few
values in postgresql.conf files.

The central idea is to use the SIGHUP processing function to do the work
for us and report potential changes via DEBUG2, instead of having to write
a new parsing engine. The (GUC-nesting + PGC_S_TEST) combination is nice to
have since it prevents the current session from adopting the values that
are different in the conf file. This approach is susceptible to the fact
that the connected superuser may have its GUC values picked up from
user/database/session level settings (ALTER USER/DATABASE .. SET ; or SET
param TO val;).
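
For readers who only want the shape of the SQL entry point, a loose skeleton
is below. This is not the attached patch: the report-at-DEBUG2-without-applying
behaviour relies on changes inside the GUC machinery (the PGC_S_TEST part)
that are not reproduced here; the skeleton only shows the wrapper function
and the GUC nest level around the config re-read:

#include "postgres.h"
#include "fmgr.h"
#include "utils/guc.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(pg_test_reload_conf);

Datum
pg_test_reload_conf(PG_FUNCTION_ARGS)
{
    int         save_nestlevel = NewGUCNestLevel();

    /* Re-read the configuration files as if a SIGHUP had just arrived. */
    ProcessConfigFile(PGC_SIGHUP);

    /* Pop the nest level, discarding what the re-read set in this session. */
    AtEOXact_GUC(false, save_nestlevel);

    PG_RETURN_BOOL(true);
}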

$ pgsql
Expanded display is used automatically.
psql (9.4devel)
Type "help" for help.

postgres=# show work_mem;
 work_mem
--
 1MB
(1 row)

postgres=# set client_min_messages = debug2;
SET
postgres=# select pg_test_reload_conf();
DEBUG:  parameter "work_mem" changed to "70MB"
 pg_test_reload_conf
-
 t
(1 row)

postgres=# show work_mem;
 work_mem
--
 1MB
(1 row)

postgres=# select pg_test_reload_conf();
DEBUG:  parameter "shared_buffers" cannot be changed without restarting the
server
DEBUG:  configuration file
"/home/gurjeet/dev/pgdbuilds/report_guc_chanege_pre_reload/db/data/postgresql.conf"
contains errors; unaffected changes were applied
 pg_test_reload_conf
-
 t
(1 row)

postgres=# select pg_test_reload_conf();
DEBUG:  parameter "log_min_messages" removed from configuration file, reset
to default
 pg_test_reload_conf
-
 t
(1 row)

Best regards,
-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


report_conf_changes_without_applying.patch
Description: Binary data



Re: [HACKERS] review: Non-recursive processing of AND/OR lists

2013-06-18 Thread Gurjeet Singh
Thanks for the review Pavel.

On Tue, Jun 18, 2013 at 3:01 PM, Pavel Stehule pavel.steh...@gmail.comwrote:

 Hello

 related to

 https://commitfest.postgresql.org/action/patch_view?id=1130

 http://www.postgresql.org/message-id/cabwtf4v9rsjibwe+87pk83mmm7acdrg7sz08rq-4qyme8jv...@mail.gmail.com


 * motivation: remove recursive procession of AND/OR list (hangs with
 10062 and more subexpressions)

 * patch is short, clean and respect postgresql source code requirements
 * patch was applied cleanly without warnings
 * all regression tests was passed
 * I successfully evaluated expression with 10 subexpressions
 * there is no significant slowdown

 possible improvements

 a = (A_Expr*) list_nth(pending, 0);

 a = (A_Expr*) linitial(pending);

 not well comment

 should be -- If the right branch is also an SAME condition, append it to
 the

 +   /*
 +* If the right branch is also an AND condition,
 append it to the
 +* pending list, to be processed later. This
 allows us to walk even
 +* bushy trees, not just left-deep trees.
 +*/
 +   if (IsA(a->rexpr, A_Expr) &&
 ((A_Expr*)a->rexpr)->kind == root_kind)
 +   {
 +   pending = lappend(pending, a->rexpr);
 +   }
 +   else
 +   {
 +   expr = transformExprRecurse(pstate,
 a->rexpr);
 +   expr = coerce_to_boolean(pstate, expr,
 root_kind == AEXPR_AND ?
 "AND" : "OR");
 +   exprs = lcons(expr, exprs);
 +   }

 I don't see any other issues, so after fixing comments this patch is
 ready for commit

 Regards

 Pavel Stehule




-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


Re: [HACKERS] Processing long AND/OR lists

2013-06-06 Thread Gurjeet Singh
On Mon, May 27, 2013 at 10:32 AM, Christopher Browne cbbro...@gmail.comwrote:

 On Mon, May 27, 2013 at 1:42 AM, Gurjeet Singh gurj...@singh.im wrote:



 Joking about 640K aside, it doesn't seem reasonable to expect a truly
 enormous query as is generated by the broken forms of this logic to turn
 out happily.  I'd rather fix Slony (as done in the above patch).


 Yes, by all means, fix the application, but that doesn't preclude the
 argument that the database should be a bit smarter and more efficient,
 especially if it is easy to do.


 Agreed, it seems like a fine idea to have the database support such
 queries, as this eases coping with applications that might be more
 difficult to get fixed.


Seeing no more objections to it, I am going to add this patch to the
commitfest. Attached is the updated patch against latest master; it's the
same as the previous version, except that the patch now includes a fix for
the failing test case as well.

Best regards,
-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


non_recursive_and_or_transformation_v3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pgbench: introduce a new automatic variable 'client_number'

2013-06-05 Thread Gurjeet Singh
Please find attached a patch for pgbench that introduces a new
auto-variable 'client_number'. Following in the footsteps of the 'scale'
auto-variable, this is not declared if the user has specified this variable
using the -D switch.

Since 'clientid' is a very common name a user might use for their own
script's variables, I chose to call this auto-variable client_number, just
to avoid conflicts.

This variable can come in handy when you want to use a different expression
in a query depending on which client is executing it. An example custom
transaction is attached, where the UPDATE statement from any given client
always updates the same logical row.

Best regards,
-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


pgbench_add_cleint_number_variable.patch
Description: Binary data


test_update.sql
Description: Binary data



Re: [HACKERS] Processing long AND/OR lists

2013-05-26 Thread Gurjeet Singh
On Sun, May 26, 2013 at 11:46 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Josh Berkus j...@agliodbs.com writes:
  ***15,000***?  I'd say that someone has an application design issue.
  Fixing the stack overflow is a good thing, but that query is never going
  to return ...


Just for the record, it does finish in 5 sec on my laptop. But yes, point
taken.



 Yeah, the parser's stack consumption seems like only the tip of the
 iceberg here.


True. After getting past the parser, the bigger lists consume exponentially
longer times around process_equivalence(). And BTW, a backend stuck in that
stack does not respond to pg_cancel/terminate_backend()! Should there be a
CHECK_FOR_INTERRUPTS() in process_equivalence(), perhaps invoked only every
N calls of that function?
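
A minimal sketch of that suggestion is below; the interval of 1000 is an
arbitrary placeholder, and whether the check belongs in process_equivalence()
itself or in its caller is left open:

#include "postgres.h"
#include "miscadmin.h"

/*
 * Respond to cancel/terminate requests without paying for a check on
 * every call: only every Nth invocation reaches CHECK_FOR_INTERRUPTS().
 */
static void
check_for_interrupts_every_n(void)
{
    static uint32 ncalls = 0;

    if ((++ncalls % 1000) == 0)
        CHECK_FOR_INTERRUPTS();
}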



 I find it hard to visualize a use-case for so many AND'ed conditions
 anyway.  I could believe a lot of OR'd conditions, but very likely it
 would be a list that could and should be converted to an IN condition.


As noted earlier, this was seen in the real world, at a customer setup,
generated by Slony, a highly regarded community project. Also, the patch
addresses the stack limit for long OR clauses as well.

Anytime such conditions are generated programmatically, there's always a
possibility that the programmer underestimated the amount of data they are
generating the query for, just as happened in the Slony case I quoted.

Slony has a compress_actionseq() function to scan a list of numbers and
generate the smallest possible WHERE clause. If it sees consecutive numbers
that form a range, it turns that range into a BETWEEN clause, and the
numbers that don't seem to be in a range are added to the WHERE clause as
'AND col <> n1 AND col <> n2 ...'. It's quite likely that it generates such
long lists because it wrongly assumes that the input list is ordered, or at
least mostly ordered.

Agreed that that function can/should be fixed to do a better job of sorting
these numbers before scanning, and also using the NOT IN construct. But the
point remains that Postgres should be capable of handling this simple
construct efficiently.

Best regards,
-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


Re: [HACKERS] Processing long AND/OR lists

2013-05-26 Thread Gurjeet Singh
My last email was written before reading this. A few episodes of 24
occurred between writing and sending that email.

Added slony1-hackers, but didn't remove pgsql-hackers. Feel free to exclude
pgsql lists, as this branch of the conversation seems to be more
Slony-related than Postgres-related.

On Sun, May 26, 2013 at 10:59 PM, Christopher Browne cbbro...@gmail.comwrote:


 In Slony 2.1, the issue re-emerged because the ordering of the action id
 values was lost; the query had previously been implicitly forcing them into
 order; we had to add an ORDER BY clause, to make the compressor work
 again.

 http://git.postgresql.org/gitweb/?p=slony1-engine.git;a=blobdiff;f=src/slon/remote_worker.c;h=b1f48043f8e25b4a74a392b0dbceeae8f3e18c27;hp=7fbf67c16f97cb7c3f209cf3be903ea52c4490a9;hb=c4ac435308a78a2db63bf267d401d842c169e87d;hpb=d4612aab78bac5a9836e3e2425c403878f7091c8


Commit log says it was fixed between  2.1.2, but from the Slony logs at the
time, the version in use was 2.1.2. So


 Joking about 640K aside, it doesn't seem reasonable to expect a truly
 enormous query as is generated by the broken forms of this logic to turn
 out happily.  I'd rather fix Slony (as done in the above patch).


Yes, by all means, fix the application, but that doesn't preclude the
argument that the database should be a bit smarter and more efficient,
especially if it is easy to do.

Best regards,
-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


[HACKERS] Processing long AND/OR lists

2013-05-25 Thread Gurjeet Singh
When Postgres encounters a long list of AND/OR chains, it errors out at
check_stack_depth() after a limit of a few thousand. At around 10,000
elements, the recursion at assign_expr_collations() causes the error. But
at a little higher element count, around 15,000, the recursion check errors
out a little earlier, in the stack around transformAExprAnd(). The test
queries were generated using the attached test.sh script.

This is not a synthetic test. Recently I saw a report where Slony generated
a huge list of 'log_actionseq <> nnn and log_actionseq <> nnn and ...' for
a SYNC event, and that SYNC event could not complete due to this error.
Granted that Slony can be fixed by making it generate the condition as
'log_actionseq not in (nnn, nnn)' which is processed non-recursively, but
IMHO Postgres should be capable of handling this trivial construct, however
long it may be (of course, limited by memory).

To that end, I wrote the attached patch, and although I seem to be playing
by the rules, `make check` fails. Some diffs look benign, but others not so
much.

AIUI, a BoolExpr is perfectly capable of holding multiple expressions as a
list. And since the SQL standard does not say anything about short-circuit
evaluation, the evaluation order of these expressions should not affect the
results. And yet, clearly, turning a tree of boolean expressions into a list
is not as subtle a change as it looks. I suspect that this patch is messing
up the join conditions.

I see two ways to fix it:
1) Fix the order of expressions in BoolExpr such that the order of
evaluation remains unchanged compared to before the patch.
I expect this to take care of the benign diffs. But I doubt this will
address the other diffs.

2) Construct the same tree as before the patch, but do it
non-recursively.
This is, I guess, the right thing to do, but then we'll have to make
similar tree-traversal changes in other places, e.g. in the BoolExpr walking
in expression_tree_walker(), or maybe just in the stack below
assign_expr_collations().

Best regards,
-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


non_recursive_and_or_transformation.patch
Description: Binary data


test.sh
Description: Bourne shell script



Re: [HACKERS] Processing long AND/OR lists

2013-05-25 Thread Gurjeet Singh
On Sat, May 25, 2013 at 9:56 AM, Gurjeet Singh gurj...@singh.im wrote:

 When Postgres encounters a long list of AND/OR chains, it errors out at
 check_stack_depth() after a limit of few thousand. At around 10,000
 elements, the recursion at assign_expr_collations() causes the error. But
 at a little higher element count, around 15,000, the recursion check errors
 out a little earlier, in the stack around transformAExprAnd(). The test
 queries were generated using the attached test.sh script.

 This is not a synthetic test. Recently I saw a report where Slony
 generated a huge list of 'log_actionseq <> nnn and log_actionseq <> nnn and
 ...' for a SYNC event, and that SYNC event could not complete due to this
 error. Granted that Slony can be fixed by making it generate the condition
 as 'log_actionseq not in (nnn, nnn)' which is processed non-recursively,
 but IMHO Postgres should be capable of handling this trivial construct,
 however long it may be (of course, limited by memory).

 To that end, I wrote the attached patch, and although I seem to be playing
 by the rules, `make check` fails. Some diffs look benign, but others not so
 much.

 AIUI, a BoolExpr is perfectly capable of holding multiple expressions as a
 list. And since SQL standard does not say anything about short-circuit
 evaluation, the evaluation order of these expressions should not affect the
 results. So clearly turning a tree of booleans expressions into a list is
 not so subtle a change. I suspect that this patch is messing up the join
 conditions.

 I see two ways to fix it:
 1) Fix the order of expressions in BoolExpr such that the order of
 evaluation remains unchanged compared to before the patch.
 I expect this to take care of the benign diffs. But I doubt this will
 address the other diffs.

 2) Construct a tree same as done before the patch, but do it
 non-recursively.
 This is I guess the right thing to do, but then we'll have to make
 similar tree-traversal changes in other places, for eg. in BoolExpr walking
 in expression_tree_walker() or maybe just the stack below
 assign_expr_collations().


As suspected, the problem was that a JOIN's USING clause processing cooks
up its own join conditions, and those were the ones that couldn't cope with
the changed order of output.

Fixing that order of arguments of the BoolExpr also fixed all the JOIN
related diffs. Now `make check` passes except for this benign diff.

  |   WHERE (((rsl.sl_color = rsh.slcolor)
AND (rsl.sl_len_cm >= rsh.slminlen_cm)) AND (rsl.sl_len_cm <=
rsh.slmaxlen_cm));
  |   WHERE ((rsl.sl_color = rsh.slcolor)
AND (rsl.sl_len_cm >= rsh.slminlen_cm) AND (rsl.sl_len_cm <=
rsh.slmaxlen_cm));

That's expected, since for cases like these, the patch converts what used
to be a tree of 2-argument BoolExprs into a single BoolExpr with many
arguments.
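
For anyone curious, the flattening boils down to something like the sketch
below. This is illustrative code only, not the exact contents of the attached
patch; it assumes the function lives in src/backend/parser/parse_expr.c next
to helpers like transformExprRecurse() and coerce_to_boolean(), and it shows
only the AND case (OR is symmetrical).

/*
 * Illustrative sketch: flatten a left-deep chain of AND A_Exprs
 * iteratively, so parser stack depth no longer grows with the length
 * of the chain.  The result is a single N-argument BoolExpr whose
 * arguments keep their original left-to-right order.
 */
static Node *
flatten_and_chain(ParseState *pstate, A_Expr *a)
{
    List       *args = NIL;
    Node       *node = (Node *) a;
    Node       *arg;

    /* Walk down the left spine without recursing once per AND node. */
    while (node != NULL &&
           IsA(node, A_Expr) &&
           ((A_Expr *) node)->kind == AEXPR_AND)
    {
        A_Expr     *cur = (A_Expr *) node;

        arg = transformExprRecurse(pstate, cur->rexpr);
        arg = coerce_to_boolean(pstate, arg, "AND");
        args = lcons(arg, args);        /* prepend, preserving original order */
        node = cur->lexpr;
    }

    /* 'node' is now the leftmost operand of the whole chain. */
    arg = transformExprRecurse(pstate, node);
    arg = coerce_to_boolean(pstate, arg, "AND");
    args = lcons(arg, args);

    return (Node *) makeBoolExpr(AND_EXPR, args, a->location);
}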

-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


non_recursive_and_or_transformation_v2.patch
Description: Binary data



Re: [HACKERS] Patch to make pgindent work cleanly

2013-04-12 Thread Gurjeet Singh
On Fri, Apr 12, 2013 at 11:44 AM, Bruce Momjian br...@momjian.us wrote:

 On Tue, Feb 19, 2013 at 04:50:45PM -0500, Gurjeet Singh wrote:
  Please find attached the patch for some cleanup and fix bit rot in
 pgindent
  script.
 
  There were a few problems with the script.
 
  .) It failed to use the $ENV{PGENTAB} even if it was set.
  .) The file it tries to download from Postgres' ftp site is no longer
 present.
  ftp://ftp.postgresql.org/pub/dev/indent.netbsd.patched.tgz
  .) The tar command extracted the above-mentioned file to a child
 directory, but
  did not descend into it before running make on it.
  I think it expected a tarbomb, but clearly the current .tgz file on
 ftp
  site is not a tarbomb.
 
  I don't like the fact that it dies with a message "fetching xyz" rather
 than
  saying "Could not fetch xyz", but this patch does not address that since
 it
  doesn't really affect the output when script does work.
 
  With this patch in place I could very easily run it on any arbitrary
 file,
  which seemed like a black-magic before the patch.
 
  src/tools/pgindent/pgindent --build <your file path here>

 I have applied this patch.  However, I modified the tarball name to
 reference $INDENT_VERSION, rather than hard-coding 1.2.


Thanks!

Can you also improve the output when it dies upon failure to fetch
something? Currently the only error message it emits is "fetching xyz",
which leaves the user confused as to what the problem really was. The only
indication of a problem might be the exit code, but I'm not sure of that,
and it doesn't help an interactive user running the script in a terminal.

-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


Re: [HACKERS] Patch to make pgindent work cleanly

2013-04-12 Thread Gurjeet Singh
On Fri, Apr 12, 2013 at 3:26 PM, Bruce Momjian br...@momjian.us wrote:

 On Fri, Apr 12, 2013 at 01:34:49PM -0400, Gurjeet Singh wrote:
  Can you also improve the output when it dies upon failure to fetch
 something?
  Currently the only error message it emits is fetching xyz, and leaves
 the
  user confused as to what really the problem was. The only indication of a
  problem might be the exit code,  but I'm not sure of that, and that
 doesn't
  help the interactive user running it on terminal.

 Good point.  I have reviewed all the error messages and improved them
 with the attached, applied patch, e.g.:

 cannot locate typedefs file xyz at /usr/local/bin/pgindent line
 121.


Thanks!

-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


Re: [HACKERS] [DOCS] synchronize_seqscans' description is a bit misleading

2013-04-11 Thread Gurjeet Singh
On Wed, Apr 10, 2013 at 11:56 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Gurjeet Singh gurj...@singh.im writes:
  So, again, it is not guaranteed that all the scans on a relation will
  synchronize with each other. Hence my proposal to include the term
  'probability' in the definition.

 Yeah, it's definitely not guaranteed in any sense.  But I don't really
 think your proposed wording is an improvement.  The existing wording
 isn't promising guaranteed sync either, to my eyes.


Given Postgres' track record of delivering what it promises, I expect
casual readers to take that phrase as a definitive guide to what is
happening internally.



 Perhaps we could compromise on, say, changing "so that concurrent scans
 read the same block at about the same time" to "so that concurrent scans
 tend to read the same block at about the same time",


Given that, on first read, the word "about" did not deter me from assuming
the best, I don't think adding "tend" would make much difference to a
reader's (mis)understanding. Perhaps we can spare a few more words to make
it clearer.


 or something like
 that.  I don't mind making it sound a bit more uncertain, but I don't
 think that we need to emphasize the probability of failure.


I agree we don't want to stress the failure case too much, especially since
even then performance is no worse than in the absence of the feature. But we
don't want the reader to get the wrong idea either.

In addition to the slight doc improvement being suggested, perhaps a
wiki.postgresql.org entry would allow us to explain the behaviour in more
detail.

-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


[HACKERS] synchronize_seqscans' description is a bit misleading

2013-04-10 Thread Gurjeet Singh
If I'm reading the code right [1], this GUC does not actually *synchronize*
the scans, but instead just makes sure that a new scan starts from a block
that was reported by some other backend performing a scan on the same
relation.
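
To make that concrete, here is a simplified sketch of the mechanism in
src/backend/access/heap (not the actual code). ss_get_location() and
ss_report_location() are the real syncscan helpers, declared in
access/heapam.h in this era; each scan also calls ss_report_location() as it
reads pages, which is what gives a newly started scan a useful hint.

#include "postgres.h"
#include "access/heapam.h"      /* ss_get_location(), ss_report_location() */

/*
 * Simplified sketch: all the GUC really influences is where a new
 * sequential scan starts.  "allow_sync" is true only for tables larger
 * than a threshold and only when synchronize_seqscans is on.
 */
static BlockNumber
choose_scan_start(Relation rel, BlockNumber nblocks, bool allow_sync)
{
    if (allow_sync && nblocks > 0)
        return ss_get_location(rel, nblocks);   /* hint from an ongoing scan */

    return 0;                                   /* conventional start of table */
}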

Since the backends scanning the relation may be processing the relation at
different speeds, even though each one took the hint when starting the
scan, they may end up being out of sync with each other. Even in a single
query, there may be different scan nodes scanning different parts of the
same relation, and even they don't synchronize with each other (and for
good reason).

Imagining that all scans on a table are always synchronized may make some
users wrongly believe that adding more backends scanning the same table will
not
incur any extra I/O; that is, only one stream of blocks will be read from
disk no matter how many backends you add to the mix. I noticed this when I
was creating partition tables, and each of those was a CREATE TABLE AS
SELECT FROM original_table (to avoid WAL generation), and running more than
3 such transactions caused the disk read throughput to behave unpredictably,
sometimes even dipping below 1 MB/s for a few seconds at a stretch.

Please note that I am not complaining about the implementation, which I
think is the best we can do without making backends wait for each other.
It's just that the documentation [2] implies that the scans are
synchronized through the entire run, which is clearly not the case. So I'd
like the docs to be improved to reflect that.

How about something like:

<doc>
synchronize_seqscans (boolean)
This allows sequential scans of large tables to start from a point in
the table that is already being read by another backend. This increases the
probability that concurrent scans read the same block at about the same
time and hence share the I/O workload. Note that, due to the difference in
speeds of processing the table, the backends may eventually get out of
sync, and hence stop sharing the I/O workload.

When this is enabled, ... The default is on.
</doc>

Best regards,

[1] src/backend/access/heap/heapam.c
[2]
http://www.postgresql.org/docs/9.2/static/runtime-config-compatible.html#GUC-SYNCHRONIZE-SEQSCANS

-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


Re: [HACKERS] [DOCS] synchronize_seqscans' description is a bit misleading

2013-04-10 Thread Gurjeet Singh
On Wed, Apr 10, 2013 at 11:10 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Gurjeet Singh gurj...@singh.im writes:
  If I'm reading the code right [1], this GUC does not actually
 *synchronize*
  the scans, but instead just makes sure that a new scan starts from a
 block
  that was reported by some other backend performing a scan on the same
  relation.

 Well, that's the only *direct* effect, but ...

  Since the backends scanning the relation may be processing the relation
 at
  different speeds, even though each one took the hint when starting the
  scan, they may end up being out of sync with each other.

 The point you're missing is that the synchronization is self-enforcing:
 whichever backend gets ahead of the others will be the one forced to
 request (and wait for) the next physical I/O.  This will naturally slow
 down the lower-CPU-cost-per-page scans.  The other ones tend to catch up
 during the I/O operation.


Got it. So far, so good.

Let's consider a pathological case where a scan is performed by a
user-controlled cursor, whose scan speed depends on how fast the user presses
the "Next" button; such a scan is quickly going to fall out of sync with
other scans. Moreover, if a new scan happens to pick up the block reported
by this slow scan, that new scan may have to read blocks off the disk
afresh.

So, again, it is not guaranteed that all the scans on a relation will
synchronize with each other. Hence my proposal to include the term
'probability' in the definition.


 The feature is not terribly useful unless I/O costs are high compared to
 the CPU cost-per-page.  But when that is true, it's actually rather
 robust.  Backends don't have to have exactly the same per-page
 processing cost, because pages stay in shared buffers for a while after
 the current scan leader reads them.


Agreed. Even if the buffer has been evicted from shared_buffers, there's a
high likelihood that a scan that's close on the heels of the others will
fetch it from the FS cache.



  Imagining that all scans on a table are always synchronized, may make
 some
  wrongly believe that adding more backends scanning the same table will
 not
  incur any extra I/O; that is, only one stream of blocks will be read from
  disk no matter how many backends you add to the mix. I noticed this when
 I
  was creating partition tables, and each of those was a CREATE TABLE AS
  SELECT FROM original_table (to avoid WAL generation), and running more
 than
  3 such transactions caused the disk read throughput to behave
 unpredictably,
  sometimes even dipping below 1 MB/s for a few seconds at a stretch.

 It's not really the scans that's causing that to be unpredictable, it's
 the write I/O from the output side, which is forcing highly
 nonsequential behavior (or at least I suspect so ... how many disk units
 were involved in this test?)


You may be right. I don't have access to the system anymore, and I don't
remember the disk layout, but it's quite possible that write operations
were causing the read throughput to drop. I did try to reproduce the
behaviour on my laptop with up to 6 backends doing pure reads on a table
several times the size of system RAM, but I could not get them to fall out
of sync.

-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


[HACKERS] Patch to make pgindent work cleanly

2013-02-19 Thread Gurjeet Singh
Please find attached a patch to clean up and fix bit rot in the pgindent
script.

There were a few problems with the script.

.) It failed to use the $ENV{PGENTAB} even if it was set.
.) The file it tries to download from Postgres' ftp site is no longer
present.
ftp://ftp.postgresql.org/pub/dev/indent.netbsd.patched.tgz
.) The tar command extracted the above-mentioned file to a child directory,
but did not descend into it before running make on it.
I think it expected a tarbomb, but clearly the current .tgz file on the ftp
site is not a tarbomb.

I don't like the fact that it dies with a message "fetching xyz" rather
than saying "Could not fetch xyz", but this patch does not address that
since it doesn't really affect the output when the script does work.

With this patch in place I could very easily run it on any arbitrary file,
which seemed like black magic before the patch.

src/tools/pgindent/pgindent --build <your file path here>

-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


pgindent.patch
Description: Binary data



[HACKERS] One-line comment to improve understanding of VARSIZE_ANY_EXHDR macro

2013-02-19 Thread Gurjeet Singh
Hopefully I am not wrong.


+/* Size of a varlena data, excluding header */
 #define VARSIZE_ANY_EXHDR(PTR) \
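
For context, a typical caller looks like the following. This is a
hypothetical example, not part of the patch; it just shows that the macro
returns the payload size regardless of whether the varlena carries a 1-byte
or 4-byte header (VARSIZE_ANY() would include the header bytes as well).

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(my_text_payload_bytes);

/* Return the number of payload bytes in a text value. */
Datum
my_text_payload_bytes(PG_FUNCTION_ARGS)
{
    text       *t = PG_GETARG_TEXT_PP(0);   /* packed or unpacked header */

    PG_RETURN_INT32(VARSIZE_ANY_EXHDR(t));
}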


-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


exhdr_comment.patch
Description: Binary data



Re: [HACKERS] Fwd: Successful post to pgsql-hackers

2013-02-10 Thread Gurjeet Singh
On Sat, Feb 9, 2013 at 5:09 PM, Euler Taveira eu...@timbira.com wrote:

 On 09-02-2013 13:45, Gurjeet Singh wrote:
  BTW, I hope I understand what selfcopy is: send a copy to yourself. Why
 would
  that be turned on by default?
 
 If you want to reply to yourself...


Wouldn't I be using the copy in my sent folder to do that?

I guess it is a non-issue for me, since my mail provider (Gmail) does not
show me the mail I sent to myself as a duplicate; at least it seems that
way.

-- 
Gurjeet Singh

http://gurjeet.singh.im/


Re: [HACKERS] pg_prewarm

2013-02-09 Thread Gurjeet Singh
On Sun, Jun 24, 2012 at 1:36 PM, Cédric Villemain ced...@2ndquadrant.com wrote:

 Le samedi 23 juin 2012 02:47:15, Josh Berkus a écrit :
   The biggest problem with pgfincore from my point of view is that it
   only works under Linux, whereas I use a MacOS X machine for my
   development, and there is also Windows to think about.  Even if that
   were fixed, though, I feel we ought to have something in the core
   distribution.  This patch got more +1s than 95% of what gets proposed
   on hackers.
 
  Fincore is only a blocker to this patch if we think pgfincore is ready
  to be proposed for the core distribution.  Do we?

 I'll make it ready for. (not a huge task).


Hi Cedric,

Can you please post the progress on this, if any?

I am planning on polishing up pg_prewarm based on the reviews. As
others have said, I don't see a reason why both can't coexist, maybe on
PGXN. I am all ears if you think otherwise.

Best regards,
-- 
Gurjeet Singh

http://gurjeet.singh.im/


[HACKERS] Fwd: Successful post to pgsql-hackers

2013-02-09 Thread Gurjeet Singh
Recently I have started getting these confirmations for every email I send
to the mailing lists. I think it's related to the fact that I recently
switched to using a new email address.

How can I turn these notifications off?

-- Forwarded message --
From: pgsql-hackers-ow...@postgresql.org
Date: Sat, Feb 9, 2013 at 10:13 AM
Subject: Successful post to pgsql-hackers
To: Gurjeet Singh gurj...@singh.im


Your message to the pgsql-hackers list, posted on
  Sat, 9 Feb 2013 10:11:05 -0500

with subject
  Re: pg_prewarm

is currently being delivered.



-- 
Gurjeet Singh

http://gurjeet.singh.im/


Re: [HACKERS] Fwd: Successful post to pgsql-hackers

2013-02-09 Thread Gurjeet Singh
I tried the lists page and disabled 'selfcopy', since 'ackpost' was kind of
hidden under the mailing list names. I have updated the settings again to
disable ackpost and enable selfcopy. This email will be a test of whether
that worked.

BTW, I hope I understand what selfcopy is: send a copy to yourself. Why
would that be turned on by default?

On Sat, Feb 9, 2013 at 10:24 AM, Magnus Hagander mag...@hagander.net wrote:

 It's in your personal majordomo settings. I don't think it's related
 to that at all, but it's a flag you set on your subscription, so
 perhaps you set it by mistake.

 You should be able to set it from https://mail.postgresql.org. The
 setting you're looking for is ackpost, and you'll want to turn it
 off.

 //Magnus


 On Sat, Feb 9, 2013 at 4:17 PM, Gurjeet Singh gurj...@singh.im wrote:
  Recently I have started getting these confirmations for every email I
 send
  to the mailing lists. I think it's related to the fact that I recently
  switched to using a new email address.
 
  How can I turn these notifications off?
 
  -- Forwarded message --
  From: pgsql-hackers-ow...@postgresql.org
  Date: Sat, Feb 9, 2013 at 10:13 AM
  Subject: Successful post to pgsql-hackers
  To: Gurjeet Singh gurj...@singh.im
 
 
  Your message to the pgsql-hackers list, posted on
Sat, 9 Feb 2013 10:11:05 -0500
 
  with subject
Re: pg_prewarm
 
  is currently being delivered.
 
 
 
  --
  Gurjeet Singh
 
  http://gurjeet.singh.im/



 --
  Magnus Hagander
  Me: http://www.hagander.net/
  Work: http://www.redpill-linpro.com/




-- 
Gurjeet Singh

http://gurjeet.singh.im/


Re: [HACKERS] USE_PGXS contrib build is broken

2013-02-04 Thread Gurjeet Singh
On Mon, Feb 4, 2013 at 2:38 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:

 I just noticed that contrib programs don't build in USE_PGXS=1
 environment:

 $ pwd
 /pgsql/build/HEAD/contrib/pg_upgrade
 $ LC_ALL=C USE_PGXS=1 make
 make: *** No rule to make target `check.o', needed by `pg_upgrade'.  Stop.


FWIW, I am able to build just fine using USE_PGXS!


Re: [HACKERS] USE_PGXS contrib build is broken

2013-02-04 Thread Gurjeet Singh
On Mon, Feb 4, 2013 at 3:37 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:

 Gurjeet Singh wrote:
  On Mon, Feb 4, 2013 at 2:38 PM, Alvaro Herrera alvhe...@2ndquadrant.com
 wrote:
 
   I just noticed that contrib programs don't build in USE_PGXS=1
   environment:
  
   $ pwd
   /pgsql/build/HEAD/contrib/pg_upgrade
   $ LC_ALL=C USE_PGXS=1 make
   make: *** No rule to make target `check.o', needed by `pg_upgrade'.
  Stop.
 
  FWIW, I am able to build just fine using USE_PGXS!

 Hmm.  Mine is a VPATH build; maybe that makes a difference.  Now I'm not
 really certain that pgxs is supposed to work for VPATH.


Yup, looks like it. When I tried my (default) VPATH build, it did throw
that error, but a regular build went just fine.

My VPATH build that breaks:

USE_PGXS=1 pgmake -C contrib/pg_upgrade

Which gets translated into

USE_PGXS=1 make -C /home/gurjeet/dev/pgdbuilds/pg_master -C
contrib/pg_upgrade


Re: [HACKERS] [PERFORM] pgbench to the MAXINT

2013-01-28 Thread Gurjeet Singh
On Sat, Jan 26, 2013 at 11:24 PM, Satoshi Nagayasu sn...@uptime.jp wrote:

 Hi,

 I have reviewed this patch.

 https://commitfest.postgresql.org/action/patch_view?id=1068

 2012/12/21 Gurjeet Singh singh.gurj...@gmail.com:
  The patch is very much what you had posted, except for a couple of
  differences due to bit-rot. (i) I didn't have to #define
 MAX_RANDOM_VALUE64
  since its cousin MAX_RANDOM_VALUE is not used by code anymore, and (ii) I
  used ternary operator in DDLs[] array to decide when to use bigint vs int
  columns.
 
  Please review.
 
  As for tests, I am currently running 'pgbench -i -s 21474' using
  unpatched pgbench, and am recording the time taken; Scale factor 21475 had
  actually failed to do anything meaningful using unpatched pgbench. Next
 I'll
  run with '-s 21475' on patched version to see if it does the right thing,
  and in acceptable time compared to '-s 21474'.
 
  What tests would you and others like to see, to get some confidence
 in
  the patch? The machine that I have access to has 62 GB RAM, 16-core
  64-hw-threads, and about 900 GB of disk space.

 I have tested this patch, and have confirmed that the columns
 for aid would be switched to using bigint, instead of int,
 when the scalefactor >= 20,000.
 (aid columns would exceed the upper bound of int when sf > 21474.)

 Also, I added a few fixes on it.

 - Fixed to apply for the current git master.
 - Fixed to surpress few more warnings about INT64_FORMAT.
 - Minor improvement in the docs. (just my suggestion)

 I attached the revised one.


Looks good to me. Thanks!
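
For readers following along, the switch described above amounts to something
like the sketch below. This is hypothetical, simplified code, not the actual
pgbench DDL array; the point is only the bigint-vs-int choice and the
arithmetic behind it (100,000 accounts per scale unit, so aid overflows a
32-bit int once the scale factor exceeds 21474).

#include <stdio.h>

/* Hypothetical sketch: pick the aid column type at initialization time. */
static void
make_accounts_ddl(char *buf, size_t buflen, int scale)
{
    const char *aidtype = (scale >= 20000) ? "bigint" : "int";

    snprintf(buf, buflen,
             "create table pgbench_accounts("
             "aid %s not null, bid int, abalance int, filler char(84))",
             aidtype);
}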

-- 
Gurjeet Singh

http://gurjeet.singh.im/

