Re: [HACKERS] postgresql.auto.conf and reload

2014-08-10 Thread Amit Kapila
On Fri, Aug 8, 2014 at 11:41 AM, Fujii Masao masao.fu...@gmail.com wrote:

 Yep, right. ParseConfigFp() is not a good place to pick up data_directory.
 What about the attached patch which makes ProcessConfigFile() instead of
 ParseConfigFp() pick up data_directory just after the configuration file
is
 parsed?

I think this is a better way to handle it. A few comments about the patch:

1. Can't we handle this directly by adding an else branch to the condition below?
if (DataDir &&
!ParseConfigFile(PG_AUTOCONF_FILENAME, NULL, false, 0, elevel,
 &head, &tail))

2.
+ if (DataDir == NULL)
+ {
+ ConfigVariable *prev = NULL;
+ for (item = head; item;)
+ {
..
..
+ }

If data_directory is not present in the list, then can't we return directly
and leave the remaining work in ProcessConfigFile() for the second
pass?
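
A rough sketch of the control flow these two points suggest (illustrative
only; the cleanup label and the SetConfigOption() call here are assumptions,
not taken from the patch):

if (DataDir)
{
    if (!ParseConfigFile(PG_AUTOCONF_FILENAME, NULL, false, 0, elevel,
                         &head, &tail))
        goto cleanup_list;              /* syntax error, bail out */
}
else
{
    /*
     * First pass, DataDir not known yet: pick up only data_directory
     * from the list we just parsed, then return and leave everything
     * else to the second call of ProcessConfigFile().
     */
    for (item = head; item != NULL; item = item->next)
    {
        if (strcmp(item->name, "data_directory") == 0)
            SetConfigOption(item->name, item->value,
                            PGC_POSTMASTER, PGC_S_FILE);
    }
    goto cleanup_list;
}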

3.
I think the comments can be improved:
a.
+ In this case,
+  * we should not pick up all the settings except the data_directory
+  * from postgresql.conf yet because they might be overwritten
+  * with the settings in PG_AUTOCONF_FILENAME file which will be
+  * read later.

It would be better if we change it as below:

In this case, we shouldn't pick any settings except the data_directory
from postgresql.conf because they might be overwritten
with the settings in PG_AUTOCONF_FILENAME file which will be
read later.

b.
+Now only the data_directory needs to be picked up to
+  * read PG_AUTOCONF_FILENAME file later.

This part of the comment seems repetitive, as you already mention it
in the first line as well as at one other location:

+  * Pick up only the data_directory if DataDir is not set,

+ /*
+  * Read the configuration file for the first time. This time only
+  * data_directory parameter is picked up to determine the data directory
+  * so that we can read PG_AUTOCONF_FILENAME file next time.

Please see if you can improve it.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] postgresql.auto.conf and reload

2014-08-10 Thread Amit Kapila
On Sun, Aug 10, 2014 at 12:24 PM, Amit Kapila amit.kapil...@gmail.com
wrote:

 On Fri, Aug 8, 2014 at 11:41 AM, Fujii Masao masao.fu...@gmail.com
wrote:
 
  Yep, right. ParseConfigFp() is not a good place to pick up data_directory.
  What about the attached patch which makes ProcessConfigFile() instead of
  ParseConfigFp() pick up data_directory just after the configuration
file is
  parsed?

 I think this is a better way to handle it. A few comments about the patch:

 1. Can't we handle this directly by adding an else branch to the condition below?
 if (DataDir &&
 !ParseConfigFile(PG_AUTOCONF_FILENAME, NULL, false, 0, elevel,
 &head, &tail))

I think for this we need to change the above condition a bit.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Reporting the commit LSN at commit time

2014-08-10 Thread Andres Freund
On 2014-08-10 08:50:58 +0800, Craig Ringer wrote:
 On 08/10/2014 12:54 AM, Andres Freund wrote:
  On 2014-08-07 21:02:54 -0400, Tom Lane wrote:
  Craig Ringer cr...@2ndquadrant.com writes:
  On 08/08/2014 03:54 AM, Tom Lane wrote:
  FWIW, I think it's a seriously bad idea to expose LSNs in the protocol
  at all.   What happens five years from now when we switch to some other
  implementation that doesn't have LSNs?
 
  Everyone who's relying on them already via pg_stat_replication, etc, 
  breaks.
  They're _already_ exposed to users. That ship has sailed.
 
  They're exposed to replication tools, yeah, but embedding them in the
  wire protocol would be moving the goalposts a long way past that.  As an
  example of something that doubtless seemed like a good idea at the time,
  consider the business about how an INSERT command completion tag includes
  the OID of the inserted row.  We're stuck with that obsolete idea
  *forever* because it's embedded in the protocol for all clients.
  
  I don't think we really need to embed it at that level. And it doesn't
  have to be always on - only clients that ask for it need to get the
  answer. Something like COMMIT WITH (report_commit_lsn ON); or similar
  might do the trick?
 
 Wouldn't that force client drivers - libpq, psqlODBC, PgJDBC, etc - to
 all watch for explicit COMMITs sent by the application and rewrite them?

Any application doing such transparent failover would need to have a
driver that's aware of all that anyway. They need to learn about the
transaction boundaries, the commit command and such. I personally think
this should mean that that feature requires an explicit API call for
transaction control.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minmax indexes

2014-08-10 Thread Simon Riggs
On 8 August 2014 16:03, Heikki Linnakangas hlinnakan...@vmware.com wrote:

 1. MMTuple contains the block number of the heap page (range) that the tuple
 represents. Vacuum is no longer needed to clean up old tuples; when an index
 tuple is updated, the old tuple is deleted atomically with the insertion of
 a new tuple and updating the revmap, so no garbage is left behind.

What happens if the transaction that does this aborts? Surely that
means the new value is itself garbage? What cleans up that?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] parametric block size?

2014-08-10 Thread Fabien COELHO


Hello Andres,


But further benchmarks sound like a good idea.


I've started running some benchmarks with pgbench, with varying block &
WAL block sizes. I've done a blog post on a small subset of results,
focusing on block size with SSDs and validating the significance of the
figures found; see for more details:
http://blog.coelho.net/database/2014/08/08/postgresql-page-size-for-SSD/


I've also found an old post by Tomas Vondra who did really extensive 
tests, including playing around with file system options: 
http://www.fuzzy.cz/en/articles/ssd-benchmark-results-read-write-pgbench/


The cumulative and consistent result of all these tests, including
Hans-Jürgen Schönig's short tests, is that reducing the page size on SSDs
significantly increases pgbench-reported performance, by about 10%.


I've also done some tests with HDDs, which are quite disappointing:
PostgreSQL runs in a batch mode, a few seconds at 1000 tps followed by a
catch-up phase of 20 seconds at about 0 (zero) tps, and back to a new
cycle. I'm not sure which parameter to tweak (PostgreSQL configuration,
Linux I/O scheduler, ext4 options, or possibly staying away from ext4) to
get something more stable.


--
Fabien.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minmax indexes

2014-08-10 Thread Simon Riggs
On 8 August 2014 10:01, Heikki Linnakangas hlinnakan...@vmware.com wrote:

 It's possible that two backends arrive at phase 3 at the same time, with
 different values. For example, backend A wants to update the minimum to
 contain 10, and backend B wants to update it to 5. Now, if backend B
 gets to update the tuple first, to 5, backend A will update the tuple to 10
 when it gets the lock, which is wrong.

 The simplest solution would be to get the buffer lock in exclusive mode to
 begin with, so that you don't need to release it between steps 2 and 5. That
 might be a significant hit on concurrency, though, when most of the
 insertions don't in fact have to update the value. Another idea is to
 re-check the updated values after acquiring the lock in exclusive mode, to
 see if they match the previous values.

The simplest solution is to re-apply the test just before the update: in
the above example, if we think we want to lower the minimum to 10 and
when we get there it is already 5, we just don't update.

We don't always need to do the re-check, though. We can read the page
LSN while holding the share lock, then re-read it once we acquire the
exclusive lock. If the LSN is the same, there is no need for
datatype-specific re-checks at all.
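
A minimal sketch of that LSN dance (illustrative only; the variable names
and surrounding logic are not from any posted patch):

XLogRecPtr  lsn_before;

LockBuffer(buf, BUFFER_LOCK_SHARE);
lsn_before = PageGetLSN(BufferGetPage(buf));
/* ... decide that the summary tuple needs widening ... */
LockBuffer(buf, BUFFER_LOCK_UNLOCK);

LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
if (PageGetLSN(BufferGetPage(buf)) != lsn_before)
{
    /*
     * The page changed while we were unlocked: someone else may already
     * have widened the range, so re-run the datatype-specific check
     * before applying our value.
     */
}
/* else: the page is unchanged, apply the update without re-checking */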

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minmax indexes

2014-08-10 Thread Simon Riggs
On 8 August 2014 16:03, Heikki Linnakangas hlinnakan...@vmware.com wrote:

 I couldn't resist starting to hack on this, and implemented the scheme I've
 been having in mind:

 1. MMTuple contains the block number of the heap page (range) that the tuple
 represents. Vacuum is no longer needed to clean up old tuples; when an index
 tuple is updated, the old tuple is deleted atomically with the insertion of
 a new tuple and updating the revmap, so no garbage is left behind.

 2. LockTuple is gone. When following the pointer from revmap to MMTuple, the
 block number is used to check that you land on the right tuple. If not, the
 search is started over, looking at the revmap again.

Part 2 sounds interesting, especially because of the reduction in CPU
that it might allow.

Part 1 doesn't sound good yet.
Are they connected?

More importantly, can't we tweak this after commit? Delaying commit
just means less time for other people to see, test, understand, tune
and fix. I see you (Heikki) doing lots of incremental development,
lots of small commits. Can't we do this one the same way?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minmax indexes

2014-08-10 Thread Heikki Linnakangas

On 08/10/2014 12:22 PM, Simon Riggs wrote:

On 8 August 2014 16:03, Heikki Linnakangas hlinnakan...@vmware.com wrote:


1. MMTuple contains the block number of the heap page (range) that the tuple
represents. Vacuum is no longer needed to clean up old tuples; when an index
tuple is updated, the old tuple is deleted atomically with the insertion of
a new tuple and updating the revmap, so no garbage is left behind.


What happens if the transaction that does this aborts? Surely that
means the new value is itself garbage? What cleans up that?


It's no different from Alvaro's patch. The updated MMTuple covers the
aborted value, but that's OK from a correctness point of view.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minmax indexes

2014-08-10 Thread Heikki Linnakangas

On 08/10/2014 12:42 PM, Simon Riggs wrote:

On 8 August 2014 16:03, Heikki Linnakangas hlinnakan...@vmware.com wrote:


I couldn't resist starting to hack on this, and implemented the scheme I've
been having in mind:

1. MMTuple contains the block number of the heap page (range) that the tuple
represents. Vacuum is no longer needed to clean up old tuples; when an index
tuple is updated, the old tuple is deleted atomically with the insertion of
a new tuple and updating the revmap, so no garbage is left behind.

2. LockTuple is gone. When following the pointer from revmap to MMTuple, the
block number is used to check that you land on the right tuple. If not, the
search is started over, looking at the revmap again.


Part 2 sounds interesting, especially because of the reduction in CPU
that it might allow.

Part 1 doesn't sound good yet.
Are they connected?


Yes. The optimistic locking in part 2 is based on checking that the 
block number on the MMTuple matches what you're searching for, and that 
there is never more than one MMTuple in the index with the same block 
number.



More importantly, can't we tweak this after commit? Delaying commit
just means less time for other people to see, test, understand, tune
and fix. I see you (Heikki) doing lots of incremental development,
lots of small commits. Can't we do this one the same way?


Well, I wouldn't consider "let's redesign how locking and vacuuming
works and change the on-disk format" as incremental development ;-).
It's more like, well, redesigning the whole thing. Any testing and
tuning would certainly need to be redone after such big changes.


If you agree that these changes make sense, let's do them now and not 
waste people's time testing and tuning a dead-end design. If you don't 
agree, then let's discuss that.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] HINT: pg_hba.conf changed since last config reload

2014-08-10 Thread Craig Ringer
Hi all

I just had an idea I wanted to run by you all before turning it into a
patch.

People seem to get confused when they get auth errors because they
changed pg_hba.conf but didn't reload.

Should we emit a HINT alongside the main auth error in that case?

Given the amount of confusion that I see around pg_hba.conf from new
users, I figure anything that makes it less confusing might be a good
thing if there aren't other consequences.

Interested in a patch?
-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch to support SEMI and ANTI join removal

2014-08-10 Thread David Rowley
On Tue, Aug 5, 2014 at 10:35 PM, David Rowley dgrowle...@gmail.com wrote:


 The patch (attached) is also now able to detect when a NOT EXISTS clause
 cannot produce any records at all.


 I've attached an updated version of the patch which fixes up some
incorrect logic in the foreign key matching code, plus various comment
improvements.

Regards

David Rowley


semianti_join_removal_f92541e_2014-08-10.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HINT: pg_hba.conf changed since last config reload

2014-08-10 Thread Andres Freund
Hi,

On 2014-08-10 19:48:29 +0800, Craig Ringer wrote:
 I just had an idea I wanted to run by you all before turning it into a
 patch.
 
 People seem to get confused when they get auth errors because they
 changed pg_hba.conf but didn't reload.
 
 Should we emit a HINT alongside the main auth error in that case?
 
 Given the amount of confusion that I see around pg_hba.conf from new
 users, I figure anything that makes it less confusing might be a good
 thing if there aren't other consequences.

I think we could/would only emit that to the server log because of
security concerns. It very well might be interesting for an attacker to
know that an outdated hba.conf is still being used... Would that still
provide enough benefits?
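
A minimal sketch of a log-only variant (hba_load_time is a hypothetical
timestamp the patch would have to remember when pg_hba.conf is loaded; the
rest uses the existing error-reporting machinery):

struct stat st;

if (stat(HbaFileName, &st) == 0 &&
    st.st_mtime > hba_load_time)    /* hba_load_time: assumed to be tracked */
    ereport(LOG,
            (errmsg("pg_hba.conf has been changed since it was last loaded"),
             errhint("Reload the server configuration to apply the changes.")));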

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of versioning on Windows, take two

2014-08-10 Thread MauMau

From: Michael Paquier michael.paqu...@gmail.com

Please find attached a patch finishing the work begun during CF1. This
adds versioning support for all the remaining dll and exe files on
both MinGW and MSVC:
- regress.dll (MSVC only)
- isolationtester.exe
- pg_isolation_regress.exe
- pg_regress.exe
- pg_regress_ecpg.exe
- zic.exe
I will add this patch to CF2. Comments are welcome.


The patch applied cleanly to the latest source code.  But the build failed 
with MSVC 2008 Express due to the exact same LNK1104 error mentioned in:


http://www.postgresql.org/message-id/cab7npqrty1eikqgmz1_wbvhpvfyve9vma67ricepq6d8-eg...@mail.gmail.com

 LINK : fatal error LNK1104: cannot open file
'.\release\postgres\src_timezone_win32ver.obj'


Regards
MauMau





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HINT: pg_hba.conf changed since last config reload

2014-08-10 Thread Stephen Frost
* Andres Freund (and...@2ndquadrant.com) wrote:
 On 2014-08-10 19:48:29 +0800, Craig Ringer wrote:
  I just had an idea I wanted to run by you all before turning it into a
  patch.
  
  People seem to get confused when they get auth errors because they
  changed pg_hba.conf but didn't reload.
  
  Should we emit a HINT alongside the main auth error in that case?
  
  Given the amount of confusion that I see around pg_hba.conf from new
  users, I figure anything that makes it less confusing might be a good
  thing if there aren't other consequences.
 
 I think we could/would only emit that to the server log because of
 security concerns. It very well might be interesting for an attacker to
 know that an outdated hba.conf is still being used... Would that still
 provide enough benefits?

I'd think that'd be useful even if it's only in the main log.

To Craig's point on addressing user confusion (when the user is really
an admin trying to work through changes), a HINT along the lines of
"this incident has been logged with details to the PostgreSQL log file"
or something might help.  It amazes me how often just telling people to
go *look at the server log file* helps...

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] PostgreSQL vs oracle doing 1 million sqrts am I doing it wrong?

2014-08-10 Thread Pavel Stehule
Hi


2014-08-09 10:20 GMT+02:00 Guillaume Lelarge guilla...@lelarge.info:

 Hi,

 Le 9 août 2014 05:57, Ramirez, Danilo danilo.rami...@hmhco.com a
 écrit :
 
  Thanks to all for the great info.  We are new to postgresql and this
 discussion has both instructed us and increased our respect for the
 database and the community.
 
  I am seeing a behavior that I don’t understand and hopefully you guys
 can clear it up.
 
  I am using AWS postgresql db.m3.2xlarge and using pgadmin III 1.18
 comparing against AWS oracle on db.m3.2xlarge using sql developer and TOAD.
 
  I am running a query with 30 tables in the from clause, getting 137
 columns back (this is our most basic query, they get a lot more
 complex).  It returns 4800 rows.
 
  In oracle 1st run takes 3.92 seconds, 2nd .38 seconds.  Scrolling to end
 takes an extra 1.5 seconds for a total of 5.5.
 
  Using pgadmin, I run the query.  Looking at the lower right hand I can
 see the time going up.  It stops at 8200 ms or close to it every time, then
 it takes an extra 6 seconds before it displays the rows on the screen.
  2nd, 3rd, etc. runs all take about the same amount of time: 8 sec plus 6 sec
 
  I then changed it to return only 1 column back.   In oracle/sqldeveloper
 identical behavior as before, same time.  In postgresql it now goes down to
 1.8 seconds for 1st, 2nd, etc. runs.
 
  I then change it so that I am asking for the sum of 1 column.  In oracle
 time goes down to .2 seconds and postgresql now goes down to .2 seconds
 also.
 
  I then change it back to get the full result set and behavior goes back
 to original, oracle .38 since its cached, postgresql 8 seconds.
 

 Are you sure this is postgresql taking 8 seconds? I'd believe this is more
 something like postgresql taking something really low and PgAdmin spending
 around 8 seconds displaying it. What I mean is, PgAdmin uses really slow UI
 components and the time it shows is the time to execute the query and
 display the data. IOW, you shouldn't use it to benchmark. You'd better use
 psql. Or, much better, you should set log_min_duration_statement to 0 and
 see exactly how much time postgresql needs to execute it.

yes, try to eliminate the impact of PgAdmin

for this purpose use psql

\timing
\o /dev/null
SELECT ... -- your query

Regards

Pavel

p.s. you can send the plans of the slow and fast variants.


Re: [HACKERS] Support for N synchronous standby servers

2014-08-10 Thread Fujii Masao
On Sat, Aug 9, 2014 at 3:03 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 Hi all,

 Please find attached a patch to add support of synchronous replication
 for multiple standby servers. This is controlled by the addition of a
 new GUC parameter called synchronous_standby_num, that makes server
 wait for transaction commit on the first N standbys defined in
 synchronous_standby_names. The implementation is really
 straight-forward, and has just needed a couple of modifications in
 walsender.c for pg_stat_get_wal_senders and syncrep.c.

Great! This is really the feature which I want.
Though I forget why we missed this feature when
we added the synchronous replication feature,
maybe it's worth reading the old discussion, which
may suggest the potential problems of N sync standbys.

I just tested this feature with synchronous_standby_num = 2.
I started up only one synchronous standby and ran
the write transaction. Then the transaction was successfully
completed, i.e., it didn't wait for two standbys. Probably
this is a bug of the patch.

And you forgot to add a line for synchronous_standby_num
to postgresql.conf.sample.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minmax indexes

2014-08-10 Thread Claudio Freire
On Fri, Aug 8, 2014 at 6:01 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 It's possible that two backends arrive at phase 3 at the same time, with
 different values. For example, backend A wants to update the minimum to
 contain 10, and backend B wants to update it to 5. Now, if backend B
 gets to update the tuple first, to 5, backend A will update the tuple to 10
 when it gets the lock, which is wrong.

 The simplest solution would be to get the buffer lock in exclusive mode to
 begin with, so that you don't need to release it between steps 2 and 5. That
 might be a significant hit on concurrency, though, when most of the
 insertions don't in fact have to update the value. Another idea is to
 re-check the updated values after acquiring the lock in exclusive mode, to
 see if they match the previous values.


No, the simplest solution is to re-check the bounds after acquiring
the exclusive lock. So instead of doing the addValue with the share lock,
do a consistency check first, and if it's not consistent, do the
addValue with the exclusive lock.
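
In pseudo-C, that looks roughly like this (mm_consistent/mm_addValue stand
in for the datatype-specific support functions; this is illustrative, not
code from the patch):

bool        consistent;

LockBuffer(buf, BUFFER_LOCK_SHARE);
consistent = mm_consistent(mmtuple, newval);    /* summary already covers it? */
LockBuffer(buf, BUFFER_LOCK_UNLOCK);

if (!consistent)
{
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
    /* re-check: another backend may have widened the range meanwhile */
    if (!mm_consistent(mmtuple, newval))
        mm_addValue(mmtuple, newval);           /* widen min/max to cover newval */
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
}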


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-08-10 Thread Jeff Davis
On Fri, 2014-08-08 at 01:16 -0700, Jeff Davis wrote:
 Either way, it's better to be conservative. Attached is a version of the
 patch with opt-in memory usage tracking. Child contexts inherit the
 setting from their parent.

There was a problem with the previous patch: the parent was unlinked
before the delete_context method was called, which meant that the
parent's memory was not being decremented.

Attached is a fix. It would be simpler to just reorder the operations in
MemoryContextDelete, but there is a comment warning against that. So, I
pass the old parent as an argument to the delete_context method.

Regards,
Jeff Davis

*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***
*** 242,247  typedef struct AllocChunkData
--- 242,249 
  #define AllocChunkGetPointer(chk)	\
  	((AllocPointer)(((char *)(chk)) + ALLOC_CHUNKHDRSZ))
  
+ static void update_allocation(MemoryContext context, int64 size);
+ 
  /*
   * These functions implement the MemoryContext API for AllocSet contexts.
   */
***
*** 250,256  static void AllocSetFree(MemoryContext context, void *pointer);
  static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size);
  static void AllocSetInit(MemoryContext context);
  static void AllocSetReset(MemoryContext context);
! static void AllocSetDelete(MemoryContext context);
  static Size AllocSetGetChunkSpace(MemoryContext context, void *pointer);
  static bool AllocSetIsEmpty(MemoryContext context);
  static void AllocSetStats(MemoryContext context, int level);
--- 252,258 
  static void *AllocSetRealloc(MemoryContext context, void *pointer, Size size);
  static void AllocSetInit(MemoryContext context);
  static void AllocSetReset(MemoryContext context);
! static void AllocSetDelete(MemoryContext context, MemoryContext parent);
  static Size AllocSetGetChunkSpace(MemoryContext context, void *pointer);
  static bool AllocSetIsEmpty(MemoryContext context);
  static void AllocSetStats(MemoryContext context, int level);
***
*** 430,435  randomize_mem(char *ptr, size_t size)
--- 432,440 
   * minContextSize: minimum context size
   * initBlockSize: initial allocation block size
   * maxBlockSize: maximum allocation block size
+  *
+  * The flag determining whether this context tracks memory usage is inherited
+  * from the parent context.
   */
  MemoryContext
  AllocSetContextCreate(MemoryContext parent,
***
*** 438,443  AllocSetContextCreate(MemoryContext parent,
--- 443,468 
  	  Size initBlockSize,
  	  Size maxBlockSize)
  {
+ 	return AllocSetContextCreateTracked(
+ 		parent, name, minContextSize, initBlockSize, maxBlockSize,
+ 		(parent == NULL) ? false : parent->track_mem);
+ }
+ 
+ /*
+  * AllocSetContextCreateTracked
+  *		Create a new AllocSet context.
+  *
+  * Implementation for AllocSetContextCreate, but also allows the caller to
+  * specify whether memory usage should be tracked or not.
+  */
+ MemoryContext
+ AllocSetContextCreateTracked(MemoryContext parent,
+ 			 const char *name,
+ 			 Size minContextSize,
+ 			 Size initBlockSize,
+ 			 Size maxBlockSize,
+ 			 bool track_mem)
+ {
  	AllocSet	context;
  
  	/* Do the type-independent part of context creation */
***
*** 445,451  AllocSetContextCreate(MemoryContext parent,
  			 sizeof(AllocSetContext),
  			 AllocSetMethods,
  			 parent,
! 			 name);
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
--- 470,477 
  			 sizeof(AllocSetContext),
  			 AllocSetMethods,
  			 parent,
! 			 name,
! 			 track_mem);
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
***
*** 500,505  AllocSetContextCreate(MemoryContext parent,
--- 526,534 
  	 errdetail("Failed while creating memory context \"%s\".",
  			   name)));
  		}
+ 
+ 		update_allocation((MemoryContext) context, blksize);
+ 
  		block->aset = context;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
***
*** 590,595  AllocSetReset(MemoryContext context)
--- 619,625 
  		else
  		{
  			/* Normal case, release the block */
+ 			update_allocation(context, -(block->endptr - ((char*) block)));
  #ifdef CLOBBER_FREED_MEMORY
  			wipe_mem(block, block->freeptr - ((char *) block));
  #endif
***
*** 611,617  AllocSetReset(MemoryContext context)
   * But note we are not responsible for deleting the context node itself.
   */
  static void
! AllocSetDelete(MemoryContext context)
  {
  	AllocSet	set = (AllocSet) context;
  	AllocBlock	block = set->blocks;
--- 641,647 
   * But note we are not responsible for deleting the context node itself.
   */
  static void
! AllocSetDelete(MemoryContext context, MemoryContext parent)
  {
  	AllocSet	set = (AllocSet) context;
  	AllocBlock	block = 

[HACKERS] 9.5: Memory-bounded HashAgg

2014-08-10 Thread Jeff Davis
This patch requires the Memory Accounting patch, or something similar,
to track memory usage.

The attached patch enables hashagg to spill to disk, which means that
hashagg will contain itself to work_mem even if the planner makes a
bad misestimate of the cardinality.

This is a well-known concept; there's even a Berkeley homework
assignment floating around to implement it -- in postgres 7.2, no
less. I didn't take the exact same approach as the homework assignment
suggests, but it's not much different, either. My apologies if some
classes are still using this as a homework assignment, but postgres
needs to eventually have an answer to this problem.

Included is a GUC, enable_hashagg_disk (default on), which allows
the planner to choose hashagg even if it doesn't expect the hashtable
to fit in memory. If it's off, and the planner misestimates the
cardinality, hashagg will still use the disk to contain itself to
work_mem.

One situation that might surprise the user is if work_mem is set too
low, and the user is *relying* on a misestimate to pick hashagg. With
this patch, it would end up going to disk, which might be
significantly slower. The solution for the user is to increase
work_mem.

Rough Design:

Change the hash aggregate algorithm to accept a generic work item,
which consists of an input file as well as some other bookkeeping
information.

Initially prime the algorithm by adding a single work item where the
file is NULL, indicating that it should read from the outer plan.

If the memory is exhausted during execution of a work item, then
continue to allow existing groups to be aggregated, but do not allow new
groups to be created in the hash table. Tuples representing new groups
are saved in an output partition file referenced in the work item that
is currently being executed.

When the work item is done, emit any groups in the hash table, clear the
hash table, and turn each output partition file into a new work item.

Each time through at least some groups are able to stay in the hash
table, so eventually none will need to be saved in output partitions, no
new work items will be created, and the algorithm will terminate. This
is true even if the number of output partitions is always one.
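
In pseudo-C, the loop looks roughly like this (all type and function names
below are invented for illustration; they are not taken from the patch):

typedef struct HashWorkItem
{
    BufFile    *input;          /* NULL means "read from the outer plan" */
    List       *output_parts;   /* partitions for groups that did not fit */
} HashWorkItem;

while ((item = next_work_item()) != NULL)
{
    while ((tuple = read_tuple(item)) != NULL)
    {
        group = lookup_group(hashtable, tuple);
        if (group == NULL)
        {
            if (!memory_exhausted())
                group = create_group(hashtable, tuple);  /* new group fits */
            else
            {
                spill_tuple(item, tuple);  /* goes to an output partition */
                continue;
            }
        }
        advance_transition(group, tuple);
    }

    emit_groups(hashtable);             /* groups that stayed in memory */
    reset_hash_table(hashtable);
    add_work_items(item->output_parts); /* each partition becomes a work item */
}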

Open items:
   * costing
   * EXPLAIN details for disk usage
   * choose number of partitions intelligently
   * performance testing

Initial tests indicate that it can be competitive with sort+groupagg
when the disk is involved, but more testing is required.

Feedback welcome.

Regards,
Jeff Davis
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***
*** 2884,2889  include_dir 'conf.d'
--- 2884,2904 
/listitem
   /varlistentry
  
+  varlistentry id=guc-enable-hashagg-disk xreflabel=enable_hashagg_disk
+   termvarnameenable_hashagg_disk/varname (typeboolean/type)
+   indexterm
+primaryvarnameenable_hashagg_disk/ configuration parameter/primary
+   /indexterm
+   /term
+   listitem
+para
+ Enables or disables the query planner's use of hashed aggregation plan
+ types when the planner expects the hash table size to exceed
+ varnamework_mem/varname. The default is literalon/.
+/para
+   /listitem
+  /varlistentry
+ 
   varlistentry id=guc-enable-hashjoin xreflabel=enable_hashjoin
termvarnameenable_hashjoin/varname (typeboolean/type)
indexterm
*** a/src/backend/executor/execGrouping.c
--- b/src/backend/executor/execGrouping.c
***
*** 331,336  TupleHashEntry
--- 331,385 
  LookupTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
  	 bool *isnew)
  {
+ 	uint32 hashvalue;
+ 
+ 	hashvalue = TupleHashEntryHash(hashtable, slot);
+ 	return LookupTupleHashEntryHash(hashtable, slot, hashvalue, isnew);
+ }
+ 
+ /*
+  * TupleHashEntryHash
+  *
+  * Calculate the hash value of the tuple.
+  */
+ uint32
+ TupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot)
+ {
+ 	TupleHashEntryData	dummy;
+ 	TupleHashTable		saveCurHT;
+ 	uint32		hashvalue;
+ 
+ 	/*
+ 	 * Set up data needed by hash function.
+ 	 *
+ 	 * We save and restore CurTupleHashTable just in case someone manages to
+ 	 * invoke this code re-entrantly.
+ 	 */
+ 	hashtable->inputslot = slot;
+ 	hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
+ 	hashtable->cur_eq_funcs = hashtable->tab_eq_funcs;
+ 
+ 	saveCurHT = CurTupleHashTable;
+ 	CurTupleHashTable = hashtable;
+ 
+ 	dummy.firstTuple = NULL;	/* flag to reference inputslot */
+ 	hashvalue = TupleHashTableHash(dummy, sizeof(TupleHashEntryData));
+ 	hashvalue = TupleHashTableHash(&dummy, sizeof(TupleHashEntryData));
+ 	CurTupleHashTable = saveCurHT;
+ 
+ 	return hashvalue;
+ }
+ 
+ /*
+  * LookupTupleHashEntryHash
+  *
+  * Like LookupTupleHashEntry, but allows the caller to specify the tuple's
+  * hash value, to avoid recalculating it.
+  */
+ TupleHashEntry
+ LookupTupleHashEntryHash(TupleHashTable hashtable, TupleTableSlot *slot,
+ 		 uint32 hashvalue, bool *isnew)
+ {
  	

Re: [HACKERS] Partitioning performance: cache stringToNode() of pg_constraint.ccbin

2014-08-10 Thread Noah Misch
On Thu, Aug 07, 2014 at 09:39:57AM -0400, Robert Haas wrote:
 On Wed, Aug 6, 2014 at 9:35 PM, Bruce Momjian br...@momjian.us wrote:
  On Sun, Jan 12, 2014 at 12:53:40PM -0500, Noah Misch wrote:
  Further study revealed a defect in the patch's memory management, and I 
  have
  not gotten around to correcting that.
 
  I talked to Noah and he can't continue on this item.  Can someone else
  work on it?
 
 Well, it would be helpful if he could describe the defect he found, so
 that the next person doesn't have to guess.

Quite right.  Primarily, a leak: FreeTupleDesc() did a mere pfree() on a node
tree, leaking all but the root node.  Perhaps the best fix there is to give
each TupleDesc a memory context and then simplify FreeTupleDesc() to a context
deletion.  That will tend to waste more memory, but I haven't thought of
something clearly better.
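
For instance, something along these lines (a sketch only; tdcxt would be a
new, hypothetical TupleDesc field, and the stringToNode() result would have
to be copied into that context when the descriptor is built):

/* when building the constraint-bearing TupleDesc: */
tupdesc->tdcxt = AllocSetContextCreate(CacheMemoryContext,
                                       "TupleDesc constraint expressions",
                                       ALLOCSET_SMALL_MINSIZE,
                                       ALLOCSET_SMALL_INITSIZE,
                                       ALLOCSET_SMALL_MAXSIZE);

/* FreeTupleDesc() then drops the whole node tree in one call: */
if (tupdesc->tdcxt != NULL)
    MemoryContextDelete(tupdesc->tdcxt);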

The patch passes a relcache-owned node tree to eval_const_expressions() via
ExecPrepareExpr().  If that stands, I should add a comment to the effect that
eval_const_expressions() must not modify in place the original tree or
reference it from the result tree.  The comment at expression_planner()
implies that's already true.  I know of no exceptions, but I have not audited.
How reasonable is reliance on continued conformance to that rule?  An
alternative is to have the relcache always return a node tree copy, like it
does in RelationGetIndexExpressions().  That sacrifices about 1/4 of the
performance gain.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] nulls in GIN index

2014-08-10 Thread worthy7

http://www.postgresql.org/docs/9.1/static/gin-implementation.html
As of PostgreSQL 9.1, NULL key values can be included in the index. Also,
placeholder NULLs are included in the index for indexed items that are NULL
or contain no keys according to extractValue. This allows searches that
should find empty items to do so.

How do I define an index that includes nulls then?



--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/nulls-in-GIN-index-tp5814384.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL

2014-08-10 Thread Noah Misch
[Due for a new subject line?]

On Sat, Aug 09, 2014 at 08:16:01PM +0200, Andres Freund wrote:
 On 2014-08-09 14:09:36 -0400, Tom Lane wrote:
  Andres Freund and...@2ndquadrant.com writes:
   On 2014-08-09 14:00:49 -0400, Tom Lane wrote:
   I don't think it's anywhere near as black-and-white as you guys claim.
   What it comes down to is whether allowing existing transactions/sessions
   to finish is more important than allowing new sessions to start.
   Depending on the application, either could be more important.
  
   Nah. The current behaviour circumvents security measures we normally
   consider absolutely essential. If the postmaster died some bad shit went
   on. The likelihood of hitting corner case bugs where it's important that
   we react to a segfault/panic with a restart/crash replay is rather high.
  
  What's your point?  Once a new postmaster starts, it *will* do a crash
  restart, because certainly no shutdown checkpoint ever happened.
 
 That's not saying much. For one, there can be online checkpoints in that
 time. So it's certainly not guaranteed (or even all that likely) that
 all the WAL since the incident is replayed.  For another, it can be
 *hours* before all the backends finish.
 
 IIRC we'll continue to happily write WAL and everything after postmaster
 (and possibly some backends, corrupting shmem) have crashed. The
 bgwriter, checkpointer, backends will continue to write dirty buffers to
 disk. We'll IIRC continue to write checkpoints.  Those are simply not
 things we should be doing after the postmaster crashed if we can avoid it
 at all.

The basic support processes, including the checkpointer, exit promptly upon
detecting a postmaster exit.  Checkpoints cease.  Your central point still
stands.  WAL protects data integrity only to the extent that we stop writing
it after shared memory ceases to be trustworthy.  Crash recovery of WAL
written based on corrupt buffers just reproduces the corruption.

  The
  only issue here is what grace period existing orphaned backends are given
  to finish their work --- and it's not possible for the answer to that
  to be zero, so you don't get to assume that nothing happens in
  backend-land after the instant of postmaster crash.

Our grace period for active backends after unclean exit of one of their peers
is low, milliseconds to seconds.  Our grace period for active backends after
unclean exit of the postmaster is unconstrained.  At least one of those
policies has to be wrong.  Like Andres and Robert, I pick the second one.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal to add a QNX 6.5 port to PostgreSQL

2014-08-10 Thread Stephen Frost
* Noah Misch (n...@leadboat.com) wrote:
 [Due for a new subject line?]

Probably.

 Our grace period for active backends after unclean exit of one of their peers
 is low, milliseconds to seconds.  Our grace period for active backends after
 unclean exit of the postmaster is unconstrained.  At least one of those
 policies has to be wrong.  Like Andres and Robert, I pick the second one.

Ditto for me.  The postmaster going away really is a bad sign and the
confusion due to leftover processes is terrible for our users.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] 9.5: Memory-bounded HashAgg

2014-08-10 Thread Tomas Vondra
Hi,

it's 1AM here, so only a few comments after quickly reading the patch.

On 10.8.2014 23:26, Jeff Davis wrote:
 This patch is requires the Memory Accounting patch, or something
 similar to track memory usage.

I think the patch you sent actually includes the accounting patch. Is
that on purpose, or by accident?

I'd suggest keeping these two patches separate.


 Rough Design:
 
 Change the hash aggregate algorithm to accept a generic work item, 
 which consists of an input file as well as some other bookkeeping 
 information.
 
 Initially prime the algorithm by adding a single work item where the 
 file is NULL, indicating that it should read from the outer plan.
 
 If the memory is exhausted during execution of a work item, then 
 continue to allow existing groups to be aggregated, but do not allow
 new groups to be created in the hash table. Tuples representing new
 groups are saved in an output partition file referenced in the work
 item that is currently being executed.
 
 When the work item is done, emit any groups in the hash table, clear
 the hash table, and turn each output partition file into a new work
 item.
 
 Each time through at least some groups are able to stay in the hash 
 table, so eventually none will need to be saved in output
 partitions, no new work items will be created, and the algorithm will
 terminate. This is true even if the number of output partitions is
 always one.

So once a group gets into memory, it stays there? That's going to work
fine for aggregates with fixed-size state (int4, or generally state that
gets allocated and does not grow), but I'm afraid for aggregates with
growing state (as for example array_agg and similar) that's not really a
solution.

How difficult would it be to dump the current state into a file (and
remove them from the hash table)?

While hacking on the hash join, I envisioned the hash aggregate might
work in a very similar manner, i.e. something like this:

  * nbatches=1, nbits=0
  * when work_mem gets full = nbatches *= 2, nbits += 1
  * get rid of half the groups, using nbits from the hash
 = dump the current states into 'states.batchno' file
 = dump further tuples to 'tuples.batchno' file
  * continue until the end, or until work_mem gets full again

This is pretty much what the hashjoin does, except that the join needs
to batch the outer relation too (which hashagg does not need to do).
Otherwise most of the batching logic can be copied.
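
A sketch of the batch selection that approach implies (variable names are
illustrative, and the shift/mask is modeled on the hash join's batching
rather than taken from the patch):

batchno = (hashvalue >> (32 - nbits)) & (nbatches - 1);

if (batchno == cur_batch)
    lookup_or_create_group(hashtable, tuple, hashvalue);  /* only the current batch is hashed */
else
    save_tuple_for_batch(batchno, tuple);       /* written to tuples.<batchno> */

/*
 * When work_mem fills up: nbatches *= 2, nbits += 1, and the groups whose
 * hash now maps to a future batch are serialized to states.<batchno> and
 * removed from the hash table.
 */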

It also seems to me that the logic of the patch is about this:

  * try to lookup the group in the hash table
* found = call the transition function
* not found
* enough space = call transition function
* not enough space = tuple/group goes to a batch

Which pretty much means all tuples need to do the lookup first. The nice
thing about the hash-join approach is that you don't really need to do the
lookup - you just need to peek at the hash to see whether the group belongs
to the current batch (and if not, to which batch it should go).

Of course, that would require the ability to dump the current state of
the group, but for the aggregates using basic types as a state
(int4/int8, ...) with fixed-length state that's trivial.

For aggregates using 'internal' to pass pointers that requires some help
from the author - serialization/deserialization functions.

 Open items:
* costing

Not sure how this is done for the hash-join, but I guess that might be a
good place for inspiration.

* EXPLAIN details for disk usage
* choose number of partitions intelligently

What is the purpose of HASH_DISK_MAX_PARTITIONS? I mean, when we decide
we need 2048 partitions, why should we use less if we believe it will
get us over work_mem?

* performance testing
 
 Initial tests indicate that it can be competitive with sort+groupagg
 when the disk is involved, but more testing is required.

For us, removing the sort is a big deal, because we're working with
100M rows regularly. It's more complicated though, because the sort is
usually enforced by COUNT(DISTINCT) and that's not going to disappear
because of this patch. But that's solvable with a custom aggregate.

Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Function to know last log write timestamp

2014-08-10 Thread Tatsuo Ishii
We can know the LSN of the last committed WAL record on the primary by using
pg_current_xlog_location(). It seems there's no API to know the time
when the WAL record was created. I would like to know the standby delay by
using pg_last_xact_replay_timestamp() and such an API.

If there's no such API, it would be useful to invent such an API, IMO.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Function to know last log write timestamp

2014-08-10 Thread Fujii Masao
On Mon, Aug 11, 2014 at 9:23 AM, Tatsuo Ishii is...@postgresql.org wrote:
 We can know the LSN of the last committed WAL record on the primary by using
 pg_current_xlog_location(). It seems there's no API to know the time
 when the WAL record was created. I would like to know the standby delay by
 using pg_last_xact_replay_timestamp() and such an API.

 If there's no such API, it would be useful to invent such an API, IMO.

+1

I proposed that function before, but unfortunately it failed to be applied.
But I still think that function is useful to calculate the replication delay.
The past discussion is
http://www.postgresql.org/message-id/cahgqgwf3zjfunej5ka683ku5rqubtswtqfq7g1x0g34o+jx...@mail.gmail.com

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Function to know last log write timestamp

2014-08-10 Thread Tatsuo Ishii
 On Mon, Aug 11, 2014 at 9:23 AM, Tatsuo Ishii is...@postgresql.org wrote:
 We can know the LSN of the last committed WAL record on the primary by using
 pg_current_xlog_location(). It seems there's no API to know the time
 when the WAL record was created. I would like to know the standby delay by
 using pg_last_xact_replay_timestamp() and such an API.

 If there's no such API, it would be useful to invent such an API, IMO.
 
 +1
 
 I proposed that function before, but unfortunately it failed to be applied.
 But I still think that function is useful to calculate the replication delay.
 The past discussion is
 http://www.postgresql.org/message-id/cahgqgwf3zjfunej5ka683ku5rqubtswtqfq7g1x0g34o+jx...@mail.gmail.com

I looked into the thread briefly and found Simon and Robert gave -1
for this because of performance concerns. I'm not sure if it's an actual
performance penalty or not. Maybe we need to measure the penalty?

However I still think that kind of API is very useful because
replication delay is one of DBAs' big concerns. Why don't we have a
switch to enable the API for DBAs who think the priority is
replication delay over a small performance penalty?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers

2014-08-10 Thread Michael Paquier
On Mon, Aug 11, 2014 at 1:31 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Sat, Aug 9, 2014 at 3:03 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 Great! This is really the feature which I really want.
 Though I forgot why we missed this feature when
 we had added the synchronous replication feature,
 maybe it's worth reading the old discussion which
 may suggest the potential problem of N sync standbys.
Sure, I'll double check. Thanks for your comments.

 I just tested this feature with synchronous_standby_num = 2.
 I started up only one synchronous standby and ran
 the write transaction. Then the transaction was successfully
 completed, i.e., it didn't wait for two standbys. Probably
 this is a bug of the patch.

Oh OK, yes this is a bug in what I did. The number of standbys to wait
for takes precedence over the number of standbys found in the list of
active WAL senders. I changed the patch to take that behavior into
account. So for example if you have only one sync standby connected,
and synchronous_standby_num = 2, the client waits indefinitely.

 And, you forgot to add the line of synchronous_standby_num
 to postgresql.conf.sample.
Yep, right.

On top of that, I refactored the code in such a way that
pg_stat_get_wal_senders and SyncRepReleaseWaiters rely on a single API
to get the list of synchronous standbys found. This reduces code
duplication, duplication that already exists in HEAD...
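
The shared helper is essentially of this shape (a simplified sketch, not the
exact function from the patch; real code would also take each walsender's
spinlock before reading its fields):

static List *
SyncRepGetSynchronousStandbys(void)
{
    List   *sync_standbys = NIL;
    int     i;

    for (i = 0; i < max_wal_senders; i++)
    {
        volatile WalSnd *walsnd = &WalSndCtl->walsnds[i];

        if (walsnd->pid == 0 || walsnd->sync_standby_priority == 0)
            continue;           /* not connected, or not a sync candidate */

        sync_standbys = lappend_int(sync_standbys, i);
    }

    /*
     * The caller keeps only the first synchronous_standby_num entries,
     * ordered by sync_standby_priority.
     */
    return sync_standbys;
}
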
Regards,
-- 
Michael
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***
*** 2586,2597  include_dir 'conf.d'
  Specifies a comma-separated list of standby names that can support
  firsttermsynchronous replication/, as described in
  xref linkend=synchronous-replication.
! At any one time there will be at most one active synchronous standby;
! transactions waiting for commit will be allowed to proceed after
! this standby server confirms receipt of their data.
! The synchronous standby will be the first standby named in this list
! that is both currently connected and streaming data in real-time
! (as shown by a state of literalstreaming/literal in the
  link linkend=monitoring-stats-views-table
  literalpg_stat_replication//link view).
  Other standby servers appearing later in this list represent potential
--- 2586,2598 
  Specifies a comma-separated list of standby names that can support
  firsttermsynchronous replication/, as described in
  xref linkend=synchronous-replication.
! At any one time there will be a number of active synchronous standbys
! defined by varnamesynchronous_standby_num/; transactions waiting
! for commit will be allowed to proceed after those standby servers
! confirm receipt of their data. The synchronous standbys will be
! the first entries named in this list that are both currently connected
! and streaming data in real-time (as shown by a state of
! literalstreaming/literal in the
  link linkend=monitoring-stats-views-table
  literalpg_stat_replication//link view).
  Other standby servers appearing later in this list represent potential
***
*** 2627,2632  include_dir 'conf.d'
--- 2628,2652 
/listitem
   /varlistentry
  
+  varlistentry id=guc-synchronous-standby-num xreflabel=synchronous_standby_num
+   termvarnamesynchronous_standby_num/varname (typeinteger/type)
+   indexterm
+primaryvarnamesynchronous_standby_num/ configuration parameter/primary
+   /indexterm
+   /term
+   listitem
+para
+ Specifies the number of standbys that support
+ firsttermsynchronous replication/, as described in
+ xref linkend=synchronous-replication, and listed as the first
+ elements of xref linkend=guc-synchronous-standby-names.
+/para
+para
+ Default value is 1.
+/para
+   /listitem
+  /varlistentry
+ 
   varlistentry id=guc-vacuum-defer-cleanup-age xreflabel=vacuum_defer_cleanup_age
termvarnamevacuum_defer_cleanup_age/varname (typeinteger/type)
indexterm
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***
*** 1081,1092  primary_slot_name = 'node_a_slot'
  WAL record is then sent to the standby. The standby sends reply
  messages each time a new batch of WAL data is written to disk, unless
  varnamewal_receiver_status_interval/ is set to zero on the standby.
! If the standby is the first matching standby, as specified in
! varnamesynchronous_standby_names/ on the primary, the reply
! messages from that standby will be used to wake users waiting for
! confirmation that the commit record has been received. These parameters
! allow the administrator to specify which standby servers should be
! synchronous standbys. Note 

Re: [HACKERS] Function to know last log write timestamp

2014-08-10 Thread Fujii Masao
On Mon, Aug 11, 2014 at 10:48 AM, Tatsuo Ishii is...@postgresql.org wrote:
 On Mon, Aug 11, 2014 at 9:23 AM, Tatsuo Ishii is...@postgresql.org wrote:
 We can know the LSN of the last committed WAL record on the primary by using
 pg_current_xlog_location(). It seems there's no API to know the time
 when the WAL record was created. I would like to know the standby delay by
 using pg_last_xact_replay_timestamp() and such an API.

 If there's no such API, it would be useful to invent such an API, IMO.

 +1

 I proposed that function before, but unfortunately it failed to be applied.
 But I still think that function is useful to calculate the replication delay.
 The past discussion is
 http://www.postgresql.org/message-id/cahgqgwf3zjfunej5ka683ku5rqubtswtqfq7g1x0g34o+jx...@mail.gmail.com

 I looked into the thread briefly and found Simon and Robert gave -1
 for this because of performance concerns. I'm not sure if it's an actual
 performance penalty or not. Maybe we need to measure the penalty?

I think that the performance penalty is negligibly small because the patch
I posted before added only three stores to shared memory per commit/abort.
No time-consuming operations like locks, gettimeofday(), etc. were added.
Of course, it's worth checking whether the penalty is actually small or not,
though.
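
To illustrate the idea (a sketch, not the actual patch; the struct and the
XactLastCommit pointer are invented names for a small piece of shared
memory):

typedef struct XactLastCommitData
{
    TransactionId   xid;        /* xid of the last committed/aborted xact */
    XLogRecPtr      lsn;        /* LSN of its commit/abort WAL record */
    TimestampTz     timestamp;  /* timestamp already stored in that record */
} XactLastCommitData;

/* in the commit path, right after the commit record has been inserted: */
XactLastCommit->xid = xid;
XactLastCommit->lsn = recptr;
XactLastCommit->timestamp = xactStopTimestamp;  /* no extra gettimeofday() */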

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hokey wrong versions of libpq in apt.postgresql.org

2014-08-10 Thread Joshua D. Drake




The SO major version should be sufficient for applications to operate
normally.  If that isn't the case then I agree that we need to review
the changes we are making to see if the SO should be bumped.  Note that
Debian's viewpoint on this is more along the lines of why build against
an old version if the latest one, whose major SO is the same, exists?
This is largely to avoid the hell of versioned Build-Depends and having
to support multiple sets of -dev packages concurrently (consider that
builds generally only look for the unversioned '.so' file also..).


At the same time, there would be more merit in Josh's position if he
could point to any *actual* libpq changes that might pose application
compatibility problems.  I don't recall that we've made many such,
so the above argument might just be hypothetical.


I don't believe it's hypothetical from Josh's perspective, but I didn't
follow the threads completely to confirm that there was a real issue.
If there is a real issue here, I'd most likely vote to fix it and
backpatch it as a bug, though it's not clear if that would be considered
'good enough' for this case.


The issue that I specifically ran into is that by using 
apt.postgresql.org in its default configuration, I can't add certain 
extensions (without jumping through hoops). Simple:


Assume a running 9.2.9 from apt.postgresql.org
apt-get install pgxnclient
sudo pgxnclient install pg_repack



Doesn't work, because it is looking for libs in /var/lib/postgresql/9.2
not /var/lib/postgresql/9.3.


It also failed on another extension (but I don't recall which one it is).


Yes, I can get the 9.2 libpq, but that isn't really the point, is it? This
is quite unexpected behavior from an operational perspective. It should
just work, but it doesn't, because we are shipping a 9.3 version of libpq
from apt.postgresql.org.


(protocol versions don't really matter here, this is an operational thing).

Sincerely,

Joshua D. Drake







Thanks,

Stephen




--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, @cmdpromptinc
If we send our children to Caesar for their education, we should
 not be surprised when they come back as Romans.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers

2014-08-10 Thread Fujii Masao
On Mon, Aug 11, 2014 at 11:54 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Mon, Aug 11, 2014 at 1:31 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Sat, Aug 9, 2014 at 3:03 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 Great! This is really the feature which I really want.
 Though I forgot why we missed this feature when
 we had added the synchronous replication feature,
 maybe it's worth reading the old discussion which
 may suggest the potential problem of N sync standbys.
 Sure, I'll double check. Thanks for your comments.

 I just tested this feature with synchronous_standby_num = 2.
 I started up only one synchronous standby and ran
 the write transaction. Then the transaction was successfully
 completed, i.e., it didn't wait for two standbys. Probably
 this is a bug of the patch.

 Oh OK, yes this is a bug of what I did. The number of standbys to wait
 for takes precedence on the number of standbys found in the list of
 active WAL senders. I changed the patch to take into account that
 behavior. So for example if you have only one sync standby connected,
 and synchronous_standby_num = 2, client waits indefinitely.

Thanks for updating the patch! Again I tested the feature and found something
wrong. I set synchronous_standby_num to 2 and started three standbys. Two of
them are included in synchronous_standby_names, i.e., they are synchronous
standbys. That is, the other standby is always asynchronous. When
I shut down one of the synchronous standbys and executed the write transaction,
the transaction was successfully completed. So the transaction doesn't wait for
two sync standbys in that case. Probably this is a bug.

Regards,

-- 
Fujii Masao




Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2014-08-10 Thread Amit Kapila
On Mon, Aug 4, 2014 at 11:41 AM, Dilip kumar dilip.ku...@huawei.com wrote:

 On 31 July 2014 10:59, Amit kapila Wrote,



 Thanks for the review and valuable comments.
 I have fixed all the comments and attached the revised patch.

I have again looked into your revised patch and would like
to share my findings with you.

1.
+Number of parallel connections to perform the operation. This
option will enable the vacuum
+operation to run on parallel connections, at a time one table will
be operated on one connection.

a. How about describing this w.r.t. asynchronous connections
instead of parallel connections?
b. It is better to keep line length under 80 characters.
c. As you are using multiple connections to achieve parallelism,
I suggest you add a line to the description indicating that the user
should verify the max_connections setting, similar to what pg_dump says:

pg_dump will open njobs + 1 connections to the database, so make
  sure your max_connections setting is high enough to accommodate
  all connections.


2.
+So at one time as many tables will be vacuumed parallely as number
of jobs.

Can you briefly mention the case when the number of jobs
is more than the number of tables?

3.
+ /* When user is giving the table list, and list is smaller then
+ * number of tables
+ */
+ if (tbl_count  (parallel  tbl_count))
+ parallel = tbl_count;
+

Again, the multiline comment formatting here is wrong.

Some other instances are as below:
a.
/* This will give the free connection slot, if no slot is free it will
* wait for atleast one slot to get free.
*/
b.
/* if table list is not provided then we need to do vaccum for whole DB
* get the list of all tables and prpare the list
*/
c.
/* Some of the slot are free, Process the results for slots whichever
*  are free
*/
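
For reference, the multiline comment style used elsewhere in the sources
puts the opening /* and the closing */ on their own lines, with each
intermediate line starting with an aligned asterisk; for example, the
second instance above would become:

/*
 * If the table list is not provided, then we need to vacuum the whole
 * database: get the list of all tables and prepare the list.
 */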


4.
src/bin/scripts/vacuumdb.c:51: indent with spaces.
+ bool analyze_only, bool freeze, PQExpBuffer sql);
src/bin/scripts/vacuumdb.c:116: indent with spaces.
+int parallel = 0;
src/bin/scripts/vacuumdb.c:198: indent with spaces.
+optind++;
src/bin/scripts/vacuumdb.c:299: space before tab in indent.
+   vacuum_one_database(dbname, full, verbose,
and_analyze,

There is a lot of redundant whitespace; check with
git diff --check and fix it.


5.
res = executeQuery(conn,
select relname, nspname from pg_class c, pg_namespace ns
 where (relkind = \'r\' or relkind = \'m\')
 and c.relnamespace = ns.oid order by relpages desc,
progname, echo);

a. Here you need to use SQL keywords in capital letters; refer to one
of the other callers of executeQuery() in vacuumdb.c.
b. Why do you need the condition c.relnamespace = ns.oid in the above
query?
I think that to get the list of required objects from pg_class, you
don't need a join with pg_namespace.
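
For (a), purely to illustrate the capitalization I mean, the call could
look something like this; it is only a sketch, and per (b) whether the
pg_namespace join stays depends on whether the caller really needs
nspname:

res = executeQuery(conn,
                   "SELECT c.relname, ns.nspname "
                   "FROM pg_class c, pg_namespace ns "
                   "WHERE c.relkind IN ('r', 'm') "
                   "AND c.relnamespace = ns.oid "
                   "ORDER BY c.relpages DESC",
                   progname, echo);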

6.
vacuum_parallel()
{
..
if (!tables || !tables-head)
{
..
tbl_count++;
}
..
}

a. Why do you need a separate variable (tbl_count) here to track the
number of asynchronous/parallel connections you want? Why can't we use
ntuple?
b. There is also a compiler warning (I compiled it on Windows) related to
this variable:
vacuumdb.c(695): warning C4700: uninitialized local variable 'tbl_count' used
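
The warning itself is a one-line fix. A sketch, keeping your variable name
(and assuming, per point (a), that the PGresult of the catalog query is
called res if you want to reuse its tuple count instead):

int         tbl_count = 0;      /* must be initialized before the loop increments it */

/* or reuse the number of tables returned by the catalog query */
tbl_count = PQntuples(res);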


7.
The fix for one of my previous comments is as below:
GetQueryResult()
{
..
if (!r  !completedb)
..
}

Here I think some generic errors, like a broken connection, will also get
ignored.  Is it possible to ignore only the particular error we want to
ignore, without complicating the code?

Also, in any case, add a comment explaining why you are ignoring the
error in this particular case.
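
One shape that keeps this simple is to check the SQLSTATE of the failed
command and skip only the error the whole-database path actually expects.
This is a sketch only: the helper name is made up, and 42P01
(undefined_table, i.e. the table was dropped after the table list was
built) is my assumption about which error that is:

#include <stdbool.h>
#include <string.h>
#include "libpq-fe.h"

/*
 * Return true only for the error we deliberately tolerate in the
 * whole-database case: the table vanished between building the table
 * list and vacuuming it.  A broken connection yields no SQLSTATE here,
 * so it still gets reported as usual.
 */
static bool
error_is_ignorable(const PGresult *res)
{
    const char *sqlstate =
        res ? PQresultErrorField(res, PG_DIAG_SQLSTATE) : NULL;

    return sqlstate != NULL && strcmp(sqlstate, "42P01") == 0;
}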


8.
+ fprintf(stderr, _(%s: Number of parallel \jobs\ should be at least
1\n),
+ progname);
The formatting of the second line (progname) is not per project standard
(you can refer to other fprintf calls in the same file).
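
That is, the continuation line should be indented so that progname lines
up under the first argument of the call, as pgindent would lay it out;
something like this (message text taken from your patch, with the escaped
quotes restored):

fprintf(stderr, _("%s: Number of parallel \"jobs\" should be at least 1\n"),
        progname);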

9. + int parallel = 0;
I think it is better to name it as numAsyncCons or something similar.

10. It would be better if you added function header comments for the
newly added functions.
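
Something along these lines would do, as a sketch (the wording is mine,
and numAsyncCons is just the name suggested in point 9):

/*
 * vacuum_parallel
 *
 * Vacuum the given tables, or the whole database when no table list is
 * supplied, using up to numAsyncCons concurrent connections.
 */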


 Test2: (as per your scenario, where actual vacuum time is very less)

 Vacuum done for complete DB

 8 tables all with 1 records and few dead tuples

I think this test is missing in the attached file.  What does "few" mean?
Can you try with 0.1% and 1% dead tuples in the table, and measure the
time in milliseconds if the test completes too quickly?


PS -
It is better if you note, against each review comment/suggestion, what
you have done; in some cases that will help the reviewer understand your
fix more easily, and as the author you will also be sure that everything
got addressed.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Support for N synchronous standby servers

2014-08-10 Thread Michael Paquier
On Mon, Aug 11, 2014 at 1:26 PM, Fujii Masao masao.fu...@gmail.com wrote:
 Thanks for updating the patch! I tested the feature again and found
 something wrong. I set synchronous_standby_num to 2 and started three
 standbys. Two of them are included in synchronous_standby_names, i.e.,
 they are synchronous standbys. That is, the remaining standby is always
 asynchronous. When I shut down one of the synchronous standbys and
 executed a write transaction, the transaction completed successfully.
 So the transaction doesn't wait for two sync standbys in that case.
 Probably this is a bug.
Well, it's working in my case :)
Please see below with 4 nodes: 1 master and 3 standbys on the same host.
The master listens on 5432, the other nodes on 5433, 5434 and 5435. Each
standby's application_name is node_$PORT.
=# show synchronous_standby_names ;
 synchronous_standby_names
----------------------------
 node_5433,node_5434
(1 row)
=# show synchronous_standby_num ;
 synchronous_standby_num
-------------------------
 2
(1 row)
=# SELECT application_name,
pg_xlog_location_diff(sent_location, flush_location) AS replay_delta,
sync_priority,
sync_state
FROM pg_stat_replication ORDER BY replay_delta ASC, application_name;
 application_name | replay_delta | sync_priority | sync_state
------------------+--------------+---------------+------------
 node_5433        |            0 |             1 | sync
 node_5434        |            0 |             2 | sync
 node_5435        |            0 |             0 | async
(3 rows)
=# create table aa (a int);
CREATE TABLE
[...]
-- Stopped node with port 5433:
[...]
=# SELECT application_name,
pg_xlog_location_diff(sent_location, flush_location) AS replay_delta,
sync_priority,
sync_state
FROM pg_stat_replication ORDER BY replay_delta ASC, application_name;
 application_name | replay_delta | sync_priority | sync_state
------------------+--------------+---------------+------------
 node_5434        |            0 |             2 | sync
 node_5435        |            0 |             0 | async
(2 rows)
=# create table ab (a int);
^CCancel request sent
WARNING:  01000: canceling wait for synchronous replication due to user request
DETAIL:  The transaction has already committed locally, but might not have been replicated to the standby(s).
LOCATION:  SyncRepWaitForLSN, syncrep.c:227
CREATE TABLE

Regards,
-- 
Michael