Re: [HACKERS] (auto)vacuum truncate exclusive lock

2013-04-12 Thread Jeff Janes
On Thursday, April 11, 2013, Kevin Grittner wrote:


  I also log the number of pages truncated at the time it gave up,
  as it would be nice to know if it is completely starving or
  making some progress.

 If we're going to have the message, we should make it useful.  My
 biggest question here is not whether we should add this info, but
 whether it should be DEBUG instead of LOG


I like it being LOG.  If it were DEBUG, I don't think anyone would be
likely to see it when they needed to, as it happens sporadically on busy
servers and I don't think people would run those with DEBUG on.  I figure
it is analogous to an autovacuum cancel message it partially replaces, and
those are LOG.





  Also, I think that permanently boycotting doing autoanalyze
  because someone is camping out on an access share lock (or
  because there are a never-ending stream of overlapping locks) and
  so the truncation cannot be done is a bit drastic, especially for
  inclusion in a point release.

 That much is not a change in the event that the truncation does not
 complete.


OK, I see that now.  In the old behavior, if the lock was acquired but
then we were shoved off from it, the analyze was not done.  But in the old
behavior, if the lock was never acquired at all, then it would go ahead and
do the autoanalyze, and that has changed.  That is the way I was testing
it (camping out on an access share lock so the access exclusive lock could
never be granted in the first place, because intercepting it during the
truncate phase was hard to do), and I just assumed the behavior I saw would
apply to both cases, but it does not.
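
For reference, the test setup amounts to something like this (the table
name is made up), run in a separate session while autovacuum processes
the table:

BEGIN;
SELECT count(*) FROM bloated_table;  -- takes ACCESS SHARE and holds it until commit
-- leave the transaction open so the ACCESS EXCLUSIVE truncation lock can
-- never be granted; COMMIT; when done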


 I have seen cases where the old logic head-banged for
 hours or days without succeeding at the truncation attempt in
 autovacuum, absolutely killing performance until the user ran an
 explicit VACUUM.  And in the meantime, since the deadlock detection
 logic was killing autovacuum before it got to the analyze phase,
 the autoanalyze was never done.


OK, so there were three problems.  It would take a second to yield; in doing
so it would abandon all the progress it had made in that second rather than
saving it; and it would tight-loop (restricted by naptime) on this because
of the lack of analyze.  So it fixed the first two in a way that seems an
absolute improvement for the auto case, but it made the third one worse in
a common case, where it never acquires the lock in the first place, and so
doesn't analyze where before it did in that one case.



 Perhaps the new logic should go ahead and get its lock even on a
 busy system (like the old logic),


As far as I can tell, the old logic was always conditional on the
AccessExclusive lock acquisition, whether it was manual or auto.

Cheers,

Jeff


Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-12 Thread Hannu Krosing

On 04/11/2013 07:29 PM, Fujii Masao wrote:

On Thu, Apr 11, 2013 at 10:25 PM, Hannu Krosing ha...@2ndquadrant.com wrote:

You just shut down the old master and let the standby catch
up (takes a few microseconds ;) ) before you promote it.

After this you can start up the former master with recovery.conf
and it will follow nicely.

No. When you shut down the old master, it might not have been
able to send all the WAL records to the standby.

In what cases (other than a standby lagging too much or
not listening at all) have you observed this ?

I have observed
this situation several times. So in your approach, new standby
might fail to catch up with the master nicely.

the page http://wiki.postgresql.org/wiki/Streaming_Replication claims this:

* Graceful shutdown

When smart/fast shutdown is requested, the primary waits to exit
until XLOG records have been sent to the standby, up to the
shutdown checkpoint record.

Maybe you were requesting an immediate shutdown?
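
For what it's worth, one way to verify that on a 9.2-era pair (using the
monitoring functions as they were named then) is to compare WAL positions
before promoting:

-- on the old master, just before shutting it down:
SELECT pg_current_xlog_location();
-- on the standby, after the master's smart/fast shutdown, check that it
-- has received and replayed at least up to that position:
SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();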

Regards
Hannu Krosing





[HACKERS] Detach/attach table and index data files from one cluster to another

2013-04-12 Thread Sameer Thakur
Hello,

The current process of transferring data files from one cluster to another
by using pg_dump and pg_restore is time consuming.

The proposed tool tries to make migration faster for tables and indices
only, by copying their binary data files. This is like pg_upgrade, but used
for migration of tables and indices.

The discussion at
http://www.postgresql.org/message-id/caa-alv5cqf09zvfrcb1xxuqlsp-oux0s_hq6ryscd6ctami...@mail.gmail.com

speaks of the possibility of detaching/attaching databases as an alternative
to dump/restore. But the process of freezing XIDs and zeroing out LSNs makes
the solution equally time consuming, if not more so.

 But if we consider just tables and indexes to be detached/reattached,
would this be a viable alternative to dump and restore of tables?

 The same discussion indicates it could be done, but is more complicated, as
one has to deal with the system catalogs of the newly mounted table and map
old OIDs to new ones. This is required to ensure consistency in roles and
objects owned by those roles.

 We would also need to ensure the LSN values of the reattached pages are less
than the current WAL endpoint in the receiver.

 Are there any more issues we need to be aware of?

regards

Sameer


Re: [HACKERS] [COMMITTERS] pgsql: Add sql_drop event for event triggers

2013-04-12 Thread Dimitri Fontaine
Tom Lane t...@sss.pgh.pa.us writes:
 Yeah, I was just looking at the IfSupported variant.  In the structure
 I just suggested (separate ProcessSlowUtility function), we could make
 that work by having switch cases for some statements in both functions,

I've done it the way you propose here, and then in the Slow variant we
have two sets of cases again: those with some manual transactional
behavior or some other code complexities, and the really simple ones.

The attached patch involves a second layer of distinction to simplify
the code fully and remove all the Event Trigger related macrology that I
didn't much like. Maybe that's going a tad too far, see what you think.

Of course the patch passes make check.

Also, some switch cases are now duplicated for the sole benefit of
keeping some compiler help, but I don't know how much help we're talking
about now that we have 3 different switches. It seemed cleaner to do it
that way, as it allows us to ERROR out in the very last default case.

Finally, I've been surprised to find out that those cases only trigger
ddl_command_start (and not ddl_command_end); I think that's a bug we
should be fixing:

case T_AlterDomainStmt:
case T_DefineStmt:
case T_IndexStmt:   /* CREATE INDEX */

The T_IndexStmt should be triggering ddl_command_end when stmt->concurrent
is false, and that's not the case in the code in the master branch.
The two other cases should just be moved to the later switch so that
both start and end triggers happen.

Or maybe I'm missing something here. Given that it might well be the
case, I only refactored the code, keeping its current behavior, rather
than fixing what I think is a bug while doing so.
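
For what it's worth, a quick way to observe this on a scratch database (the
object names below are made up for the test) is an end trigger that simply
reports its tag:

CREATE FUNCTION report_ddl_end() RETURNS event_trigger
LANGUAGE plpgsql AS $$
BEGIN
    RAISE NOTICE 'ddl_command_end fired for %', tg_tag;
END;
$$;

CREATE EVENT TRIGGER report_ddl_end_trg ON ddl_command_end
    EXECUTE PROCEDURE report_ddl_end();

CREATE TABLE evt_test (id int);              -- raises the NOTICE
CREATE INDEX evt_test_idx ON evt_test (id);  -- should raise it too if the end event fires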

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

*** a/src/backend/tcop/utility.c
--- b/src/backend/tcop/utility.c
***
*** 68,73 
--- 68,80 
  /* Hook for plugins to get control in ProcessUtility() */
  ProcessUtility_hook_type ProcessUtility_hook = NULL;
  
+ /* local function declarations */
+ void standard_ProcessSlowUtility(Node *parsetree,
+  const char *queryString,
+  ParamListInfo params,
+  DestReceiver *dest,
+  char *completionTag,
+  ProcessUtilityContext context);
  
  /*
   * Verify user has ownership of specified relation, else ereport.
***
*** 342,411  ProcessUtility(Node *parsetree,
  dest, completionTag, context);
  }
  
- #define InvokeDDLCommandEventTriggers(parsetree, fncall) \
- 	do { \
- 	if (isCompleteQuery) \
- { \
- 			EventTriggerDDLCommandStart(parsetree); \
- 		} \
- 		fncall; \
- if (isCompleteQuery) \
- { \
- 			EventTriggerSQLDrop(parsetree); \
- 			EventTriggerDDLCommandEnd(parsetree); \
- 		} \
- 	} while (0)
- 
- #define InvokeDDLCommandEventTriggersIfSupported(parsetree, fncall, objtype) \
- 	do { \
- 		bool	_supported = EventTriggerSupportsObjectType(objtype); \
- 		\
- 		if (_supported) \
- 		{ \
- 			EventTriggerDDLCommandStart(parsetree); \
- 		} \
- 		fncall; \
- 		if (_supported) \
- 		{ \
- 			EventTriggerSQLDrop(parsetree); \
- 			EventTriggerDDLCommandEnd(parsetree); \
- 		} \
- 	} while (0)
- 
  /*
!  * UTILITY_BEGIN_QUERY and UTILITY_END_QUERY are a pair of macros to enclose
!  * execution of a single DDL command, to ensure the event trigger environment
!  * is appropriately set up before starting, and tore down after completion or
!  * error.
   */
- #define UTILITY_BEGIN_QUERY(isComplete) \
- 	do { \
- 		bool		_needCleanup; \
- 		\
- 		_needCleanup = (isComplete) && EventTriggerBeginCompleteQuery(); \
- 		\
- 		PG_TRY(); \
- 		{ \
- 			/* avoid empty statement when followed by a semicolon */ \
- 			(void) 0
- 
- #define UTILITY_END_QUERY() \
- 		} \
- 		PG_CATCH(); \
- 		{ \
- 			if (_needCleanup) \
- 			{ \
- EventTriggerEndCompleteQuery(); \
- 			} \
- 			PG_RE_THROW(); \
- 		} \
- 		PG_END_TRY(); \
- 		if (_needCleanup) \
- 		{ \
- 			EventTriggerEndCompleteQuery(); \
- 		} \
- 	} while (0)
- 
  void
  standard_ProcessUtility(Node *parsetree,
  		const char *queryString,
--- 349,364 
  dest, completionTag, context);
  }
  
  /*
!  * Processing a Utility statement is now divided into two functions. A fast
!  * path is taken for statements for which we have not Event Trigger support, so
!  * that there's no need to do any Event Trigger specific setup.
!  *
!  * It's not just a performance requirement, mind you, because refreshing the
!  * Event Triggers cache (see src/backend/utils/cache/evtcache.c) is prone to
!  * taking a snapshot, which can not occur with in a START TRANSACTION statement
!  * for example.
   */
  void
  standard_ProcessUtility(Node *parsetree,
  		const char *queryString,
***
*** 415,429  standard_ProcessUtility(Node *parsetree,
  		ProcessUtilityContext context)
  {
  	bool		isTopLevel = (context == 

Re: [HACKERS] [GSOC] questions about idea rewrite pg_dump as library

2013-04-12 Thread Dimitri Fontaine
Tom Lane t...@sss.pgh.pa.us writes:
 Alvaro Herrera alvhe...@2ndquadrant.com writes:
 This idea doesn't work because of back-patch considerations (i.e. we
 would not be able to create the functions in back branches, and so this
 new style of pg_dump would only work with future server versions).  So

That is a policy question, not a technical one.

We could either add the new functions in the backend binary itself, or
provide it as an extension that pg_dump would know to install when
needed, if we decided it's ok.

My understanding is that we will need to change that policy anyway the day
we have an on-disk page format change and pg_upgrade needs to flag the old
cluster's pages with the old page version number before being able to run,
or something.

 There are other issues too, in particular that most of the backend's
 code tends to work on SnapshotNow time whereas pg_dump would really
 prefer it was all done according to the transaction snapshot.  We have

Would that be solved by having MVCC catalogs, or would the backend code
you're talking about not be covered by them? (That would be surprising to
me, as much as it would undercut the benefits of MVCC catalogs, but well.)

 In any case, push it to the backend offers no detectable help with the
 core design issue here, which is figuring out what functionality needs
 to be exposed with what API.

Andrew did begin to work on those parts with the Retail DDL project. We
know of several getddl implementations, and you can also have a look at
the pg_dump -Fs (split format) patch that didn't make it into 9.3, where
some API work has been done.

The need exists and some thinking over the API to get here did happen.
Some more certainly needs to be done, granted.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support




Re: [HACKERS] [GSOC] questions about idea rewrite pg_dump as library

2013-04-12 Thread Dimitri Fontaine
Michael Paquier michael.paqu...@gmail.com writes:
 I recall discussions about reverse engineering of a parsed query tree in
 the event trigger threads but nothing has been committed I think. Also, you

Yes. The name used in there was Normalized Command String.

 need to consider that implementing such reverse engineering mechanism in
 core might not be a good thing for new features and maintenance, as it
 would mean that it is necessary to change those APIs consistently with what
 is added on the parsing side.

The goal is to retain the same API, which is quite simple:

  function get_command_string(Node *parsetree) returns text

At the SQL level, the Node * is of datatype internal and you can't
forge it; you need to be given it in some way. In the Event Trigger
case we thought of a TG_PARSETREE magic variable, or maybe another
function get_current_parsetree() that only works when called from an
event trigger.

The other part of the API of course is how to represent the data, and as
we're talking about a Normalized Command String, there's no choice but
issuing a valid SQL command string that the server would know how to
execute and which would have the same side effects.

So of course a 9.3 and a 9.4 server equipped with that hypothetical
function would behave differently where the syntax has changed. And that's
exactly why I think the best choice here is to have that code
embedded and maintained in core.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support




Re: [HACKERS] [GSOC] questions about idea rewrite pg_dump as library

2013-04-12 Thread Shuai Fan

On 04/11/2013 11:48 PM, Andrew Dunstan wrote:
It could be interesting to have a library that would output database 
metadata in some machine readable and manipulatable format such as 
JSON or XML. One thing that's annoying about the text output pg_dump 
produces is that it's not at all structured, so if you want, for 
example, to restore a table but to a table of a different name, or to 
a different schema, then you're reduced to having to mangle the SQL by 
using hand editing or regular expression matching. Something with the 
potential to ease that pain would be worth having.


Yes. This is really interesting. The current pg_dump code supports four 
ArchiveFormats: archCustom, archTar, archNull and archDirectory. These 
formats are implementations of the pg_backup interface. Maybe I could try 
to add two more implementations, for XML and JSON.
It is worth mentioning that a month ago I wrote a program that parses XML 
files into CSV using the libxerces-c library. Although that program is 
little more than a hello world, I might still benefit from it here, 
because both deal with the XML format; all I would need to do is try 
another XML library.
I had a look at JSON on the wiki. The format is a little like XML; both 
are nested. And there are some libraries that could be used, e.g. libjson 
(or json.c, or another JSON library written in C) and libxml2 (or 
something else).
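
As a rough illustration (not a proposed format, just the kind of structured,
machine-readable table metadata such an output could carry), 9.2's
row_to_json() already produces something along these lines:

SELECT row_to_json(t)
FROM (SELECT n.nspname AS schema_name, c.relname AS table_name, c.relkind
        FROM pg_class c
        JOIN pg_namespace n ON n.oid = c.relnamespace
       WHERE c.relkind = 'r'
         AND n.nspname NOT IN ('pg_catalog', 'information_schema')) AS t;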


BTW, could this be an idea for GSoC: adding XML and JSON output formats 
for pg_dump? If so, I can have a try.


Thank you all for your attention.


Best regards,
 Shuai







Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-12 Thread Andres Freund
On 2013-04-12 02:29:01 +0900, Fujii Masao wrote:
 On Thu, Apr 11, 2013 at 10:25 PM, Hannu Krosing ha...@2ndquadrant.com wrote:
 
  You just shut down the old master and let the standby catch
  up (takes a few microseconds ;) ) before you promote it.
 
  After this you can start up the former master with recovery.conf
  and it will follow nicely.

 No. When you shut down the old master, it might not have been
 able to send all the WAL records to the standby. I have observed
 this situation several times. So in your approach, new standby
 might fail to catch up with the master nicely.

It seems most of this thread is focusing on the wrong thing then. If we
really are only talking about planned failover then we need to solve
*that*, not some ominous "don't flush data too early" scheme which has
noticeable performance and implementation complexity problems.

I guess you're observing that not everything is replicated because you're
doing an immediate shutdown - probably because performing the shutdown
checkpoint would take too long. This seems solvable by implementing a
recovery connection command which initiates a shutdown that just
disables future WAL inserts and returns the last LSN that has been
written. Then you can fail over as soon as that LSN has been reached
and can make the previous master follow from there on without problems.

You could even teach the standby not to increment the timeline in that
case since that's safe.

The biggest issue seems to be how to implement this without another
spinlock acquisition for every XLogInsert(), but that seems possible.

Greetings,

Andres Freund

--
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-12 Thread Andres Freund
On 2013-04-12 11:18:01 +0530, Pavan Deolasee wrote:
 On Thu, Apr 11, 2013 at 8:39 PM, Ants Aasma a...@cybertec.at wrote:
 
  On Thu, Apr 11, 2013 at 5:33 PM, Hannu Krosing ha...@2ndquadrant.com
  wrote:
   On 04/11/2013 03:52 PM, Ants Aasma wrote:
  
   On Thu, Apr 11, 2013 at 4:25 PM, Hannu Krosing ha...@2ndquadrant.com
   wrote:
  
   The proposed fix - halting all writes of data pages to disk and
   to WAL files while waiting ACK from standby - will tremendously
   slow down all parallel work on master.
  
   This is not what is being proposed. The proposed fix halts writes of
   only data pages that are modified within the window of WAL that is not
   yet ACKed by the slave. This means pages that were recently modified
   and where the clocksweep or checkpoint has decided to evict them. This
   only affects the checkpointer, bgwriter and backends doing allocation.
   Furthermore, for the backend clocksweep case it would be reasonable to
   just pick another buffer to evict. The slowdown for most actual cases
   will be negligible.
  
   You also need to hold back all WAL writes, including the ones by
   parallel async and locally-synced transactions. Which means that
   you have to make all locally synced transactions to wait on the
   syncrep transactions committed before them.
   After getting the ACK from slave you then have a backlog of stuff
   to write locally, which then also needs to be sent to slave. Basically
   this turns a nice smooth WAL write-and-stream pipeline into a
   chunky wait-and-write-and-wait-and-stream-and-wait :P
   This may not be a problem in slight write load cases, which is
   probably the most widely happening usecase for postgres, but it
   will harm top performance and also force people to get much
   better (and more expensive) hardware than would otherways
   be needed.
 
  Why would you need to hold back WAL writes? WAL is written on the master
  first and then streamed to the slave as it is done now. You would only
  need to hold back dirty page evictions having a recent enough LSN to not
  yet be replicated. This holding back is already done to wait for local
  WAL flushes, see bufmgr.c:1976 and bufmgr.c:669. When a page gets dirtied
  its usage count gets bumped, so it will not be considered for
  eviction for at least one clocksweep cycle. In normal circumstances
  that will be enough time to get an ACK from the slave. When WAL is
  generated at a higher rate than can be replicated this will not be
  true. In that case backends that need to bring in new pages will have
  to wait for WAL to be replicated before they can continue. That will
  hopefully include the backends that are doing the dirtying, throttling
  the WAL generation rate. This would definitely be optional behavior,
  not something turned on by default.
 
 
 I agree. I don't think the proposed change would cause much of a performance
 bottleneck, since the proposal is to hold back writing of dirty pages until
 the WAL is replicated successfully to the standby. The heap pages are
 mostly written by the background processes, often much later than the WAL
 for the change is written. So in all likelihood, there will be no wait
 involved. Of course, this will not be true for very frequently updated
 pages that must be written at a checkpoint.

I don't think that holds true at all. If you look at pg_stat_bgwriter in
any remotely busy cluster with a hot data set larger than shared_buffers
you'll notice that a large percentage of writes will have been done by
backends themselves.
Yes, we need to improve on this, and we are talking about it right now
in another thread, but until that's solved this argument seems to
fall flat on its face.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-12 Thread Pavan Deolasee
On Fri, Apr 12, 2013 at 4:29 PM, Andres Freund and...@2ndquadrant.comwrote:



 I don't think that holds true at all. If you look at pg_stat_bgwriter in
 any remotely bugs cluster with a hot data set over shared_buffers you'll
 notice that a large percentage of writes will have been done by backends
 themselves.


Even if what you are saying is true, which I am sure it is, the pages that
the backend is evicting must not have been recently used according to the
LRU algorithm, which means the WAL pertaining to the last change to the
page has in all likelihood already been replicated, unless the replication
is lagging far behind. Of course, if the standby is not able to keep pace
with the master in a realistic manner then we have a problem with the
approach.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-12 Thread Andres Freund
On 2013-04-12 16:58:44 +0530, Pavan Deolasee wrote:
 On Fri, Apr 12, 2013 at 4:29 PM, Andres Freund and...@2ndquadrant.comwrote:
 
 
 
  I don't think that holds true at all. If you look at pg_stat_bgwriter in
  any remotely busy cluster with a hot data set larger than shared_buffers
  you'll notice that a large percentage of writes will have been done by
  backends themselves.
 
 
  Even if what you are saying is true, which I am sure it is, the pages that
  the backend is evicting must not have been recently used according to the
  LRU algorithm, which means the WAL pertaining to the last change to the
  page has in all likelihood already been replicated, unless the replication
  is lagging far behind. Of course, if the standby is not able to keep pace
  with the master in a realistic manner then we have a problem with the
  approach.

It frequently takes only time in the sub-second to few-second range for
usage counts to drop to zero.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [GSOC] questions about idea rewrite pg_dump as library

2013-04-12 Thread Hannu Krosing

On 04/11/2013 12:17 AM, Tom Lane wrote:

Alvaro Herrera alvhe...@2ndquadrant.com writes:

Hannu Krosing wrote:

Natural solution to this seems to move most of pg_dump functionality
into backend as functions, so we have pg_dump_xxx() for everything
we want to dump plus a topological sort function for getting the
objects in right order.

This idea doesn't work because of back-patch considerations (i.e. we
would not be able to create the functions in back branches, and so this
new style of pg_dump would only work with future server versions).  So
pg_dump itself would have to retain capability to dump stuff from old
servers.  This seems unlikely to fly at all, because we'd be then
effectively maintaining pg_dump in two places, both backend and the
pg_dump source code.

There are other issues too, in particular that most of the backend's
code tends to work on SnapshotNow time whereas pg_dump would really
prefer it was all done according to the transaction snapshot.

I was just thinking of moving the queries pg_dump currently
uses into UDFs, which do _not_ use the catalog cache, but will use
the same SQL to query the catalogs as pg_dump currently does,
under whatever snapshot mode is currently set.

pg_dump will still need to have the same queries for older
versions of PostgreSQL, but for new versions pg_dump can become
catalog-agnostic.

And I think we can retire pg_dump support for older
PostgreSQL versions the same way we drop support for
older versions of PostgreSQL itself.

Hannu


We have
got bugs of that ilk already in pg_dump, but we shouldn't introduce a
bunch more.  Doing this right would therefore mean that we'd have to
write a lot of duplicative code in the backend, ie, it's not clear that
we gain any synergy by pushing the functionality over.  It might
simplify cross-backend-version issues (at least for backend versions
released after we'd rewritten all that code) but otherwise I'm afraid
it'd just be pushing the problems somewhere else.

In any case, push it to the backend offers no detectable help with the
core design issue here, which is figuring out what functionality needs
to be exposed with what API.

main things I see would be

 * get_list_of_objects(object_type, pattern or namelist)
 * get_sql_def_for_object(object_type, object_name)
 * sort_by_dependency(list of [obj_type, obj_name])

from this you could easily construct most uses, especially if
sort_by_dependency(list of [obj_type, obj_name])
would be smart enough to break circular dependencies, like
turning two tables with mutual FKs into table definitions without
FKs plus separate constraints.
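
For illustration, the circular case mentioned above (table names are made
up) is the one where the definitions have to be emitted as bare tables
first, with the constraints added separately:

CREATE TABLE a (id int PRIMARY KEY, b_id int);
CREATE TABLE b (id int PRIMARY KEY, a_id int);
-- only once both tables exist can the mutual FKs be added
ALTER TABLE a ADD CONSTRAINT a_b_fk FOREIGN KEY (b_id) REFERENCES b (id);
ALTER TABLE b ADD CONSTRAINT b_a_fk FOREIGN KEY (a_id) REFERENCES a (id);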

Or we could always have constraints separately, so that
the ones depending on non-exported objects would be easy
to leave out

Maybe the dependency API analysis itself is something
worth a GSoC effort?

Hannu


regards, tom lane






Re: [HACKERS] [PATCH] pg_regress and non-default unix socket path

2013-04-12 Thread Christoph Berg
Re: To PostgreSQL Hackers 2013-04-09 20130409120807.gd26...@msgid.df7cb.de

If the patch looks too intrusive at this stage of the release, it
would be enough if the last chunk got included, which should really be
painless:

 diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
 new file mode 100644
 index b632326..d4d4279
 *** a/src/test/regress/pg_regress.c
 --- b/src/test/regress/pg_regress.c

[...]

 *** regression_main(int argc, char *argv[],
 *** 2249,2255 
*/
    header(_("starting postmaster"));
    snprintf(buf, sizeof(buf),
  !  SYSTEMQUOTE "\"%s/postgres\" -D \"%s/data\" 
  -F%s -c \"listen_addresses=%s\" > \"%s/log/postmaster.log\" 2>&1" SYSTEMQUOTE,
 bindir, temp_install,
 debug ? " -d 5" : "",
 hostname ? hostname : "",
  --- 2254,2262 
 */
    header(_("starting postmaster"));
    snprintf(buf, sizeof(buf),
  !  hostname && *hostname == '/'
  !  ? SYSTEMQUOTE "\"%s/postgres\" -D 
  \"%s/data\" -F%s -k \"%s\" > \"%s/log/postmaster.log\" 2>&1" SYSTEMQUOTE
  !  : SYSTEMQUOTE "\"%s/postgres\" -D 
  \"%s/data\" -F%s -c \"listen_addresses=%s\" > \"%s/log/postmaster.log\" 2>&1" 
  SYSTEMQUOTE,
 bindir, temp_install,
 debug ? " -d 5" : "",
 hostname ? hostname : "",

Christoph
-- 
c...@df7cb.de | http://www.df7cb.de/




Re: [HACKERS] [sepgsql 2/3] Add db_schema:search permission checks

2013-04-12 Thread Robert Haas
On Mon, Apr 8, 2013 at 12:28 PM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 Thanks. I found two obvious wording issues here; please see the smaller
 one of the attached patches. I didn't fix up the manner of using XXX in
 source code comments.

Committed.

 Also, the attached function-execute-permission patch is a rebased
 version. I rethought its event name should be OAT_FUNCTION_EXECUTE,
 rather than OAT_FUNCTION_EXEC according to the manner without
 abbreviation. Other portion is same as previous ones.

Great.  This looks fine to me, committed.  I also committed your
getObjectIdentity patch, which caused some regression test output
changes.  I applied the necessary correction before committing, I
think, but please check.

 BTW, if it were possible to set things up so that the test_sepgsql
 script could validate the version of the sepgsql-regtest policy
 installed, that would eliminate a certain category of errors.  I
 notice also that the regression tests themselves seem to fail if you
 invoke the script as contrib/sepgsql/test_sepgsql rather than
 ./test_sepgsql, which suggests another possible pre-validation step.

 Please see the test-script-fixup patch.
 I added cd `dirname $0` at the top of the script. It makes pg_regress
 avoid this trouble. Previously, pg_regress was probably unable to read
 the SQL commands to run.

 The problem with validating the sepgsql-regtest policy module is that
 the semodule command requires root privileges to list the installed
 policy modules, so I avoided using this command in the test_sepgsql
 script.
 However, I have an idea that does not make the script fail even if
 sudo semodule -l returns an error, except for the case where it runs
 correctly and the policy version is not the expected one.
 What is your opinion of this check?

Not sure that's too useful.  And I don't like the idea of putting sudo
commands in a test harness script.  That seems too much like the sort
of thing bad people do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [ADMIN] after 9.2.4 patch vacuumdb -avz not analyzing all tables

2013-04-12 Thread Mike Broers
On further review, this particular server skipped from 9.2.2 to 9.2.4.  This
is my busiest and most downtime-sensitive server, and I was waiting on a
maintenance window to patch to 9.2.3 when 9.2.4 dropped and bumped up the
urgency.  However, I have 3 other less busy production servers that were
all running 9.2.3 for a while, didn't exhibit the problem, and still don't on
9.2.4.

Running ANALYZE from psql seems to work OK in the meantime; I'll report back
if I notice any problems with that.

Thanks very much for the response and investigation, it is much
appreciated!


On Thu, Apr 11, 2013 at 8:48 PM, Kevin Grittner kgri...@ymail.com wrote:

 Tom Lane t...@sss.pgh.pa.us wrote:

  However I've got to say that both of those side-effects of
  exclusive-lock abandonment seem absolutely brain dead now that I
  see them.  Why would we not bother to tell the stats collector
  what we've done?  Why would we think we should not do ANALYZE
  when we were told to?
 
  Would someone care to step forward and defend this behavior?
  Because it's not going to be there very long otherwise.

 I'm pretty sure that nobody involved noticed the impact on VACUUM
 ANALYZE command; all discussion was around autovacuum impact; and
 Jan argued that this was leaving things in a status quo for that,
 so I conceded the point and left it for a follow-on patch if
 someone felt the behavior needed to change.  Sorry for the miss.

 http://www.postgresql.org/message-id/50bb700e.8060...@yahoo.com

 As far as I'm concerned all effects on the explicit command were
 unintended and should be reverted.

 --
 Kevin Grittner
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company



Re: [HACKERS] Detach/attach table and index data files from one cluster to another

2013-04-12 Thread Tom Lane
Sameer Thakur samthaku...@gmail.com writes:
 The proposed tool tries to make migration faster for tables and indices
 only by copying their binary data files.

There's 0 chance of making that work, because the two databases wouldn't
have the same notions of committed XIDs.  You apparently don't
understand what you read in the other discussion --- the steps you are
objecting to are not optional, whether copying a whole tablespace or
only one table.

regards, tom lane




Re: [HACKERS] [sepgsql 2/3] Add db_schema:search permission checks

2013-04-12 Thread Alvaro Herrera
Robert Haas escribió:
 On Mon, Apr 8, 2013 at 12:28 PM, Kohei KaiGai kai...@kaigai.gr.jp wrote:

  Also, the attached function-execute-permission patch is a rebased
  version. I rethought its event name should be OAT_FUNCTION_EXECUTE,
  rather than OAT_FUNCTION_EXEC according to the manner without
  abbreviation. Other portion is same as previous ones.
 
 Great.  This looks fine to me, committed.  I also committed your
 getObjectIdentity patch, which caused some regression test output
 changes.  I applied the necessary correction before committing, I
 think, but please check.

I think the function-execute code path is still using
getObjectDescription rather than identity.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Analyzing bug 8049

2013-04-12 Thread Robert Haas
On Thu, Apr 11, 2013 at 1:25 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 This idea needs more fleshing out, but it's seeming awfully attractive
 right now.  The big problem with it is that it's going to be a more
 invasive patch than I feel terribly comfortable about back-patching.
 However, I'm not sure there's much choice, because I don't see any narrow
 fix for 9.2 that would not result in very substantial degradation of its
 optimization ability.  We can't just lobotomize equivalence-class
 processing.

 The plan I'm considering is to get this written and committed to HEAD
 in the next week, so that it can go out in 9.3beta1.  After the patch
 has survived a reasonable amount of beta testing, I'd be more comfortable
 about back-patching into 9.2.

I'm not very sanguine about the chances that back-patching this won't
provoke any screams of agony ... but I don't have a better idea,
either.  Letting queries return wrong answers isn't a superior
solution, for sure.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [sepgsql 2/3] Add db_schema:search permission checks

2013-04-12 Thread Robert Haas
On Fri, Apr 12, 2013 at 10:42 AM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Robert Haas escribió:
 On Mon, Apr 8, 2013 at 12:28 PM, Kohei KaiGai kai...@kaigai.gr.jp wrote:

  Also, the attached function-execute-permission patch is a rebased
  version. I rethought its event name should be OAT_FUNCTION_EXECUTE,
  rather than OAT_FUNCTION_EXEC according to the manner without
  abbreviation. Other portion is same as previous ones.

 Great.  This looks fine to me, committed.  I also committed your
 getObjectIdentity patch, which caused some regression test output
 changes.  I applied the necessary correction before committing, I
 think, but please check.

 I think the function-execute code path is still using
 getObjectDescription rather than identity.

Yeah, I think so.  KaiGai, can you provide a follow-on patch to fix that?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [ADMIN] after 9.2.4 patch vacuumdb -avz not analyzing all tables

2013-04-12 Thread Scott Marlowe
Does this behavior only affect the 9.2 branch? Or was it ported to 9.1 or
9.0 or 8.4 as well?


On Thu, Apr 11, 2013 at 7:48 PM, Kevin Grittner kgri...@ymail.com wrote:

 Tom Lane t...@sss.pgh.pa.us wrote:

  However I've got to say that both of those side-effects of
  exclusive-lock abandonment seem absolutely brain dead now that I
  see them.  Why would we not bother to tell the stats collector
  what we've done?  Why would we think we should not do ANALYZE
  when we were told to?
 
  Would someone care to step forward and defend this behavior?
  Because it's not going to be there very long otherwise.

 I'm pretty sure that nobody involved noticed the impact on VACUUM
 ANALYZE command; all discussion was around autovacuum impact; and
 Jan argued that this was leaving things in a status quo for that,
 so I conceded the point and left it for a follow-on patch if
 someone felt the behavior needed to change.  Sorry for the miss.

 http://www.postgresql.org/message-id/50bb700e.8060...@yahoo.com

 As far as I'm concerned all effects on the explicit command were
 unintended and should be reverted.

 --
 Kevin Grittner
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company






-- 
To understand recursion, one must first understand recursion.


Re: [HACKERS] Analyzing bug 8049

2013-04-12 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Apr 11, 2013 at 1:25 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 The plan I'm considering is to get this written and committed to HEAD
 in the next week, so that it can go out in 9.3beta1.  After the patch
 has survived a reasonable amount of beta testing, I'd be more comfortable
 about back-patching into 9.2.

 I'm not very sanguine about the chances that back-patching this won't
 provoke any screams of agony ... but I don't have a better idea,
 either.  Letting queries return wrong answers isn't a superior
 solution, for sure.

The only alternative I can see is to make a back-patch that just teaches
get_eclass_for_sort_expr() to compute valid nullable_relids for the sort
expression.  That's necessary code in any case, and it would fix the
immediately complained-of bug.  The thing that scares me is that it is
now clear that equivclass.c is capable of considering two expressions to
be equivalent when they should not be; that is, I'm afraid there are
variants of the problem that would not be cured by such a limited
back-patch.  But maybe I should try to create such an example before
proposing the more invasive approach.

regards, tom lane




Re: [HACKERS] [PATCH] pg_regress and non-default unix socket path

2013-04-12 Thread Robert Haas
On Tue, Apr 9, 2013 at 8:08 AM, Christoph Berg c...@df7cb.de wrote:
 Debian has been patching pg_regress for years because our default unix
 socket directory is /var/run/postgresql, but that is not writable by
 the build user at build time. This used to be a pretty ugly make-
 patch-make check-unpatch-make install patch dance, but now it is a
 pretty patch that makes pg_regress accept --host=/tmp.

 The patch is already part of the .deb files on apt.postgresql.org and
 passes all regression test suites there.

 Please consider it for 9.3. (And maybe backpatch it all the way...)

The hunk that changes the messages might need some thought so that it
doesn't cause a translation regression.  But in general I see no
reason not to do this before we release beta1.  It seems safe enough,
and changes that reduce the need for packagers to carry private
patches are, I think, generally a good thing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] How to use EXPLAIN (TIMING)

2013-04-12 Thread Robins Tharakan
Hi,

While creating regression tests for EXPLAIN I am (somehow) unable to get
(TIMING) option to work with EXPLAIN!

I must be doing something stupid, but all these options below just didn't
work. Could someone point me in the right direction?

mpf2=# explain (TIMING) SELECT 1;
ERROR:  EXPLAIN option TIMING requires ANALYZE

mpf2=# EXPLAIN (TIMING FALSE) SELECT 1;
QUERY PLAN
--
 Result  (cost=0.00..0.01 rows=1 width=0)
(1 row)

mpf2=# EXPLAIN ANALYSE (TIMING) SELECT 1;
ERROR:  syntax error at or near TIMING
LINE 1: EXPLAIN ANALYSE (TIMING) SELECT 1;
 ^
mpf2=# EXPLAIN ANALYSE (TIMING TRUE) SELECT 1;
ERROR:  syntax error at or near TIMING
LINE 1: EXPLAIN ANALYSE (TIMING TRUE) SELECT 1;
 ^
mpf2=# EXPLAIN ANALYSE TIMING TRUE SELECT 1;
ERROR:  syntax error at or near TIMING
LINE 1: EXPLAIN ANALYSE TIMING TRUE SELECT 1;
^
mpf2=# EXPLAIN ANALYSE TIMING SELECT 1;
ERROR:  syntax error at or near TIMING
LINE 1: EXPLAIN ANALYSE TIMING SELECT 1;
^
mpf2=# EXPLAIN TIMING ANALYSE SELECT 1;
ERROR:  syntax error at or near TIMING
LINE 1: EXPLAIN TIMING ANALYSE SELECT 1;
^
mpf2=# EXPLAIN (TIMING) ANALYSE SELECT 1;
ERROR:  syntax error at or near ANALYSE
LINE 1: EXPLAIN (TIMING) ANALYSE SELECT 1;
 ^
mpf2=# EXPLAIN (TIMING) ANALYZE SELECT 1;
ERROR:  syntax error at or near ANALYZE
LINE 1: EXPLAIN (TIMING) ANALYZE SELECT 1;
 ^
mpf2=# EXPLAIN (TIMING FALSE) ANALYZE SELECT 1;
ERROR:  syntax error at or near ANALYZE
LINE 1: EXPLAIN (TIMING FALSE) ANALYZE SELECT 1;
   ^
mpf2=# SELECT version();
                                                  version
---------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.2.3 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4), 64-bit
(1 row)


Thanks
--
Robins Tharakan


Re: [HACKERS] How to use EXPLAIN (TIMING)

2013-04-12 Thread Andres Freund
On 2013-04-12 20:44:02 +0530, Robins Tharakan wrote:
 Hi,
 
 While creating regression tests for EXPLAIN I am (somehow) unable to get
 (TIMING) option to work with EXPLAIN!
 
 I must be doing something stupid but all these options below just didn't
 work. Could someone point me to the right direction?

EXPLAIN (ANALYZE, TIMING on/off) ...;
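
For example:

EXPLAIN (ANALYZE, TIMING off) SELECT 1;
EXPLAIN (ANALYZE, TIMING on)  SELECT 1;  -- on is the default when ANALYZE is given

TIMING is only recognized inside the parenthesized option list and requires
ANALYZE, which is why the other spellings in your session fail.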

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [ADMIN] after 9.2.4 patch vacuumdb -avz not analyzing all tables

2013-04-12 Thread Kevin Grittner
Scott Marlowe scott.marl...@gmail.com wrote:

 Does this behavior only affect the 9.2 branch? Or was it ported
 to 9.1 or 9.0 or 8.4 as well?

After leaving it on master for a while to see if anyone reported
problems in development, I back-patched as far as 9.0 in time for
the 9.2.3 (and related) patches.  Prior to that the code was too
different for it to be the same patch, and (perhaps not entirely
coincidentally) I had not seen the problems before 9.0.  From 9.0
on I have seen multiple sites (all using queuing from Slony or a
JMS implementation) with recurring problems when the queue
temporarily got large, shrank again, and then wrapped around to the
beginning of the table's file space.  In some cases performance was
so impaired that when such an event was triggered they would shut
down their application until a manual VACUUM could be run.

-- 
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [PATCH] pg_regress and non-default unix socket path

2013-04-12 Thread Alvaro Herrera
Robert Haas escribió:
 On Tue, Apr 9, 2013 at 8:08 AM, Christoph Berg c...@df7cb.de wrote:
  Debian has been patching pg_regress for years because our default unix
  socket directory is /var/run/postgresql, but that is not writable by
  the build user at build time. This used to be a pretty ugly make-
  patch-make check-unpatch-make install patch dance, but now it is a
  pretty patch that makes pg_regress accept --host=/tmp.
 
  The patch is already part of the .deb files on apt.postgresql.org and
  passes all regression test suites there.
 
  Please consider it for 9.3. (And maybe backpatch it all the way...)
 
 The hunk that changes the messages might need some thought so that it
 doesn't cause a translation regression.

FWIW I don't think we translate pg_regress at all, and I have my doubts
that it makes much sense.  I'd go as far as suggest we get rid of the _()
decorations in the lines we're changing.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Analyzing bug 8049

2013-04-12 Thread Dickson S. Guedes
Em Sex, 2013-04-12 às 10:58 -0400, Tom Lane escreveu:
 Robert Haas robertmh...@gmail.com writes:
  On Thu, Apr 11, 2013 at 1:25 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  The plan I'm considering is to get this written and committed to HEAD
  in the next week, so that it can go out in 9.3beta1.  After the patch
  has survived a reasonable amount of beta testing, I'd be more comfortable
  about back-patching into 9.2.
 
  I'm not very sanguine about the chances that back-patching this won't
  provoke any screams of agony ... but I don't have a better idea,
  either.  Letting queries return wrong answers isn't a superior
  solution, for sure.
 
 The only alternative I can see is to make a back-patch that just teaches
 get_eclass_for_sort_expr() to compute valid nullable_relids for the sort
 expression.  


In my tests, after ANALYZE on _bug_header and _bug_line, the query plan
changes and the query results return as expected. Is there a chance that
things aren't too bad?


[]s
-- 
Dickson S. Guedes
mail/xmpp: gue...@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br
http://www.rnp.br/keyserver/pks/lookup?search=0x8F3E3C06D428D10A




Re: [HACKERS] Patch to make pgindent work cleanly

2013-04-12 Thread Bruce Momjian
On Tue, Feb 19, 2013 at 04:50:45PM -0500, Gurjeet Singh wrote:
 Please find attached the patch for some cleanup and fix bit rot in pgindent
 script.
 
 There were a few problems with the script.
 
 .) It failed to use the $ENV{PGENTAB} even if it was set.
 .) The file it tries to download from Postgres' ftp site is no longer present.
 ftp://ftp.postgresql.org/pub/dev/indent.netbsd.patched.tgz
 .) The tar command extracted the above-mentioned file to a child directory, 
 but
 did not descend into it before running make on it.
 I think it expected a tarbomb, but clearly the current .tgz file on ftp
 site is not a tarbomb.
 
 I don't like the fact that it dies with a message "fetching xyz" rather than
 saying "Could not fetch xyz", but this patch does not address that since it
 doesn't really affect the output when the script does work.
 
 With this patch in place I could very easily run it on any arbitrary file,
 which seemed like a black-magic before the patch.
 
 src/tools/pgindent/pgindent --build your file path here

I have applied this patch.  However, I modified the tarball name to
reference $INDENT_VERSION, rather than hard-coding 1.2.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] Detach/attach table and index data files from one cluster to another

2013-04-12 Thread Andrew Dunstan


On 04/12/2013 10:15 AM, Tom Lane wrote:

Sameer Thakur samthaku...@gmail.com writes:

The proposed tool tries to make migration faster for tables and indices
only by copying their binary data files.

There's 0 chance of making that work, because the two databases wouldn't
have the same notions of committed XIDs.



Yeah. Trying to think way outside the box, could we invent some sort of 
fixup mechanism that could be applied to adopted files? Of course, that 
could slow things down so much that it wouldn't be worth it, but it 
might be a nice research project.


cheers

andrew




Re: [HACKERS] Analyzing bug 8049

2013-04-12 Thread Tom Lane
Dickson S. Guedes lis...@guedesoft.net writes:
 In my tests, after ANALYZE _bug_header and _bug_line, the query plan
 changes and query results returns as expected. Is this a chance that
 things isn't too bad?

No, it just means that in this particular scenario, the bug only
manifests if a nestloop plan is chosen --- without that, there's no
possibility of pushing a join clause down to the scan level anyway.

regards, tom lane




Re: [HACKERS] (auto)vacuum truncate exclusive lock

2013-04-12 Thread Kevin Grittner
Jeff Janes jeff.ja...@gmail.com wrote:

 If we're going to have the message, we should make it useful.
 My biggest question here is not whether we should add this info,
 but whether it should be DEBUG instead of LOG

 I like it being LOG.  If it were DEBUG, I don't think anyone
 would be likely to see it when they needed to, as it happens
 sporadically on busy servers and I don't think people would run
 those with DEBUG on.  I figure it is analogous to an autovacuum
 cancel message it partially replaces, and those are LOG.

OK, sold.

 Also, I think that permanently boycotting doing autoanalyze
 because someone is camping out on an access share lock (or
 because there are a never-ending stream of overlapping locks)
 and so the truncation cannot be done is a bit drastic,
 especially for inclusion in a point release.

 That much is not a change in the event that the truncation does
 not complete.  

 OK, I see that now.  In the old behavior, if the lock was
 acquired but then we were shoved off from it, the analyze was
 not done.  But in the old behavior, if the lock was never
 acquired at all, then it would go ahead and do the autoanalyze,
 and that has changed.  That is the way I was testing it
 (camping out on an access share lock so the access exclusive
 lock could never be granted in the first place, because
 intercepting it during the truncate phase was hard to do), and
 I just assumed the behavior I saw would apply to both cases,
 but it does not.

Ah, I see now.  So the actual worst case for the old code, in terms
of both head-banging and statistics, was if autovacuum was able to
acquire the lock but then many tasks all piled up behind its lock.
If the system was even *more* busy it would not acquire the lock at
all, and would behave better.

 I have seen cases where the old logic head-banged for
 hours or days without succeeding at the truncation attempt in
 autovacuum, absolutely killing performance until the user ran an
 explicit VACUUM.  And in the meantime, since the deadlock
 detection logic was killing autovacuum before it got to the
 analyze phase, the autoanalyze was never done.

 OK, so there were three problems.  It would take a second to yield;
 in doing so it would abandon all the progress it had made in that
 second rather than saving it; and it would tight-loop (restricted
 by naptime) on this because of the lack of analyze.  So it fixed
 the first two in a way that seems an absolute improvement for the
 auto case, but it made the third one worse in a common case,
 where it never acquires the lock in the first place, and so
 doesn't analyze where before it did in that one case.

Yeah, I see that now.

 Perhaps the new logic should go ahead and get its lock even on a
 busy system (like the old logic),

 As far as I can tell, the old logic was always conditional on the
 AccessExclusive lock acquisition, whether it was manual or auto.

OK, will review that to confirm; but assuming that's right, and
nobody else is already working on a fix, I propose to do the
following:

(1)  Restore the prior behavior of the VACUUM command.  This was
only ever intended to be a fix for a serious autovacuum problem
which caused many users serious performance problems -- in some
cases including unscheduled down time.  I also saw sites where,
having been bitten by this, they disabled autovacuum and later ran
into problems with bloat and/or xid wraparound.

(2)  If autovacuum decides to try to truncate but the lock cannot
be initially acquired, and analyze is requested, skip the
truncation and do the autoanalyze.  If the table is so hot that we
cannot get the lock, the space may get re-used soon, and if not
there is a good chance another autovacuum will trigger soon.  If
the user really wants the space released to the OS immediately,
they can run a manual vacuum to force the issue.
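
For example, with a made-up table name, the manual command that forces the
truncation attempt is simply:

VACUUM VERBOSE bloated_table;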

If I don't hear anything within the next day or two, I'll write
that up and post it here before applying (and back-patching to
affected branches).

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Detach/attach table and index data files from one cluster to another

2013-04-12 Thread Alvaro Herrera
Andrew Dunstan escribió:
 
 On 04/12/2013 10:15 AM, Tom Lane wrote:
 Sameer Thakur samthaku...@gmail.com writes:
 The proposed tool tries to make migration faster for tables and indices
 only by copying their binary data files.
 There's 0 chance of making that work, because the two databases wouldn't
 have the same notions of committed XIDs.
 
 Yeah. Trying to think way outside the box, could we invent some sort
 of fixup mechanism that could be applied to adopted files? Of
 course, that could slow things down so much that it wouldn't be
 worth it, but it might be a nice research project.

I think the fixup procedure involves freezing Xids (prior to the
transporting), which the OP said he didn't want to do.

If you don't freeze beforehand, there's not enough info in the new
cluster to know which tuples are dead/alive.  Another option would be to
have a private copy of pg_clog/pg_subtrans for the transported
table(s), but that seems very difficult to arrange.
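
A minimal sketch of the freeze-beforehand option (hypothetical table name)
would be to run this on the source cluster before transporting the file:

VACUUM (FREEZE, VERBOSE) transported_table;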

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Detach/attach table and index data files from one cluster to another

2013-04-12 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On 04/12/2013 10:15 AM, Tom Lane wrote:
 There's 0 chance of making that work, because the two databases wouldn't
 have the same notions of committed XIDs.

 Yeah. Trying to think way outside the box, could we invent some sort of 
 fixup mechanism that could be applied to adopted files?

Well, it wouldn't be that hard to replace XIDs with FrozenXID or
InvalidXID as appropriate, if you had access to the source database's
clog while you did the copying.  It just wouldn't be very fast.
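
As a rough illustration of that substitution rule only, here is a toy
standalone sketch; clog_lookup() is a hypothetical stand-in for consulting
the source cluster's clog, and only the constants themselves
(InvalidTransactionId = 0, FrozenTransactionId = 2) are the real values:

    /* Toy sketch of the xmin substitution rule; clog_lookup() is a
     * stand-in for reading the copied clog segments, not a real API. */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t TransactionId;

    #define InvalidTransactionId     ((TransactionId) 0)
    #define FrozenTransactionId      ((TransactionId) 2)
    #define FirstNormalTransactionId ((TransactionId) 3)

    typedef enum { XID_COMMITTED, XID_ABORTED, XID_IN_PROGRESS } XidStatus;

    /* stand-in: would really consult the source cluster's clog */
    static XidStatus clog_lookup(TransactionId xid) { (void) xid; return XID_COMMITTED; }

    static TransactionId
    fixup_xmin(TransactionId xmin)
    {
        if (xmin < FirstNormalTransactionId)
            return xmin;                      /* bootstrap/frozen: leave alone */
        switch (clog_lookup(xmin))
        {
            case XID_COMMITTED:
                return FrozenTransactionId;   /* visible to everyone in the new cluster */
            case XID_ABORTED:
                return InvalidTransactionId;  /* never committed: tuple is dead */
            default:
                return xmin;                  /* in progress: cannot be transported safely */
        }
    }

    int
    main(void)
    {
        printf("%u\n", (unsigned) fixup_xmin(1000));   /* prints 2 with the stub above */
        return 0;
    }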

I suppose it would still be faster than a COPY transfer, but I'm not
sure it'd be enough faster to justify the work and the additional
portability hits you'd be taking.

regards, tom lane




Re: [HACKERS] Detach/attach table and index data files from one cluster to another

2013-04-12 Thread Andres Freund
On 2013-04-12 12:14:24 -0400, Tom Lane wrote:
 Andrew Dunstan and...@dunslane.net writes:
  On 04/12/2013 10:15 AM, Tom Lane wrote:
  There's 0 chance of making that work, because the two databases wouldn't
  have the same notions of committed XIDs.
 
  Yeah. Trying to think way outside the box, could we invent some sort of 
  fixup mechanism that could be applied to adopted files?
 
 Well, it wouldn't be that hard to replace XIDs with FrozenXID or
 InvalidXID as appropriate, if you had access to the source database's
 clog while you did the copying.  It just wouldn't be very fast.

I think if one goes over the heap and hint-bits everything (so the item
pointers don't have to be immediately rewritten), freezes everything, and
so on, it should be doable at about disk speed unless you have a really
fast disk subsystem.
But it still is fairly complicated and I doubt it's really necessary.

 I suppose it would still be faster than a COPY transfer, but I'm not
 sure it'd be enough faster to justify the work and the additional
 portability hits you'd be taking.

Using binary copy might already give quite a speedup, Sameer, did you
try that?

Also, do you really need parts of a cluster or would a base backup of
the whole cluster do the trick?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Detach/attach table and index data files from one cluster to another

2013-04-12 Thread Pavan Deolasee
On Fri, Apr 12, 2013 at 9:44 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Andrew Dunstan and...@dunslane.net writes:
  On 04/12/2013 10:15 AM, Tom Lane wrote:
  There's 0 chance of making that work, because the two databases wouldn't
  have the same notions of committed XIDs.

  Yeah. Trying to think way outside the box, could we invent some sort of
  fixup mechanism that could be applied to adopted files?

 Well, it wouldn't be that hard to replace XIDs with FrozenXID or
 InvalidXID as appropriate, if you had access to the source database's
 clog while you did the copying.  It just wouldn't be very fast.


Would it be possible to fix the XIDs *after* copying the data files,
potentially on a different server so as to avoid any additional overhead on
the main server ? I guess so, though we will probably need some mechanism
to lock out access to the table (which seems easy), flush all its data
pages to the disk and some way to reliably flush all clog pages as well so
that they can be copied along with the data files. The page LSNs seem to be
easy to handle and can be easily zeroed out outside the server.

I wonder though if this all looks like material for something like
pg_reorg(pack), though some kind of support from the core may be required.

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


Re: [HACKERS] [PATCH] pg_regress and non-default unix socket path

2013-04-12 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 The hunk that changes the messages might need some thought so that it
 doesn't cause a translation regression.  But in general I see no
 reason not to do this before we release beta1.  It seems safe enough,
 and changes that reduce the need for packagers to carry private
 patches are, I think, generally a good thing.

It looks to me like this is asking for pg_regress to adopt a nonstandard
interpretation of PGHOST, which doesn't seem like a wise thing at all,
especially if it's not documented.

FWIW, the equivalent thing in the Red Hat/Fedora packages can be seen
in this patch:

http://pkgs.fedoraproject.org/cgit/postgresql.git/plain/postgresql-var-run-socket.patch

which would not get noticeably shorter if we hacked pg_regress in the
suggested way.  AFAICT, instead of touching pg_regress.c, Red Hat's
patch would need to do something to the regression Makefiles if we
wanted to use this implementation.  I'm not convinced that'd be better
at all.  TBH, if this is committed, the Red Hat patches will probably
end up reverting it.

regards, tom lane




Re: [HACKERS] Detach/attach table and index data files from one cluster to another

2013-04-12 Thread Bruce Momjian
On Fri, Apr 12, 2013 at 10:22:38PM +0530, Pavan Deolasee wrote:
 
 
 
 On Fri, Apr 12, 2013 at 9:44 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 
 Andrew Dunstan and...@dunslane.net writes:
  On 04/12/2013 10:15 AM, Tom Lane wrote:
  There's 0 chance of making that work, because the two databases 
 wouldn't
  have the same notions of committed XIDs.
 
  Yeah. Trying to think way outside the box, could we invent some sort of
  fixup mechanism that could be applied to adopted files?
 
 Well, it wouldn't be that hard to replace XIDs with FrozenXID or
 InvalidXID as appropriate, if you had access to the source database's
 clog while you did the copying.  It just wouldn't be very fast.
 
 
 
 Would it be possible to fix the XIDs *after* copying the data files,
 potentially on a different server so as to avoid any additional overhead on 
 the
 main server ? I guess so, though we will probably need some mechanism to lock
 out access to the table (which seems easy), flush all its data pages to the
 disk and some way to reliably flush all clog pages as well so that they can be
 copied along with the data files. The page LSNs seem to be easy to handle and
 can be easily zeroed out outside the server.
 
 I wonder though if this all look like a material for something like pg_reorg
 (pack) though some kind of support from the core may be required.

Uh, now that you mention it, pg_upgrade in non-link mode does
something similar, in that it copies the data files and clog.  You could
use pg_upgrade in non-link mode, run VACUUM FREEZE on the upgraded
cluster, and then copy the data files.

The only problem is that pg_upgrade can't upgrade tablespaces with the
same system catalog version because the tablespace directory names would
conflict.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] (auto)vacuum truncate exclusive lock

2013-04-12 Thread Tom Lane
Kevin Grittner kgri...@ymail.com writes:
 OK, will review that to confirm;but assuming that's right, and
 nobody else is already working on a fix, I propose to do the
 following:

 (1)  Restore the prior behavior of the VACUUM command.  This was
 only ever intended to be a fix for a serious autovacuum problem
 which caused many users serious performance problems -- in some
 cases including unscheduled down time.  I also saw sites where,
 having been bitten by this, they disabled autovacuum and later ran
 into problems with bloat and/or xid wraparound.

 (2)  If autovacuum decides to try to truncate but the lock cannot
 be initially acquired, and analyze is requested, skip the
 truncation and do the autoanalyze.  If the table is so hot that we
 cannot get the lock, the space may get re-used soon, and if not
 there is a good chance another autovacuum will trigger soon.  If
 the user really wants the space released to the OS immediately,
 they can run a manual vacuum to force the issue.

I think that the minimum appropriate fix here is to revert the hunk
I quoted, ie take out the suppression of stats reporting and analysis.

However, we're still thinking too small.  I've been wondering whether we
couldn't entirely remove the dirty, awful kluges that were installed in
the lock manager to kill autovacuum when somebody blocked behind it.
This mechanism should ensure that AV never takes an exclusive lock
for long enough to be a serious problem, so do we need that anymore?

regards, tom lane




Re: [HACKERS] (auto)vacuum truncate exclusive lock

2013-04-12 Thread Andres Freund
On 2013-04-12 13:09:02 -0400, Tom Lane wrote:
 Kevin Grittner kgri...@ymail.com writes:
  OK, will review that to confirm;but assuming that's right, and
  nobody else is already working on a fix, I propose to do the
  following:
 
  (1)  Restore the prior behavior of the VACUUM command.  This was
  only ever intended to be a fix for a serious autovacuum problem
  which caused many users serious performance problems -- in some
  cases including unscheduled down time.  I also saw sites where,
  having been bitten by this, they disabled autovacuum and later ran
  into problems with bloat and/or xid wraparound.

  (2)  If autovacuum decides to try to truncate but the lock cannot
  be initially acquired, and analyze is requested, skip the
  truncation and do the autoanalyze.  If the table is so hot that we
  cannot get the lock, the space may get re-used soon, and if not
  there is a good chance another autovacuum will trigger soon.  If
  the user really wants the space released to the OS immediately,
  they can run a manual vacuum to force the issue.
 
 I think that the minimum appropriate fix here is to revert the hunk
 I quoted, ie take out the suppression of stats reporting and analysis.
 
 However, we're still thinking too small.  I've been wondering whether we
 couldn't entirely remove the dirty, awful kluges that were installed in
 the lock manager to kill autovacuum when somebody blocked behind it.
 This mechanism should ensure that AV never takes an exclusive lock
 for long enough to be a serious problem, so do we need that anymore?

Wouldn't that make DROP TABLE stop working while autovac is processing
the table? Considering how long e.g. a full table vacuum on a huge table
can take that doesn't seem to be acceptable.
Sure, you can manually kill the autovac process, but that will soon be
restarted. So you have to know that you need to start the DROP TABLE
beforehand so it will get the lock quicker and such.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Object oriented programming language native to EAV

2013-04-12 Thread Sergei Sheinin
A programming language environment whose parser and interpreter are written
in PL/pgSQL. A proof-of-concept prototype has been tested. It is an
object-oriented programming language implemented with a self-describing
Entity-Attribute-Value model that stores objects and object metadata
descriptions. The environment has features of a column-oriented database.
High-level language commands are mapped to sets of key-value pairs that
comprise low-level language code. http://sproutpl.wordpress.com


Re: [HACKERS] (auto)vacuum truncate exclusive lock

2013-04-12 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 Kevin Grittner kgri...@ymail.com writes:
 OK, will review that to confirm;but assuming that's right, and
 nobody else is already working on a fix, I propose to do the
 following:

 (1)  Restore the prior behavior of the VACUUM command.  This was
 only ever intended to be a fix for a serious autovacuum problem
 which caused many users serious performance problems -- in some
 cases including unscheduled down time.  I also saw sites where,
 having been bitten by this, they disabled autovacuum and later ran
 into problems with bloat and/or xid wraparound.

 (2)  If autovacuum decides to try to truncate but the lock cannot
 be initially acquired, and analyze is requested, skip the
 truncation and do the autoanalyze.  If the table is so hot that we
 cannot get the lock, the space may get re-used soon, and if not
 there is a good chance another autovacuum will trigger soon.  If
 the user really wants the space released to the OS immediately,
 they can run a manual vacuum to force the issue.

 I think that the minimum appropriate fix here is to revert the hunk
 I quoted, ie take out the suppression of stats reporting and analysis.

I'm not sure I understand -- are you proposing that is all we do
for both the VACUUM command and autovacuum?  (i.e., we don't try to
fully restore the old VACUUM command behavior; just fix the troublesome
failure to generate statistics?)

 However, we're still thinking too small.  I've been wondering whether we
 couldn't entirely remove the dirty, awful kluges that were installed in
 the lock manager to kill autovacuum when somebody blocked behind it.
 This mechanism should ensure that AV never takes an exclusive lock
 for long enough to be a serious problem, so do we need that anymore?

Are you suggesting that just in master/HEAD or back to 9.0?  If the
latter, what existing problem does it cure (besides ugly code)?

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] (auto)vacuum truncate exclusive lock

2013-04-12 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2013-04-12 13:09:02 -0400, Tom Lane wrote:
 However, we're still thinking too small.  I've been wondering whether we
 couldn't entirely remove the dirty, awful kluges that were installed in
 the lock manager to kill autovacuum when somebody blocked behind it.
 This mechanism should ensure that AV never takes an exclusive lock
 for long enough to be a serious problem, so do we need that anymore?

 Wouldn't that make DROP TABLE stop working while autovac is processing
 the table?

Meh ... I guess you're right.  I wasn't thinking about exclusive locks
being taken elsewhere.

regards, tom lane




Re: [HACKERS] Enabling Checksums

2013-04-12 Thread Bruce Momjian
On Wed, Apr 10, 2013 at 11:19:56AM -0700, Jeff Davis wrote:
 On Wed, 2013-04-10 at 11:01 +0300, Ants Aasma wrote:
  I think we should first deal with using it for page checksums and if
  future versions want to reuse some of the code for WAL checksums then
  we can rearrange the code.
 
 Sounds good to me, although I expect we at least want any assembly to be
 in a separate file (if the specialization makes it in 9.3).

Sounds good.  Simon has done a good job shepherding this to completion. 

My only question is whether the 16-bit page checksums stored in WAL
reduce our ability to detect failed/corrupt writes to WAL?

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] [ADMIN] after 9.2.4 patch vacuumdb -avz not analyzing all tables

2013-04-12 Thread Mike Broers
Looks like psql vacuum (verbose, analyze) is not reflected in
pg_stat_user_tables either, in some cases.  In this scenario I run the
command, it outputs all the deleted pages etc. (unlike the vacuumdb -avz
analyze that seemed to be skipped in the log), but it does not update
pg_stat_user_tables.  That's probably expected based on the description
previously reported, but I wanted to confirm what I was seeing.


On Fri, Apr 12, 2013 at 10:36 AM, Kevin Grittner kgri...@ymail.com wrote:

 Scott Marlowe scott.marl...@gmail.com wrote:

  Does this behavior only affect the 9.2 branch? Or was it ported
  to 9.1 or 9.0 or 8.4 as well?

 After leaving it on master for a while to see if anyone reported
 problems in development, I back-patched as far as 9.0 in time for
 the 9.2.3 (and related) patches.  Prior to that the code was too
 different for it to be the same patch, and (perhaps not entirely
 coincidentally) I had not seen the problems before 9.0.  From 9.0
 on I have seen multiple sites (all using queuing from Slony or a
 JMS implementation) with recurring problems when the queue
 temporarily got large, shrank again, and then wrapped around to the
 beginning of the table's file space.  In some cases performance was
 so impaired that when such an event was triggered they would shut
 down their application until a manual VACUUM could be run.

 --
 Kevin Grittner
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company



Re: [HACKERS] Patch to make pgindent work cleanly

2013-04-12 Thread Gurjeet Singh
On Fri, Apr 12, 2013 at 11:44 AM, Bruce Momjian br...@momjian.us wrote:

 On Tue, Feb 19, 2013 at 04:50:45PM -0500, Gurjeet Singh wrote:
  Please find attached the patch for some cleanup and fix bit rot in
 pgindent
  script.
 
  There were a few problems with the script.
 
  .) It failed to use the $ENV{PGENTAB} even if it was set.
  .) The file it tries to download from Postgres' ftp site is no longer
 present.
  ftp://ftp.postgresql.org/pub/dev/indent.netbsd.patched.tgz
  .) The tar command extracted the above-mentioned file to a child
 directory, but
  did not descend into it before running make on it.
  I think it expected a tarbomb, but clearly the current .tgz file on
 ftp
  site is not a tarbomb.
 
  I don't like the fact that it dies with a message "fetching xyz" rather
  than saying "Could not fetch xyz", but this patch does not address that
  since it doesn't really affect the output when the script does work.
 
  With this patch in place I could very easily run it on any arbitrary
 file,
  which seemed like a black-magic before the patch.
 
  src/tools/pgindent/pgindent --build your file path here

 I have applied this patch.  However, I modified the tarball name to
 reference $INDENT_VERSION, rather than hard-coding 1.2.


Thanks!

Can you also improve the output when it dies upon failure to fetch
something? Currently the only error message it emits is "fetching xyz", which
leaves the user confused as to what the problem really was. The only
indication of a problem might be the exit code, but I'm not sure of that,
and that doesn't help an interactive user running it on a terminal.

-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


Re: [HACKERS] (auto)vacuum truncate exclusive lock

2013-04-12 Thread Tom Lane
Kevin Grittner kgri...@ymail.com writes:
 Tom Lane t...@sss.pgh.pa.us wrote:
 I think that the minimum appropriate fix here is to revert the hunk
 I quoted, ie take out the suppression of stats reporting and analysis.

 I'm not sure I understand -- are you proposing that is all we do
 for both the VACUUM command and autovacuum?

No, I said that was the minimum fix.

Looking again at the patch, I note this comment:

   /*
+* We failed to establish the lock in the specified number of
+* retries. This means we give up truncating. Suppress the
+* ANALYZE step. Doing an ANALYZE at this point will reset the
+* dead_tuple_count in the stats collector, so we will not get
+* called by the autovacuum launcher again to do the truncate.
+*/

and I suppose the rationale for suppressing the stats report was this
same idea of lying to the stats collector in order to encourage a new
vacuum attempt to happen right away.  Now I'm not sure that that's a
good idea at all --- what's the reasoning for thinking the table will be
any less hot in thirty seconds?  But if it is reasonable, we need a
redesign of the reporting messages, not just a hack to not tell the
stats collector what we did.

Are you saying you intend to revert that whole concept?  That'd be
okay with me, I think.  Otherwise we need some thought about how to
inform the stats collector what's really happening.

regards, tom lane




Re: [HACKERS] (auto)vacuum truncate exclusive lock

2013-04-12 Thread Alvaro Herrera
Tom Lane escribió:

 Are you saying you intend to revert that whole concept?  That'd be
 okay with me, I think.  Otherwise we need some thought about how to
 inform the stats collector what's really happening.

Maybe what we need is to consider table truncation as a separate
activity from vacuuming.  Then autovacuum could invoke it without having
to do a full-blown vacuum.  For this to work, I guess we would like to
separately store the status of the back-scan in pgstat somehow (I think
a boolean flag suffices: were we able to truncate all pages that
appeared to be empty?)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Detach/attach table and index data files from one cluster to another

2013-04-12 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 Well, it wouldn't be that hard to replace XIDs with FrozenXID or
 InvalidXID as appropriate, if you had access to the source database's
 clog while you did the copying.  It just wouldn't be very fast.

If you're doing that in a streaming method, it strikes me that it'd be
plenty fast.

 I suppose it would still be faster than a COPY transfer, but I'm not
 sure it'd be enough faster to justify the work and the additional
 portability hits you'd be taking.

The big win here over a binary COPY is pulling through the indexes as-is
as well- without having to rebuild them.

Thanks,

Stephen




Re: [HACKERS] Detach/attach table and index data files from one cluster to another

2013-04-12 Thread Tom Lane
Stephen Frost sfr...@snowman.net writes:
 * Tom Lane (t...@sss.pgh.pa.us) wrote:
 I suppose it would still be faster than a COPY transfer, but I'm not
 sure it'd be enough faster to justify the work and the additional
 portability hits you'd be taking.

 The big win here over a binary COPY is pulling through the indexes as-is
 as well- without having to rebuild them.

Meh.  That raises the ante another substantial multiple with respect to
the amount of portability risk (eg, you're now absolutely dependent on
locale sort orders to be identical in both databases).  And I think
you'd have to freeze all updates to the table while you were copying the
table+indexes, if you wanted them to be consistent.

I can't imagine that we'd accept a patch that says to the recipient
database, here are some large binary blobs, please believe that
they represent a valid table and associated indexes.  Oh, and don't you
dare try to actually check them, because that would be slow.

Some other interesting things to think about here would be toast-table
OIDs embedded in toast pointers, data type OIDs embedded in arrays (and
maybe records too, I forget), enum value OIDs, btree vacuum cycle IDs,
GiST NSNs ... not sure what else, but I bet that's not a complete list.

regards, tom lane




Re: [HACKERS] (auto)vacuum truncate exclusive lock

2013-04-12 Thread Kevin Grittner
[some relevant dropped bits of the thread restored]

Tom Lane t...@sss.pgh.pa.us wrote:
 Kevin Grittner kgri...@ymail.com writes:
 Tom Lane t...@sss.pgh.pa.us wrote:
 Kevin Grittner kgri...@ymail.com writes:
 Jeff Janes jeff.ja...@gmail.com wrote:

 I propose to do the following:

 (1)  Restore the prior behavior of the VACUUM command.  This
 was only ever intended to be a fix for a serious autovacuum
 problem which caused many users serious performance problems

 (2)  If autovacuum decides to try to truncate but the lock
 cannot be initially acquired, and analyze is requested, skip
 the truncation and do the autoanalyze.

 I think that the minimum appropriate fix here is to [...] take
 out the suppression of stats reporting and analysis.
 
 I'm not sure I understand -- are you proposing that is all we do
 for both the VACUUM command and autovacuum?
 
 No, I said that was the minimum fix.

OK, I suggested that and more, so I wasn't sure what you were
getting at.

  OK, I see that now.  In the old behavior, if the lock was
 acquired, but then we were shoved off from it, the analyze
 was not done.  But, in the old behavior if the lock was never
 acquired at all, then it would go ahead to do the
 autoanalyze,

 Ah, I see now.  So the actual worst case for the old code, in
 terms of both head-banging and statistics, was if autovacuum
 was able to acquire the lock but then many tasks all piled up
 behind its lock.  If the system was even *more* busy it would
 not acquire the lock at all, and would behave better.

 and I suppose the rationale for suppressing the stats report was
 this same idea of lying to the stats collector in order to
 encourage a new vacuum attempt to happen right away.

I think Jan expressed some such sentiment back during the original
discussion.  I was not persuaded by that; but he pointed out that
if the deadlock killer killed an autovacuum process which was doing
a truncate, then it did not get to the statistics phase; so I agreed
that any change in that behavior should be a separate patch.  I
missed the fact that if it failed to initially get the lock it did
proceed to the statistics phase.  I explained this earlier in this
thread.  No need to cast about for hypothetical explanations.

 Now I'm not sure that that's a good idea at all

I'm pretty sure it isn't; that's why I proposed changing it.

 But if it is reasonable, we need a redesign of the reporting
 messages, not just a hack to not tell the stats collector what we
 did.

The idea was to try to make as small a change in previous behavior
as possible.  Jan pointed out that when the deadlock detection code
killed an autovacuum worker which was trying to truncate, the
statistics were not updated, leading to retries.  This was an
attempt to *not change* existing behavior.  It was wrong, because
we both missed the fact that if it didn't get the lock in the first
place it went ahead with statistics generation.  That being the
case, I was proposing we always generate statistics if we were
supposed to.  That would be a change toward *more* up-to-date
statistics and *fewer* truncation retries than we've had.  I'm OK
with that because a table hot enough to hit the issue will likely
need the space again or need another vacuum soon.

 Are you saying you intend to revert that whole concept?

No.  I was merely asking what you were suggesting.  As I said
earlier:

 I have seen cases where the old logic head-banged for hours or
 days without succeeding at the truncation attempt in
 autovacuum, absolutely killing performance until the user ran
 an explicit VACUUM.  And in the meantime, since the deadlock
 detection logic was killing autovacuum before it got to the
 analyze phase, the autoanalyze was never done.

 Otherwise we need some thought about how to inform the stats
 collector what's really happening.

I think we can probably improve that on some future release.  I
don't think a new scheme for that makes sense for back-patching or
9.3.

For now what I'm suggesting is generating statistics in all the
cases it did before, plus the case where it starts truncation but
does not complete it.  The fact that before this patch there were
cases where the autovacuum worker was killed, resulting in not
generating needed statistics seems like a bug, not a behavior we
need to preserve.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [sepgsql 2/3] Add db_schema:search permission checks

2013-04-12 Thread Kohei KaiGai
2013/4/12 Robert Haas robertmh...@gmail.com:
 On Fri, Apr 12, 2013 at 10:42 AM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
 Robert Haas escribió:
 On Mon, Apr 8, 2013 at 12:28 PM, Kohei KaiGai kai...@kaigai.gr.jp wrote:

  Also, the attached function-execute-permission patch is a rebased
  version. I rethought this: its event name should be OAT_FUNCTION_EXECUTE
  rather than OAT_FUNCTION_EXEC, following the convention of avoiding
  abbreviations. The other portions are the same as the previous ones.

 Great.  This looks fine to me, committed.  I also committed your
 getObjectIdentity patch, which caused some regression test output
 changes.  I applied the necessary correction before committing, I
 think, but please check.

 I think the function-execute code path is still using
 getObjectDescription rather than identity.

 Yeah, I think so.  KaiGai, can you provide a follow-on patch to fix that?

Yes, of course. The attached one replaces getObjectDescription in
sepgsql/proc.c, and includes the related changes in the regression test.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


sepgsql-v9.3-replace-get-object-description.v2.patch
Description: Binary data



Re: [HACKERS] [sepgsql 2/3] Add db_schema:search permission checks

2013-04-12 Thread Kohei KaiGai
 A problem with validating the sepgsql-regtest policy module is that the
 semodule command requires root privilege to list the installed policy
 modules, so I avoided using this command in the test_sepgsql script.
 However, I have an idea that would not make the script fail even if "sudo
 semodule -l" returns an error, except in the case where it runs correctly
 and the policy version is not the expected one.
 What is your opinion of this check?

 Not sure that's too useful.  And I don't like the idea of putting sudo
 commands in a test harness script.  That seems too much like the sort
 of thing bad people do.

OK, I also doubt whether my idea makes sense.

The attached patch omits the portion that checks the version of
sepgsql-regtest, and adds a notice in the documentation instead.
Also, it changes the current directory to contrib/sepgsql at the top of
the script, to avoid problems when test_sepgsql is run from a
directory other than contrib/sepgsql.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp


sepgsql-v9.3-test-script-fixup.v2.patch
Description: Binary data



Re: [HACKERS] Detach/attach table and index data files from one cluster to another

2013-04-12 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 Stephen Frost sfr...@snowman.net writes:
  The big win here over a binary COPY is pulling through the indexes as-is
  as well- without having to rebuild them.

[... lots of reasons this is hard ...]

I agree that it's quite a bit more difficult, to the point that logical
replication which can be selective (eg: give me only table X + indexes)
might end up being the only answer, but otherwise this approach will
likely only be a modest improvement over binary COPY FREEZE, and then
only because we essentially end up skipping the type validation (which
we could just provide as an option, similar to COPY FREEZE...).

Thanks,

Stephen




Re: [HACKERS] Enabling Checksums

2013-04-12 Thread Jeff Davis
On Thu, 2013-04-11 at 20:12 +0100, Simon Riggs wrote:

 So, if we apply a patch like the one attached, we then end up with the
 WAL checksum using the page checksum as an integral part of its
 calculation. (There is no increase in code inside WALInsertLock,
 nothing at all touched in that area).
 
 
 Then all we need to do is make PageSetChecksumInplace() use Ants' algo
 and we're done.
 
 
 Only point worth discussing is that this change would make backup
 blocks be covered by a 16-bit checksum, not the CRC-32 it is now. i.e.
 the record header is covered by a CRC32 but the backup blocks only by
 16-bit. 

FWIW, that's fine with me. 

 (Attached patch is discussion only. Checking checksum in recovery
 isn't coded at all.)

I like it.

A few points:

* Given that setting the checksum is unconditional in a backup block, do
we want to zero the checksum field when the backup block is restored if
checksums are disabled? Otherwise we would have a strange situation
where some blocks have a checksum on disk even when checksums are
disabled.

* When we do PageSetChecksumInplace(), we need to be 100% sure that the
hole is empty; otherwise the checksum will fail when we re-expand it. It
might be worth a memset beforehand just to be sure.

Regards,
Jeff Davis






Re: [HACKERS] (auto)vacuum truncate exclusive lock

2013-04-12 Thread Tom Lane
Kevin Grittner kgri...@ymail.com writes:
 For now what I'm suggesting is generating statistics in all the
 cases it did before, plus the case where it starts truncation but
 does not complete it.  The fact that before this patch there were
 cases where the autovacuum worker was killed, resulting in not
 generating needed statistics seems like a bug, not a behavior we
 need to preserve.

Well, in the case where it gets killed, it's still not gonna generate
statistics.  What we've really got here is a new case that did not exist
before, namely that it voluntarily stops truncating.  But I agree that
modeling that case's behavior on the kill case was poorly thought out.

In other words, yes, I think we're on the same page.

regards, tom lane




Re: [HACKERS] Enabling Checksums

2013-04-12 Thread Bruce Momjian
On Fri, Apr 12, 2013 at 12:07:36PM -0700, Jeff Davis wrote:
  (Attached patch is discussion only. Checking checksum in recovery
  isn't coded at all.)
 
 I like it.
 
 A few points:
 
 * Given that setting the checksum is unconditional in a backup block, do
 we want to zero the checksum field when the backup block is restored if
 checksums are disabled? Otherwise we would have a strange situation
 where some blocks have a checksum on disk even when checksums are
 disabled.
 
 * When we do PageSetChecksumInplace(), we need to be 100% sure that the
 hole is empty; otherwise the checksum will fail when we re-expand it. It
 might be worth a memset beforehand just to be sure.

Do we write the page holes to the WAL for full-page writes?  I hope we
don't.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] Patch to make pgindent work cleanly

2013-04-12 Thread Bruce Momjian
On Fri, Apr 12, 2013 at 01:34:49PM -0400, Gurjeet Singh wrote:
 Can you also improve the output when it dies upon failure to fetch something?
 Currently the only error message it emits is fetching xyz, and leaves the
 user confused as to what really the problem was. The only indication of a
 problem might be the exit code,  but I'm not sure of that, and that doesn't
 help the interactive user running it on terminal.

Good point.  I have reviewed all the error messages and improved them
with the attached, applied patch, e.g.:

cannot locate typedefs file "xyz" at /usr/local/bin/pgindent line 121.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/src/tools/pgindent/pgindent b/src/tools/pgindent/pgindent
index 2e9d443..73237ca 100755
--- a/src/tools/pgindent/pgindent
+++ b/src/tools/pgindent/pgindent
@@ -30,7 +30,7 @@ my %options = (
 	"excludes=s"  => \$excludes,
 	"indent=s"    => \$indent,
 	"build"       => \$build,);
-GetOptions(%options) || die "bad command line";
+GetOptions(%options) || die "bad command line argument";
 
 run_build($code_base) if ($build);
 
@@ -118,10 +118,11 @@ sub load_typedefs
 		  if (-f "$tdtry/src/tools/pgindent/typedefs.list");
 		$tdtry = "$tdtry/..";
 	}
-	die "no typedefs file" unless $typedefs_file && -f $typedefs_file;
+	die "cannot locate typedefs file \"$typedefs_file\""
+		unless $typedefs_file && -f $typedefs_file;
 
 	open(my $typedefs_fh, '<', $typedefs_file)
-	  || die "opening $typedefs_file: $!";
+	  || die "cannot open typedefs file \"$typedefs_file\": $!";
 	my @typedefs = <$typedefs_fh>;
 	close($typedefs_fh);
 
@@ -143,7 +144,7 @@ sub process_exclude
 {
 	if ($excludes && @files)
 	{
-		open(my $eh, '<', $excludes) || die "opening $excludes";
+		open(my $eh, '<', $excludes) || die "cannot open exclude file \"$excludes\"";
 		while (my $line = <$eh>)
 		{
 			chomp $line;
@@ -162,7 +163,7 @@ sub read_source
 	my $source;
 
 	open(my $src_fd, '<', $source_filename)
-	  || die "opening $source_filename: $!";
+	  || die "cannot open file \"$source_filename\": $!";
 	local ($/) = undef;
 	$source = <$src_fd>;
 	close($src_fd);
@@ -177,7 +178,7 @@ sub write_source
 	my $source_filename = shift;
 
 	open(my $src_fh, '>', $source_filename)
-	  || die "opening $source_filename: $!";
+	  || die "cannot open file \"$source_filename\": $!";
 	print $src_fh $source;
 	close($src_fh);
 }
@@ -436,25 +437,25 @@ sub run_build
 		$code_base = "$code_base/..";
 	}
 
-	die "no src/tools/pgindent directory in $code_base"
+	die "cannot locate src/tools/pgindent directory in \"$code_base\""
 	  unless -d "$code_base/src/tools/pgindent";
 
 	chdir "$code_base/src/tools/pgindent";
 
-	my $rv = getstore("http://buildfarm.postgresql.org/cgi-bin/typedefs.pl",
-		"tmp_typedefs.list");
+	my $typedefs_list_url = "http://buildfarm.postgresql.org/cgi-bin/typedefs.pl";
 
-	die "fetching typedefs.list" unless is_success($rv);
+	my $rv = getstore($typedefs_list_url, "tmp_typedefs.list");
+
+	die "cannot fetch typedefs list from $typedefs_list_url" unless is_success($rv);
 
 	$ENV{PGTYPEDEFS} = abs_path('tmp_typedefs.list');
 
-	my $pg_bsd_indent_name = "pg_bsd_indent-" . $INDENT_VERSION . ".tar.gz";
+	my $pg_bsd_indent_url = "ftp://ftp.postgresql.org/pub/dev/pg_bsd_indent-" .
+	  $INDENT_VERSION . ".tar.gz";
 
-	$rv =
-	  getstore("ftp://ftp.postgresql.org/pub/dev/$pg_bsd_indent_name",
-		"pg_bsd_indent.tgz");
+	$rv = getstore($pg_bsd_indent_url, "pg_bsd_indent.tgz");
 
-	die "fetching $pg_bsd_indent_name" unless is_success($rv);
+	die "cannot fetch BSD indent tarfile from $pg_bsd_indent_url" unless is_success($rv);
 
 	# XXX add error checking here
 
@@ -484,7 +485,7 @@ sub build_clean
 		$code_base = "$code_base/..";
 	}
 
-	die "no src/tools/pgindent directory in $code_base"
+	die "cannot locate src/tools/pgindent directory in \"$code_base\""
 	  unless -d "$code_base/src/tools/pgindent";
 
 	chdir $code_base;



Re: [HACKERS] Enabling Checksums

2013-04-12 Thread Andres Freund
On 2013-04-11 20:12:59 +0100, Simon Riggs wrote:
 On 11 April 2013 04:27, Jeff Davis pg...@j-davis.com wrote:
 
  On Wed, 2013-04-10 at 20:17 +0100, Simon Riggs wrote:
 
   OK, so we have a single combined calculate a checksum for a block
   function. That uses Jeff's zeroing trick and Ants' bulk-oriented
   performance optimization.
  
  
   For buffer checksums we simply calculate for the block.
 
  Sounds good.
 
   For WAL full page writes, we first set the checksums for all defined
   buffers, then calculate the checksum of remaining data plus the
   pd_checksum field from each block using the normal WAL CRC32.
  
   Seems good to me. One set of fast code. And it avoids the weirdness
   that the checksum stored on the full page is actually wrong.
 
  Oh, that's a nice benefit.
 
 
 So, if we apply a patch like the one attached, we then end up with the WAL
 checksum using the page checksum as an integral part of its calculation.
 (There is no increase in code inside WALInsertLock, nothing at all touched
 in that area).
 
 Then all we need to do is make PageSetChecksumInplace() use Ants' algo and
 we're done.
 
 Only point worth discussing is that this change would make backup blocks be
 covered by a 16-bit checksum, not the CRC-32 it is now. i.e. the record
 header is covered by a CRC32 but the backup blocks only by 16-bit.

That means we will have to do the verification for this in
ValidXLogRecord() *not* in RestoreBkpBlock or somesuch. Otherwise we
won't always recognize the end of WAL correctly.
And I am a bit wary of reducing the likelihood of noticing the proper
end-of-recovery by reducing the crc width.

Why again are we doing this now? Just to reduce the overhead of CRC
computation for full page writes? Or are we foreseeing issues with the
page checksums being wrong because of non-zero data in the hole being
zero after the restore from bkp blocks?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2013-04-12 Thread Bruce Momjian
On Fri, Apr 12, 2013 at 09:28:42PM +0200, Andres Freund wrote:
  Only point worth discussing is that this change would make backup blocks be
  covered by a 16-bit checksum, not the CRC-32 it is now. i.e. the record
  header is covered by a CRC32 but the backup blocks only by 16-bit.
 
 That means we will have to do the verification for this in
 ValidXLogRecord() *not* in RestoreBkpBlock or somesuch. Otherwise we
 won't always recognize the end of WAL correctly.
 And I am a bit wary of reducing the likelihood of noticing the proper
 end-of-recovery by reducing the crc width.
 
 Why again are we doing this now? Just to reduce the overhead of CRC
 computation for full page writes? Or are we foreseeing issues with the
 page checksums being wrong because of non-zero data in the hole being
 zero after the restore from bkp blocks?

I thought the idea is that we were going to re-use the already-computed
CRC checksum on the page, and we only have 16-bits of storage for that.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] Patch to make pgindent work cleanly

2013-04-12 Thread Gurjeet Singh
On Fri, Apr 12, 2013 at 3:26 PM, Bruce Momjian br...@momjian.us wrote:

 On Fri, Apr 12, 2013 at 01:34:49PM -0400, Gurjeet Singh wrote:
  Can you also improve the output when it dies upon failure to fetch
 something?
  Currently the only error message it emits is fetching xyz, and leaves
 the
  user confused as to what really the problem was. The only indication of a
  problem might be the exit code,  but I'm not sure of that, and that
 doesn't
  help the interactive user running it on terminal.

 Good point.  I have reviewed all the error messages and improved them
 with the attached, applied patch, e.g.:

 cannot locate typedefs file xyz at /usr/local/bin/pgindent line
 121.


Thanks!

-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


Re: [HACKERS] Enabling Checksums

2013-04-12 Thread Andres Freund
On 2013-04-12 15:31:36 -0400, Bruce Momjian wrote:
 On Fri, Apr 12, 2013 at 09:28:42PM +0200, Andres Freund wrote:
   Only point worth discussing is that this change would make backup blocks 
   be
   covered by a 16-bit checksum, not the CRC-32 it is now. i.e. the record
   header is covered by a CRC32 but the backup blocks only by 16-bit.
  
  That means we will have to do the verification for this in
  ValidXLogRecord() *not* in RestoreBkpBlock or somesuch. Otherwise we
  won't always recognize the end of WAL correctly.
  And I am a bit wary of reducing the likelihood of noticing the proper
  end-of-recovery by reducing the crc width.
  
  Why again are we doing this now? Just to reduce the overhead of CRC
  computation for full page writes? Or are we foreseeing issues with the
  page checksums being wrong because of non-zero data in the hole being
  zero after the restore from bkp blocks?
 
 I thought the idea is that we were going to re-use the already-computed
 CRC checksum on the page, and we only have 16-bits of storage for that.

Well, but the proposal seems to be to do this also for non-checksum
enabled datadirs, so ...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Small reduction in memory usage of index relcache entries

2013-04-12 Thread Heikki Linnakangas
While debugging an out-of-memory situation, I noticed that the relcache 
entry of each index consumes 2k of memory. From a memory dump:



...
pg_database_datname_index: 2048 total in 1 blocks; 712 free (0 chunks); 
1336 used
pg_trigger_tgrelid_tgname_index: 2048 total in 1 blocks; 576 free (0 
chunks); 1472 used
pg_rewrite_rel_rulename_index: 2048 total in 1 blocks; 576 free (0 chunks); 
1472 used
pg_amproc_fam_proc_index: 2048 total in 1 blocks; 232 free (0 chunks); 1816 
used
pg_opclass_oid_index: 2048 total in 1 blocks; 664 free (0 chunks); 1384 used
pg_index_indexrelid_index: 2048 total in 1 blocks; 664 free (0 chunks); 
1384 used
pg_attribute_relid_attnum_index: 2048 total in 1 blocks; 528 free (0 
chunks); 1520 used
pg_class_oid_index: 2048 total in 1 blocks; 664 free (0 chunks); 1384 used


Looking closer at where that memory is spent, a lot of it goes into the 
FmgrInfo structs in RelationAmInfo. But some of them are outright unused 
(ambuild, ambuildempty, amcostestimate, amoptions), and a few others 
are called so seldom that they hardly need to be cached 
(ambulkdelete, amvacuumcleanup). Removing those fields, patch attached, 
reduces the memory usage nicely:



...
pg_database_datname_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 
used
pg_trigger_tgrelid_tgname_index: 1024 total in 1 blocks; 64 free (0 
chunks); 960 used
pg_rewrite_rel_rulename_index: 1024 total in 1 blocks; 64 free (0 chunks); 
960 used
pg_amproc_fam_proc_index: 3072 total in 2 blocks; 1736 free (2 chunks); 
1336 used
pg_opclass_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_index_indexrelid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 
used
pg_attribute_relid_attnum_index: 1024 total in 1 blocks; 16 free (0 
chunks); 1008 used
pg_class_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used


This brings most index entries from two 1k blocks down to one 1k block. 
A backend seems to load about 60 indexes into the cache for system 
tables, so that saves about ~60 kB of memory for every backend at a minimum.


This isn't a huge difference, unless you access a huge number of 
indexes. But the patch is trivial, and every little helps. Any 
objections if I commit this to master?


- Heikki
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index ce94649..b9e519a 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -671,14 +671,19 @@ index_bulk_delete(IndexVacuumInfo *info,
   void *callback_state)
 {
 	Relation	indexRelation = info->index;
-	FmgrInfo   *procedure;
+	RegProcedure procOid;
+	FmgrInfo	procedure;
 	IndexBulkDeleteResult *result;
 
 	RELATION_CHECKS;
-	GET_REL_PROCEDURE(ambulkdelete);
+
+	procOid = indexRelation->rd_am->ambulkdelete;
+	if (!RegProcedureIsValid(procOid))
+		elog(ERROR, "invalid ambulkdelete regproc");
+	fmgr_info(procOid, &procedure);
 
 	result = (IndexBulkDeleteResult *)
-		DatumGetPointer(FunctionCall4(procedure,
+		DatumGetPointer(FunctionCall4(&procedure,
 	  PointerGetDatum(info),
 	  PointerGetDatum(stats),
 	  PointerGetDatum((Pointer) callback),
@@ -698,14 +703,19 @@ index_vacuum_cleanup(IndexVacuumInfo *info,
 	 IndexBulkDeleteResult *stats)
 {
 	Relation	indexRelation = info->index;
-	FmgrInfo   *procedure;
+	RegProcedure procOid;
+	FmgrInfo	procedure;
 	IndexBulkDeleteResult *result;
 
 	RELATION_CHECKS;
-	GET_REL_PROCEDURE(amvacuumcleanup);
+
+	procOid = indexRelation->rd_am->amvacuumcleanup;
+	if (!RegProcedureIsValid(procOid))
+		elog(ERROR, "invalid amvacuumcleanup regproc");
+	fmgr_info(procOid, &procedure);
 
 	result = (IndexBulkDeleteResult *)
-		DatumGetPointer(FunctionCall2(procedure,
+		DatumGetPointer(FunctionCall2(&procedure,
 	  PointerGetDatum(info),
 	  PointerGetDatum(stats)));
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index a4daf77..05f2011 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -47,8 +47,8 @@ typedef LockInfoData *LockInfo;
 
 
 /*
- * Cached lookup information for the index access method functions defined
- * by the pg_am row associated with an index relation.
+ * Cached lookup information for the frequently used index access method
+ * functions, defined by the pg_am row associated with an index relation.
  */
 typedef struct RelationAmInfo
 {
@@ -60,13 +60,7 @@ typedef struct RelationAmInfo
 	FmgrInfo	amendscan;
 	FmgrInfo	ammarkpos;
 	FmgrInfo	amrestrpos;
-	FmgrInfo	ambuild;
-	FmgrInfo	ambuildempty;
-	FmgrInfo	ambulkdelete;
-	FmgrInfo	amvacuumcleanup;
 	FmgrInfo	amcanreturn;
-	FmgrInfo	amcostestimate;
-	FmgrInfo	amoptions;
 } RelationAmInfo;
 
 



Re: [HACKERS] Enabling Checksums

2013-04-12 Thread Heikki Linnakangas

On 12.04.2013 22:31, Bruce Momjian wrote:

On Fri, Apr 12, 2013 at 09:28:42PM +0200, Andres Freund wrote:

Only point worth discussing is that this change would make backup blocks be
covered by a 16-bit checksum, not the CRC-32 it is now. i.e. the record
header is covered by a CRC32 but the backup blocks only by 16-bit.


That means we will have to do the verification for this in
ValidXLogRecord() *not* in RestoreBkpBlock or somesuch. Otherwise we
won't always recognize the end of WAL correctly.
And I am a bit wary of reducing the likelihood of noticing the proper
end-of-recovery by reducing the crc width.

Why again are we doing this now? Just to reduce the overhead of CRC
computation for full page writes? Or are we forseeing issues with the
page checksums being wrong because of non-zero data in the hole being
zero after the restore from bkp blocks?


I thought the idea is that we were going to re-use the already-computed
CRC checksum on the page, and we only have 16-bits of storage for that.


No, the patch has to compute the 16-bit checksum for the page when the 
full-page image is added to the WAL record. There would otherwise be no 
need to calculate the page checksum at that point, but only later when 
the page is written out from shared buffer cache.


I think this is a bad idea. It complicates the WAL format significantly. 
Simon's patch didn't include the changes to recovery to validate the 
checksum, but I suspect it would be complicated. And it reduces the 
error-detection capability of WAL recovery. Keep in mind that unlike 
page checksums, which are never expected to fail (so even if we miss a 
few errors it's still better than nothing), the WAL checksum is used to 
detect end-of-WAL. There is expected to be a failure every time we do 
crash recovery. So far, we've considered a probability of one in 
2^32 small enough for that purpose, but IMHO one in 2^16 is much too weak.


If you want to speed up the CRC calculation of full-page images, you 
could have an optimized version of the WAL CRC algorithm, using e.g. 
SIMD instructions. Because typical WAL records are small, max 100-200 
bytes, and it consists of several even smaller chunks, the normal WAL 
CRC calculation is quite resistant to common optimization techniques. 
But it might work for the full-page images. Let's not conflate it with 
the page checksums, though.


- Heikki




Re: [HACKERS] Enabling Checksums

2013-04-12 Thread Simon Riggs
On 12 April 2013 21:03, Heikki Linnakangas hlinnakan...@vmware.com wrote:

 No, the patch has to compute the 16-bit checksum for the page when the
 full-page image is added to the WAL record. There would otherwise be no need
 to calculate the page checksum at that point, but only later when the page
 is written out from shared buffer cache.

 I think this is a bad idea. It complicates the WAL format significantly.
 Simon's patch didn't include the changes to recovery to validate the
 checksum, but I suspect it would be complicated. And it reduces the
 error-detection capability of WAL recovery. Keep in mind that unlike page
 checksums, which are never expected to fail (so even if we miss a few errors
 it's still better than nothing), the WAL checksum is used to detect
 end-of-WAL. There is expected to be a failure every time we do crash
 recovery. So far, we've considered a probability of one in 2^32 small
 enough for that purpose, but IMHO one in 2^16 is much too weak.

 If you want to speed up the CRC calculation of full-page images, you could
 have an optimized version of the WAL CRC algorithm, using e.g. SIMD
 instructions. Because typical WAL records are small, max 100-200 bytes, and
 it consists of several even smaller chunks, the normal WAL CRC calculation
 is quite resistant to common optimization techniques. But it might work for
 the full-page images. Let's not conflate it with the page checksums, though.

I accept the general tone of that as a reasonable perspective and in
many ways am on the fence myself. This is sensitive stuff.

A few points
* The code to validate the checksum isn't complex, though it is more
than the current one line. Let's say about 10 lines of clear code. I'll
work on that to show it's true. I don't see that as a point of
objection.

* WAL checksum is not used as the sole basis for end-of-WAL discovery.
We reuse the WAL files, so the prev field in each WAL record shows
what the previous end of WAL was. Hence if the WAL checksums give a
false positive we still have a double check that the data really is
wrong. It's unbelievable that you'd get a false positive and then have
the prev field match as well, even though it was the genuine
end-of-WAL.

Yes, we could also have a second SIMD calculation optimised for WAL
CRC32 on an 8192 byte block, rather than just one set of SIMD code for
both. We could also have a single set of SIMD code producing a 32-bit
checksum, then take the low 16 bits as we do currently.
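
To illustrate just the folding step, a tiny standalone sketch
(crc32_of_block() here is only a placeholder, not the real WAL CRC code):

    /* Illustration only: folding one 32-bit checksum value down to the
     * 16 bits that fit in the page's checksum field. */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    static uint32_t
    crc32_of_block(const void *data, size_t len)
    {
        (void) data;
        (void) len;
        return 0xDEADBEEFu;          /* placeholder value */
    }

    int
    main(void)
    {
        char        block[8192] = {0};
        uint32_t    crc = crc32_of_block(block, sizeof(block));
        uint16_t    low16 = (uint16_t) (crc & 0xFFFF);        /* "take the low 16 bits" */
        uint16_t    folded = (uint16_t) ((crc >> 16) ^ crc);  /* alternative: XOR-fold both halves */

        printf("low16=%04X folded=%04X\n", (unsigned) low16, (unsigned) folded);
        return 0;
    }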

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2013-04-12 Thread Jeff Davis
On Fri, 2013-04-12 at 15:21 -0400, Bruce Momjian wrote:
  * When we do PageSetChecksumInplace(), we need to be 100% sure that the
  hole is empty; otherwise the checksum will fail when we re-expand it. It
  might be worth a memset beforehand just to be sure.
 
 Do we write the page holes to the WAL for full-page writes?  I hope we
 don't.

No, but the page hole is included in the checksum.

Let's say that the page hole contains some non-zero bytes when we
calculate a checksum. If we then eliminate the page hole and later
reconstitute the page with zeros in its place, the page will no longer
match the checksum.

So, we need to be sure the original page hole is all-zero when we
calculate the checksum.
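
As a minimal sketch of the "memset beforehand" idea (my illustration,
not code from the patch; it assumes bufpage.h's PageHeader layout, the
existing PageSetChecksumInplace() routine, and that it is safe to
scribble on the hole at this point):

#include "postgres.h"
#include "storage/bufpage.h"

/*
 * Zero the hole between pd_lower and pd_upper before checksumming, so
 * that a page later reconstituted with a zeroed hole still verifies.
 * Caller is assumed to hold the appropriate buffer lock.
 */
static void
checksum_page_with_clean_hole(Page page, BlockNumber blkno)
{
    PageHeader  phdr = (PageHeader) page;

    if (phdr->pd_upper > phdr->pd_lower)
        memset((char *) page + phdr->pd_lower, 0,
               phdr->pd_upper - phdr->pd_lower);

    PageSetChecksumInplace(page, blkno);
}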

Regards,
Jeff Davis






Re: [HACKERS] Enabling Checksums

2013-04-12 Thread Jeff Davis
On Fri, 2013-04-12 at 21:28 +0200, Andres Freund wrote:
 That means we will have to do the verification for this in
 ValidXLogRecord() *not* in RestoreBkpBlock or somesuch. Otherwise we
 won't always recognize the end of WAL correctly.
 And I am a bit wary of reducing the likelihood of noticing the proper
 end-of-recovery by reducing the crc width.

Good point.

 Why again are we doing this now? Just to reduce the overhead of CRC
 computation for full page writes? Or are we foreseeing issues with the
 page checksums being wrong because of non-zero data in the hole being
 zero after the restore from bkp blocks?

That shouldn't be a problem, because the block is not expected to have a
proper checksum in WAL, and it will be recalculated before being
written. So I see these changes as mostly independent.

The reason we're discussing this right now is that, when choosing the
checksum algorithm, I was hoping that it might be usable in the future
for WAL backup blocks. I'm convinced that they can be; the primary
question now seems to be whether they should be, which does not need to
be settled right now in my opinion.

Anyway, I would be perfectly happy if we just got the SIMD algorithm in
for data pages. The support for changing the WAL checksums seems
lukewarm, and there might be quite a few alternatives (e.g. optimizing
the CRC for backup blocks as Heikki suggested) to achieve that
performance goal.

Regards,
Jeff Davis






Re: [HACKERS] Enabling Checksums

2013-04-12 Thread Jeff Davis
On Fri, 2013-04-12 at 23:03 +0300, Heikki Linnakangas wrote:
 I think this is a bad idea. It complicates the WAL format significantly. 
 Simon's patch didn't include the changes to recovery to validate the 
 checksum, but I suspect it would be complicated. And it reduces the 
 error-detection capability of WAL recovery. Keep in mind that unlike 
 page checksums (which are never expected to fail, so even if we miss a
 few errors it's still better than nothing), the WAL checksum is used to
 detect end-of-WAL. There is expected to be a failure every time we do
 crash recovery. Thus far, we've considered the probability of one in
 2^32 small enough for that purpose, but IMHO one in 2^16 is much too weak.

One thing that just occurred to me is that we could make the SIMD
checksum a 32-bit checksum, and reduce it down to 16 bits for the data
pages. That might give us more flexibility to later use it for WAL
without compromising on the error detection nearly as much (though
obviously that wouldn't work with Simon's current proposal which uses
the same data page checksum in a WAL backup block).
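
A sketch of the reduction step, assuming some 32-bit page checksum
routine (here just called checksum32(), a name I'm making up): WAL
could use the full 32-bit value, while the data page header stores a
16-bit fold of it. The folding method shown is only one possibility,
not what the current patch does.

#include <stdint.h>

/* Fold a 32-bit checksum down to 16 bits by XORing its halves. */
static inline uint16_t
fold_checksum_to_16(uint32_t csum32)
{
    return (uint16_t) (csum32 ^ (csum32 >> 16));
}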

In general, we have more flexibility with WAL because there is no
upgrade issue. It would be nice to share code with the data page
checksum algorithm; but really we should just use whatever offers the
best trade-off in terms of complexity, performance, and error detection
rate.

I don't think we need to decide all of this right now. Personally, I'm
satisfied having SIMD checksums on data pages now and leaving WAL
optimization until later.

Regards,
Jeff Davis







Re: [HACKERS] Enabling Checksums

2013-04-12 Thread Ants Aasma
On Sat, Apr 13, 2013 at 12:38 AM, Jeff Davis pg...@j-davis.com wrote:
 On Fri, 2013-04-12 at 23:03 +0300, Heikki Linnakangas wrote:
 I think this is a bad idea. It complicates the WAL format significantly.
 Simon's patch didn't include the changes to recovery to validate the
 checksum, but I suspect it would be complicated. And it reduces the
 error-detection capability of WAL recovery. Keep in mind that unlike
  page checksums (which are never expected to fail, so even if we miss a
  few errors it's still better than nothing), the WAL checksum is used to
  detect end-of-WAL. There is expected to be a failure every time we do
  crash recovery. Thus far, we've considered the probability of one in
  2^32 small enough for that purpose, but IMHO one in 2^16 is much too weak.

 One thing that just occurred to me is that we could make the SIMD
 checksum a 32-bit checksum, and reduce it down to 16 bits for the data
 pages. That might give us more flexibility to later use it for WAL
 without compromising on the error detection nearly as much (though
 obviously that wouldn't work with Simon's current proposal which uses
 the same data page checksum in a WAL backup block).

The simple 32-bit version of the algorithm would need CPU capability
checks for the fast version and would work only on CPUs produced in
the last few years. Not a show stopper, but it means more complex code
and less applicability for sure.
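
For example, a runtime capability test could use standard cpuid probing
via GCC's cpuid.h; SSE4.1 is used here purely to illustrate the kind of
check that would be needed, not as a statement about which instructions
the checksum actually requires:

#include <cpuid.h>
#include <stdbool.h>

/* Return true if the CPU advertises SSE4.1, so a fast path is safe. */
static bool
cpu_has_sse41(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return (ecx & bit_SSE4_1) != 0;
}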

An alternative would be to calculate two 16-bit checksums, concatenate
them for the 32-bit checksum and add them together for the 16-bit one.
In this case we wouldn't need to change the current algorithm. A future
change could just factor out everything up to the final add as the
common function. But keep in mind that we are talking about sharing
roughly 400 bytes of machine code here.
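
A tiny sketch of that combination step (names are assumed; a and b
stand for the two independently computed 16-bit sums over the same
data):

#include <stdint.h>

/* Concatenate the two halves for a 32-bit WAL checksum... */
static inline uint32_t
combine_for_wal(uint16_t a, uint16_t b)
{
    return ((uint32_t) a << 16) | b;
}

/* ...and add them for the 16-bit data page checksum. */
static inline uint16_t
combine_for_page(uint16_t a, uint16_t b)
{
    return (uint16_t) (a + b);
}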

 In general, we have more flexibility with WAL because there is no
 upgrade issue. It would be nice to share code with the data page
 checksum algorithm; but really we should just use whatever offers the
 best trade-off in terms of complexity, performance, and error detection
 rate.

 I don't think we need to decide all of this right now. Personally, I'm
 satisfied having SIMD checksums on data pages now and leaving WAL
 optimization until later.

+1

I feel quite uneasy about reducing the effectiveness of WAL end
detection. There are many ways to improve WAL performance and I have
no idea which would be the best one. At the very least some performance
tests are in order. As this is not an essential part of having usable
checksums but a general performance optimization, I feel it is not
fair to others to postpone the release to resolve this now. I'd be
more than happy to research this for 9.4.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Small reduction in memory usage of index relcache entries

2013-04-12 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 Looking closer at where that memory is spent, a lot of it goes into the 
 FmgrInfo structs in RelationAmInfo. But some of them are outright unused 
 (ambuild, ambuildempty, amcostestimate, amoptions), and a few others 
 are called so seldom that they hardly need to be cached 
 (ambulkdelete, amvacuumcleanup). Removing those fields, patch attached, 
 reduces the memory usage nicely:

+1, but it would look nicer if you could encapsulate the code in some
macro analogous to GET_REL_PROCEDURE(), perhaps call it
GET_UNCACHED_REL_PROCEDURE()?
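
Something along these lines, perhaps (a sketch only, mirroring the
existing GET_REL_PROCEDURE() in indexam.c; it assumes a local FmgrInfo
variable named "procedure" in the calling function, and is not
necessarily what the final macro would look like):

#define GET_UNCACHED_REL_PROCEDURE(pname) \
do { \
    RegProcedure    procOid = indexRelation->rd_am->pname; \
    if (!RegProcedureIsValid(procOid)) \
        elog(ERROR, "invalid %s regproc", CppAsString(pname)); \
    fmgr_info(procOid, &procedure); \
} while(0)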

regards, tom lane




Re: [HACKERS] Hash Join cost estimates

2013-04-12 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 Stephen Frost sfr...@snowman.net writes:
  I'm certainly curious about those, but I'm also very interested in the
  possibility of making NTUP_PER_BUCKET much smaller, or perhaps variable
  depending on the work_mem setting.
 
 Not sure about that.  That would make the hash-bucket-header array
 larger without changing the size of the rest of the hash table, thus
 probably making the CPU cache situation worse not better (which would
 manifest as more time at the first of these two statements relative to
 the second).

In the testing that I've done, I've yet to find a case where having a
smaller NTUP_PER_BUCKET makes things worse, and it can yield a dramatic
improvement.  Regarding the memory situation, I'm not sure that using
buckets really helps, though I'm no CPU architect.  Assuming the CPU is
smart enough to only pull in the bits of memory that it needs, I'm not
sure how pulling in parts of a large array of pointers is worse than
fetching randomly placed entries in a bucket, especially since we have
to step through an entire bucket each time and the bucket tuples are
allocated independently, while the hash array is allocated once.
Indeed, it seems you'd be more likely to get other things you need in
a given cache-line fetch from the hash table array than from the bucket
tuple entries, which might well include unrelated garbage.

 Can you add some instrumentation code to get us information about the
 average/max length of the bucket chains?  And maybe try to figure out
 how many distinct hash values per bucket, which would give us a clue
 whether your two-level-list idea is worth anything.

I ended up implementing the two-level system and doing some testing with
it.  It ends up making hash table building take quite a bit longer and
only improves the scan performance in very select cases.  The
improvement requires an individual bucket to have both lots of dups and
a lot of distinct values, because it only helps when it can skip a
significant number of tuples.  If we move to a system where there's
rarely more than one distinct value in a bucket, then the chances of a
serious improvement from the two-level system go down that much more.

As such, I've come up with this (trivial) patch, which simply modifies
ExecChooseHashTableSize() to ignore NTUP_PER_BUCKET (essentially
treating it as 1) when work_mem is large enough to fit the entire hash
table (which also implies that there is only one batch).  I'd love to
hear feedback from others on what this does under different conditions.
This also makes the hash-the-small-table case faster for the test case
which I provided earlier (http://snowman.net/~sfrost/test_case2.sql),
and it uses quite a bit less memory too.  Now that I've convinced myself
that the two-level system isn't practical and the hash-the-small-table
case is actually faster than hash-the-big-table, I'll start looking
at improving on your proposal to change the costing to favor the smaller
table being hashed more often.

Thanks,

Stephen
colordiff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
new file mode 100644
index 6a2f236..a8a8168
*** a/src/backend/executor/nodeHash.c
--- b/src/backend/executor/nodeHash.c
*** ExecChooseHashTableSize(double ntuples,
*** 489,496 
  		/* We expect the hashtable to fit in memory */
  		double		dbuckets;
  
! 		dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
! 		dbuckets = Min(dbuckets, max_pointers);
  		nbuckets = (int) dbuckets;
  
  		nbatch = 1;
--- 489,495 
  		/* We expect the hashtable to fit in memory */
  		double		dbuckets;
  
! 		dbuckets = Min(ntuples, max_pointers);
  		nbuckets = (int) dbuckets;
  
  		nbatch = 1;

