Re: [HACKERS] error: could not find pg_class tuple for index 2662

2011-07-31 Thread daveg
On Sun, Jul 31, 2011 at 11:44:39AM -0400, Tom Lane wrote:
> daveg  writes:
> > Here is the update: the problem happens with vacuum full alone, no reindex
> > is needed to trigger it. I updated the script to avoid reindexing after
> vacuum. Over the past two days there are still many occurrences of this
> > error coincident with the vacuum.
> 
> Well, that jives with the assumption that the one case we saw in
> the buildfarm was the same thing, because the regression tests were
> certainly only doing a VACUUM FULL and not a REINDEX of pg_class.
> But it doesn't get us much closer to understanding what's happening.
> In particular, it seems to knock out most ideas associated with race
> conditions, because the VAC FULL should hold exclusive lock on pg_class
> until it's completely done (including index rebuilds).
> 
> I think we need to start adding some instrumentation so we can get a
> better handle on what's going on in your database.  If I were to send
> you a source-code patch for the server that adds some more logging
> printout when this happens, would you be willing/able to run a patched
> build on your machine?

Yes, we can run an instrumented server so long as the instrumentation does
not interfere with normal operation. However, scheduling downtime to switch
binaries is difficult, and generally needs to happen on a weekend, but
sometimes can be expedited. I'll look into that.

> (BTW, just to be perfectly clear ... the "could not find pg_class tuple"
> errors always mention index 2662, right, never any other number?)

Yes, only index 2662, never any other.

I'm attaching a somewhat redacted log for two different databases on the same
instance around the time of vacuum full of pg_class in each database.
My observations so far are:

 - the error occurs at commit of a vacuum full of pg_class
 - in these cases the error hits autovacuum after it waited for a lock on pg_class
 - in these two cases a new process started up while the vacuum was
   running; I don't know if this is relevant
 - while these instances hit autovacuum, the error does hit other processes
   too (just not in these sessions); it is unknown whether autovacuum is a
   required component

-dg

-- 
David Gould   da...@sonic.net  510 536 1443  510 282 0869
If simplicity worked, the world would be overrun with insects.
NOTES:
-- most statements for db c01 in one catalog vacuum session
-- database c01 is oid 16404
-- applications run as user 'app'
-- vacuums run as user postgres
-- somewhat redacted

MM:SS.ms   pid    db   user      log text
--------   ---    --   ----      --------

03:00.350  2716   c01  postgres  vacuum full pg_catalog.pg_class;
03:00.507  2866  LOG: process 2866 acquired ShareUpdateExclusiveLock on relation 2604 of database 16416 after 10143.750 ms
03:01.222  21685  c01  app   11100.670 ms
03:01.223  21685  c01  app  SELECT DISTINCT
03:01.224  21685  c01  app   1.524 ms
03:02.031  3001   c01  app  connection authorized: user=app database=c01
03:04.952  2787  LOG: process 2787 still waiting for AccessShareLock on relation 1259 of database 16412 after 5001.035 ms
03:05.065  2787  LOG: process 2787 acquired AccessShareLock on relation 1259 of database 16412 after 5114.253 ms
03:05.564  2977  LOG: process 2977 still waiting for AccessShareLock on relation 1259 of database 16404 after 5000.970 ms
03:05.640  2731  LOG: process 2731 still waiting for RowExclusiveLock on relation 1259 of database 16404 after 5000.186 ms
03:06.045  2977  LOG: process 2977 acquired AccessShareLock on relation 1259 of database 16404 after 5482.389 ms
03:06.045  2731  LOG: process 2731 acquired RowExclusiveLock on relation 1259 of database 16404 after 5405.652 ms
03:06.046  2731  ERROR: could not find pg_class tuple for index 2662
03:06.046  2731  CONTEXT: automatic vacuum of table "c01.pg_catalog.pg_index"
03:06.046  2731  ERROR: could not find pg_class tuple for index 2662
03:06.046  2716   c01  postgres   5696.537 ms
03:06.056  3001   c01  app  SET SESSION TIME ZONE 'UTC'; ...
03:06.057  2716   c01  postgres  vacuum full pg_catalog.pg_rewrite;
03:06.066  3001   c01  app   10.459 ms
03:06.091  3001   c01  app  disconnection: session time: 0:00:04.085 user=app database=c01 host=xxxl01
03:08.908  3006  LOG: process 3006 still waiting for AccessShareLock on relation 2659 of database 16407 after 5000.778 ms
03:11.777  21685  c01  app  SELECT
03:11.779  21685  c01  app   2.296 ms
03:11.779  21685  c01  app  SELECT
03:11.780  21685  c01  app   0.328 ms
03:11.798  21685  c01  app  SELECT
03:11.799  21685  c01  app   0.348 ms
03:11.800  21685  c01  app  SELECT
03:11.800  21685  c01  app   0.205 ms
03:11.804  21685  c01  app  SELECT
03:11.805  21685  c01  app   0.589 ms
03:11.806  21685  c01  app  SELECT DISTINCT
03:11.809  21685  c01  app   3.552 ms
03:11.810  21685  c01  app  SELECT
03:12.554  21685  c01  app   744.233 ms
03:12.555  21685  c01  app  SELECT DISTINCT
03:12.556  21685  c01  app   1.136 ms
03:13.153  2716   c01  post

Re: [HACKERS] SSI heap_insert and page-level predicate locks

2011-07-31 Thread Jeff Davis
On Wed, 2011-06-08 at 17:29 -0500, Kevin Grittner wrote:
> Heikki Linnakangas  wrote:
> > heap_insert() calls CheckForSerializableConflictIn(), which checks if
> > there is a predicate lock on the whole relation, or on the page we're
> > inserting to. It does not check for tuple-level locks, because there
> > can't be any locks on a tuple that didn't exist before.
> > AFAICS, the check for page lock is actually unnecessary. A page-level
> > lock on a heap only occurs when tuple-level locks are promoted. It is
> > just a coarser-grain representation of holding locks on all tuples on
> > the page, *that exist already*. It is not a "gap" lock like the index
> > locks are, it doesn't need to conflict with inserting new tuples on the
> > page. In fact, if heap_insert chose to insert the tuple on some other
> > heap page, there would have been no conflict.
>  
> Absolutely correct.  Patch attached.

I like the change, but the comment is slightly confusing. Perhaps
something like:

"For a heap insert, we only need to check for table locks. Our new tuple
can't possibly conflict with existing tuple locks, and heap page locks
are only consolidated versions of tuple locks. The index insert will
check for any other predicate locks."

would be a little clearer? (The last sentence is optional; I only
included it because the original comment mentioned indexes.)

Same for heap_update().

Regards,
Jeff Davis






-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] lazy vxid locks, v3

2011-07-31 Thread Jeff Davis
On Wed, 2011-07-20 at 13:41 -0400, Robert Haas wrote:
> I took another look at v2 of my lazy vxid locks patch and realized
> that it was pretty flaky in a couple of different ways.  Here's a
> version that I think is a bit more robust, but considering the extent
> of the revisions, it probably needs another round of review from
> someone before I commit it.
> 
> Any review appreciated; I would prefer not to have to wait until
> October to get this committed, since there is quite a bit of follow-on
> work that I would like to do as well.  FWIW, the performance
> characteristics are basically identical to the previous versions,
> AFAICT.
> 

fpLocalTransactionId is redundant with the lxid, and the explanation is
that they have different locking semantics. That looks reasonable, and it
avoids the need for the careful ordering while starting/ending a
transaction that was present in v2.

However, it also looks like you're using it for another purpose:

In VirtualXactLockTableCleanup():
/*
 * If fpVXIDLock has been cleared without touching fpLocalTransactionId,
 * that means someone transferred the lock to the main lock table.
 */
if (!fastpath && LocalTransactionIdIsValid(lxid))

Is the "&& LocalTransactionIdIsValid(lxid)" a guard against calling
VirtualXactLockTableCleanup twice? Can that happen? Or is it just
defensive coding to avoid making an additional assumption?

Regards,
Jeff Davis

PS: In the recent sinval synch patch, you had a typo: "If we haven't
catch up completely". Other than that, it looked good.




Re: [HACKERS] python cleanup

2011-07-31 Thread Andrew Dunstan



On 07/25/2011 12:03 PM, Tom Lane wrote:

Andrew Dunstan  writes:

On 07/25/2011 10:52 AM, Tom Lane wrote:

What is features.h, and have its authors read the POSIX standard?
AFAICS they have no business defining this symbol.

 [andrew@emma ~]$ rpm -q -f /usr/include/features.h
 glibc-headers-2.13-1.x86_64

Oh, for some reason I was thinking this was mingw-specific.

[ pokes around ... ]  I still think it's a bad idea for the header
files to be defining this, but they'll probably point at the part
of the POSIX spec that says the results are undefined if the macro
is changed after the first system header is #included.

I can't immediately think of any way to actually do what you were
trying to do (ie, save and restore the definition of the macro).
I wonder whether it would be good enough to do this:

#include postgres.h

#include everything else we want except python headers

#undef _POSIX_C_SOURCE
#undef _XOPEN_SOURCE

#include python headers

... rest of .c file ...

This should only fail if (a) some macro imported from system headers
attempts to test the value of a feature macro, and (b) the results
vary between the system default setting and the setting the python
headers selected.  Neither of these things seem very probable.



OK, attached gives a clean build and passes regression on my Windows box 
that builds with Python. I had to undefine a few more things and save 
and restore our *snprintf settings (with code borrowed from plperl.h, 
where we did this sort of cleanup a while ago).


cheers

andrew


diff --git a/src/pl/plpython/plpython.c b/src/pl/plpython/plpython.c
index e03d7ce..da34a17 100644
--- a/src/pl/plpython/plpython.c
+++ b/src/pl/plpython/plpython.c
@@ -6,6 +6,56 @@
  *
  */
 
+#include "postgres.h"
+
+/* system stuff */
+#include 
+#include 
+
+/* postgreSQL stuff */
+#include "catalog/pg_proc.h"
+#include "catalog/pg_type.h"
+#include "commands/trigger.h"
+#include "executor/spi.h"
+#include "funcapi.h"
+#include "fmgr.h"
+#include "mb/pg_wchar.h"
+#include "miscadmin.h"
+#include "nodes/makefuncs.h"
+#include "parser/parse_type.h"
+#include "tcop/tcopprot.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "utils/builtins.h"
+#include "utils/hsearch.h"
+#include "utils/lsyscache.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+#include "utils/syscache.h"
+#include "utils/typcache.h"
+
+/*
+ * Undefine some things that get (re)defined in the
+ * Python headers. They aren't used below and we've
+ * already included all the headers we need, so this
+ * should be pretty safe.
+ */
+
+#undef _POSIX_C_SOURCE
+#undef _XOPEN_SOURCE
+#undef HAVE_STRERROR
+#undef HAVE_TZNAME
+
+/*
+ * Sometimes python carefully scribbles on our *printf macros.
+ * So we undefine them here and redefine them after it's done its dirty deed.
+ */
+
+#ifdef USE_REPL_SNPRINTF
+#undef snprintf
+#undef vsnprintf
+#endif
+
 #if defined(_MSC_VER) && defined(_DEBUG)
 /* Python uses #pragma to bring in a non-default libpython on VC++ if
  * _DEBUG is defined */
@@ -84,34 +134,6 @@ typedef int Py_ssize_t;
 		PyObject_HEAD_INIT(type) size,
 #endif
 
-#include "postgres.h"
-
-/* system stuff */
-#include 
-#include 
-
-/* postgreSQL stuff */
-#include "catalog/pg_proc.h"
-#include "catalog/pg_type.h"
-#include "commands/trigger.h"
-#include "executor/spi.h"
-#include "funcapi.h"
-#include "fmgr.h"
-#include "mb/pg_wchar.h"
-#include "miscadmin.h"
-#include "nodes/makefuncs.h"
-#include "parser/parse_type.h"
-#include "tcop/tcopprot.h"
-#include "access/transam.h"
-#include "access/xact.h"
-#include "utils/builtins.h"
-#include "utils/hsearch.h"
-#include "utils/lsyscache.h"
-#include "utils/memutils.h"
-#include "utils/rel.h"
-#include "utils/syscache.h"
-#include "utils/typcache.h"
-
 /* define our text domain for translations */
 #undef TEXTDOMAIN
 #define TEXTDOMAIN PG_TEXTDOMAIN("plpython")
@@ -119,6 +141,23 @@ typedef int Py_ssize_t;
 #include 
 #include 
 
+/* put back our snprintf and vsnprintf */
+#ifdef USE_REPL_SNPRINTF
+#ifdef snprintf
+#undef snprintf
+#endif
+#ifdef vsnprintf
+#undef vsnprintf
+#endif
+#ifdef __GNUC__
+#define vsnprintf(...)  pg_vsnprintf(__VA_ARGS__)
+#define snprintf(...)   pg_snprintf(__VA_ARGS__)
+#else
+#define vsnprintf   pg_vsnprintf
+#define snprintf    pg_snprintf
+#endif   /* __GNUC__ */
+#endif   /* USE_REPL_SNPRINTF */
+
 PG_MODULE_MAGIC;
 
 /* convert Postgresql Datum or tuple into a PyObject.



Re: [HACKERS] libedit memory stomp is apparently fixed in OS X Lion

2011-07-31 Thread Dave Page
On Sun, Jul 31, 2011 at 6:45 PM, Alvaro Herrera
 wrote:
> Excerpts from Dave Page's message of sáb jul 30 15:32:03 -0400 2011:
>> On Sat, Jul 30, 2011 at 8:25 PM, Tom Lane  wrote:
>> > I think you had better plan on incorporating GNU readline into installer
>> > builds for Lion.
>>
>> Unfortunately the licence makes that a non-starter.
>
> A fixed version of libedit, then?

Yes, that's the best we can do.


-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] libedit memory stomp is apparently fixed in OS X Lion

2011-07-31 Thread Alvaro Herrera
Excerpts from Dave Page's message of sáb jul 30 15:32:03 -0400 2011:
> On Sat, Jul 30, 2011 at 8:25 PM, Tom Lane  wrote:
> > I think you had better plan on incorporating GNU readline into installer
> > builds for Lion.
> 
> Unfortunately the licence makes that a non-starter.

A fixed version of libedit, then?

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] error: could not find pg_class tuple for index 2662

2011-07-31 Thread Tom Lane
daveg  writes:
> Here is the update: the problem happens with vacuum full alone, no reindex
> is needed to trigger it. I updated the script to avoid reindexing after
> vacuum. Over the past two days there are still many occurrences of this
> error coincident with the vacuum.

Well, that jives with the assumption that the one case we saw in
the buildfarm was the same thing, because the regression tests were
certainly only doing a VACUUM FULL and not a REINDEX of pg_class.
But it doesn't get us much closer to understanding what's happening.
In particular, it seems to knock out most ideas associated with race
conditions, because the VAC FULL should hold exclusive lock on pg_class
until it's completely done (including index rebuilds).

I think we need to start adding some instrumentation so we can get a
better handle on what's going on in your database.  If I were to send
you a source-code patch for the server that adds some more logging
printout when this happens, would you be willing/able to run a patched
build on your machine?

(BTW, just to be perfectly clear ... the "could not find pg_class tuple"
errors always mention index 2662, right, never any other number?)

regards, tom lane



Re: [HACKERS] error: could not find pg_class tuple for index 2662

2011-07-31 Thread daveg
On Thu, Jul 28, 2011 at 11:31:31PM -0700, daveg wrote:
> On Thu, Jul 28, 2011 at 07:45:01PM -0400, Robert Haas wrote:
> > REINDEX.  My guess is that this is happening either right around the
> > time the VACUUM FULL commits or right around the time the REINDEX
> > commits.  It'd be helpful to know which, if you can figure it out.
> 
> I'll update my vacuum script to skip reindexes after vacuum full for 9.0
> servers and see if that makes the problem go away. Thanks for reminding
> me that they are not needed. However, I suspect it is the vacuum, not the
> reindex causing the problem. I'll update when I know.

Here is the update: the problem happens with vacuum full alone, no reindex
is needed to trigger it. I updated the script to avoid reindexing after
vacuum. Over the past two days there are still many occurrences of this
error coincident with the vacuum.

-dg

-- 
David Gould   da...@sonic.net  510 536 1443  510 282 0869
If simplicity worked, the world would be overrun with insects.
