Re: [GENERAL] 9.1.11 - many backends in "semtimedop" syscall

2014-03-10 Thread Thom Brown
On 10 March 2014 15:32, hubert depesz lubaczewski  wrote:
> On Thu, Mar 06, 2014 at 06:03:54PM +0100, hubert depesz lubaczewski wrote:
>> On Thu, Mar 06, 2014 at 12:02:50PM -0500, Tom Lane wrote:
>> > hubert depesz lubaczewski  writes:
>> > > I didn't have a chance to do it. Can try if there is a way to get trace
>> > > *without* making core (sorry, my c/gdb knowledge is very, very limited).
>> >
>> > Sure, you just attach to the process:
>> >
>> > $ gdb /path/to/postgres PID-of-process
>> > gdb> bt
>> > gdb> quit
>> >
>> > This is usually preferable to forcing a core dump.
>>
>> Thank you. If the problem will strike again, I will do it on all (or
>> most, depending how fast I can make it) backends.
>
> The problem did happen again, and we were able to find a fix (I think).
> For some reason we had a table with over 5 (yes, 50 thousand)
> indexes on it. This table was a bucardo internals table, so maybe it was
> something in bucardo (we are using it to migrate hundreds of tables to
> another machine, so maybe it has something to do with it.

This sort of thing is the reason why I'd want to see index maintenance
nodes in explain (analyse) plans, so that it's possible to gauge their
contribution to the overall duration of a DML statement.
-- 
Thom


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] 9.1.11 - many backends in "semtimedop" syscall

2014-03-10 Thread hubert depesz lubaczewski
On Thu, Mar 06, 2014 at 06:03:54PM +0100, hubert depesz lubaczewski wrote:
> On Thu, Mar 06, 2014 at 12:02:50PM -0500, Tom Lane wrote:
> > hubert depesz lubaczewski  writes:
> > > I didn't have a chance to do it. Can try if there is a way to get trace
> > > *without* making core (sorry, my c/gdb knowledge is very, very limited).
> > 
> > Sure, you just attach to the process:
> > 
> > $ gdb /path/to/postgres PID-of-process
> > gdb> bt
> > gdb> quit
> > 
> > This is usually preferable to forcing a core dump.
> 
> Thank you. If the problem will strike again, I will do it on all (or
> most, depending how fast I can make it) backends.

The problem did happen again, and we were able to find a fix (I think).
For some reason we had a table with over 5 (yes, 50 thousand)
indexes on it. This table was a bucardo internals table, so maybe it was
something in bucardo (we are using it to migrate hundreds of tables to
another machine, so maybe it has something to do with it.

Anyway - after removing obsolete indexes there - the problem is gone.

Best regards,

depesz



signature.asc
Description: Digital signature


Re: [GENERAL] 9.1.11 - many backends in "semtimedop" syscall

2014-03-06 Thread hubert depesz lubaczewski
On Thu, Mar 06, 2014 at 12:02:50PM -0500, Tom Lane wrote:
> hubert depesz lubaczewski  writes:
> > I didn't have a chance to do it. Can try if there is a way to get trace
> > *without* making core (sorry, my c/gdb knowledge is very, very limited).
> 
> Sure, you just attach to the process:
> 
>   $ gdb /path/to/postgres PID-of-process
>   gdb> bt
>   gdb> quit
> 
> This is usually preferable to forcing a core dump.

Thank you. If the problem will strike again, I will do it on all (or
most, depending how fast I can make it) backends.

Best regards,

depesz

-- 
The best thing about modern society is how easy it is to avoid contact with it.
 http://depesz.com/


signature.asc
Description: Digital signature


Re: [GENERAL] 9.1.11 - many backends in "semtimedop" syscall

2014-03-06 Thread Tom Lane
hubert depesz lubaczewski  writes:
> I didn't have a chance to do it. Can try if there is a way to get trace
> *without* making core (sorry, my c/gdb knowledge is very, very limited).

Sure, you just attach to the process:

$ gdb /path/to/postgres PID-of-process
gdb> bt
gdb> quit

This is usually preferable to forcing a core dump.

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] 9.1.11 - many backends in "semtimedop" syscall

2014-03-06 Thread hubert depesz lubaczewski
On Thu, Mar 06, 2014 at 11:56:06AM -0500, Tom Lane wrote:
> hubert depesz lubaczewski  writes:
> > Thread 1 (LWP 21422):
> > #0  0x7ffa60da2dc7 in semop () from /lib/x86_64-linux-gnu/libc.so.6
> > No symbol table info available.
> > #1  0x005f65e8 in PGSemaphoreLock ()
> > No symbol table info available.
> > #2  0x00636125 in LWLockAcquire ()
> > No symbol table info available.
> > #3  0x00630f91 in LockAcquireExtended ()
> > No symbol table info available.
> > #4  0x0062f88c in LockRelationOid ()
> > No symbol table info available.
> > #5  0x00470f6d in relation_open ()
> > No symbol table info available.
> 
> Huh.  Looks like it's blocked trying to acquire one of the lock-partition
> LWLocks, which is odd because those ought never be held very long.
> Somebody else has failed to release that LWLock, looks like.
> 
> Did you by any chance capture stack traces from all of the backends?
> The interesting one would be the one that *doesn't* look like this.
> Or possibly there's more than one such.

I didn't have a chance to do it. Can try if there is a way to get trace
*without* making core (sorry, my c/gdb knowledge is very, very limited).

Best regards,

depesz

-- 
The best thing about modern society is how easy it is to avoid contact with it.
 http://depesz.com/


signature.asc
Description: Digital signature


Re: [GENERAL] 9.1.11 - many backends in "semtimedop" syscall

2014-03-06 Thread Tom Lane
hubert depesz lubaczewski  writes:
> Thread 1 (LWP 21422):
> #0  0x7ffa60da2dc7 in semop () from /lib/x86_64-linux-gnu/libc.so.6
> No symbol table info available.
> #1  0x005f65e8 in PGSemaphoreLock ()
> No symbol table info available.
> #2  0x00636125 in LWLockAcquire ()
> No symbol table info available.
> #3  0x00630f91 in LockAcquireExtended ()
> No symbol table info available.
> #4  0x0062f88c in LockRelationOid ()
> No symbol table info available.
> #5  0x00470f6d in relation_open ()
> No symbol table info available.

Huh.  Looks like it's blocked trying to acquire one of the lock-partition
LWLocks, which is odd because those ought never be held very long.
Somebody else has failed to release that LWLock, looks like.

Did you by any chance capture stack traces from all of the backends?
The interesting one would be the one that *doesn't* look like this.
Or possibly there's more than one such.

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] 9.1.11 - many backends in "semtimedop" syscall

2014-03-06 Thread hubert depesz lubaczewski
Hi,
Ihave following situation:
amazon, but not RDS - just plain EC2, with ubuntu system.
zone reclaiming is disabled (at least in my system, no idea about host).
and there is no "transparent huge page support".

today we had at least 3 cases where bunch of abckends (~40) gets
"interlocked".
Some of them are shown as "waiting" in pg_stat_activity, but not all.
simple "insert into table ()" - without any triggers - gets killed after
10 minutes with no visible progress.

From what I saw - most of the backends are "locked" (though it's not
visible in pg_stat_actrivity) on the same table.

Did take coredump from one of the backends, but there are no symbols, so
the backtrace is just:

root@xx:/mnt# gdb --batch --quiet -ex "thread apply all bt full" -ex 
"quit" /usr/lib/postgresql/9.1/bin/postgres pg.core.21422
[New LWP 21422]
[Thread debugging using libthread_db enabled]
Using host libthread_db library
"/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres'.
#0  0x7ffa60da2dc7 in semop () from /lib/x86_64-linux-gnu/libc.so.6

Thread 1 (LWP 21422):
#0  0x7ffa60da2dc7 in semop () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1  0x005f65e8 in PGSemaphoreLock ()
No symbol table info available.
#2  0x00636125 in LWLockAcquire ()
No symbol table info available.
#3  0x00630f91 in LockAcquireExtended ()
No symbol table info available.
#4  0x0062f88c in LockRelationOid ()
No symbol table info available.
#5  0x00470f6d in relation_open ()
No symbol table info available.
#6  0x0047b013 in index_open ()
No symbol table info available.
#7  0x0057bb4c in ExecOpenIndices ()
No symbol table info available.
#8  0x005894c8 in ExecInitModifyTable ()
No symbol table info available.
#9  0x0057266a in ExecInitNode ()
No symbol table info available.
#10 0x00570e4a in standard_ExecutorStart ()
No symbol table info available.
#11 0x00593406 in ?? ()
No symbol table info available.
#12 0x00593947 in SPI_execute_plan_with_paramlist ()
No symbol table info available.
#13 0x7ff8c34c8aed in ?? () from
/usr/lib/postgresql/9.1/lib/plpgsql.so
No symbol table info available.
#14 0x7ff8c34c9716 in ?? () from
/usr/lib/postgresql/9.1/lib/plpgsql.so
No symbol table info available.
#15 0x7ff8c34ca252 in ?? () from
/usr/lib/postgresql/9.1/lib/plpgsql.so
No symbol table info available.
#16 0x7ff8c34ca252 in ?? () from
/usr/lib/postgresql/9.1/lib/plpgsql.so
No symbol table info available.
#17 0x7ff8c34cc9dc in ?? () from
/usr/lib/postgresql/9.1/lib/plpgsql.so
No symbol table info available.
#18 0x7ff8c34ccf01 in plpgsql_exec_trigger () from
/usr/lib/postgresql/9.1/lib/plpgsql.so
No symbol table info available.
#19 0x7ff8c34c214a in plpgsql_call_handler () from
/usr/lib/postgresql/9.1/lib/plpgsql.so
No symbol table info available.
#20 0x00709d67 in ?? ()
No symbol table info available.
#21 0x00555fda in ?? ()
No symbol table info available.
#22 0x0055668f in ?? ()
No symbol table info available.
#23 0x0055ba6a in AfterTriggerEndQuery ()
No symbol table info available.
#24 0x0056febf in standard_ExecutorFinish ()
No symbol table info available.
#25 0x00645a2a in ?? ()
No symbol table info available.
#26 0x00645c13 in ?? ()
No symbol table info available.
#27 0x00646962 in PortalRun ()
No symbol table info available.
#28 0x0064274a in PostgresMain ()
No symbol table info available.
#29 0x00604443 in ?? ()
No symbol table info available.
#30 0x00604eb1 in PostmasterMain ()
No symbol table info available.
#31 0x0045a720 in main ()
No symbol table info available.


What could that be, and how to get rid of the problem?

Best regards,

depesz

-- 
The best thing about modern society is how easy it is to avoid contact with it.
 http://depesz.com/


signature.asc
Description: Digital signature