Re: [sqlsmith] Unpinning error in parallel worker

2018-05-22 Thread Thomas Munro
On Wed, Apr 25, 2018 at 8:15 AM, Jonathan Rudenberg
 wrote:
> On Tue, Apr 24, 2018, at 16:06, Thomas Munro wrote:
>> I'll write a patch to fix that unpleasant symptom.  While holding
>> DynamicSharedMemoryControlLock we shouldn't raise any errors without
>> releasing it first, because the error handling path will try to
>> acquire it again.  That's a horrible failure mode as you have
>> discovered.
>>
>> But that isn't the root problem: we shouldn't be raising that error,
>> and I'd love to see the stack of the one process that did that and
>> then self-deadlocked.  I will have another go at trying to reproduce
>> it here today.
>
> Thanks for the update!
>
> We have turned off parallel queries (using max_parallel_workers_per_gather = 0)
> for now, as the production impact of this bug is unfortunately quite
> problematic.

Apologies for the delay... I've tried a few times to reproduce the
problem, in vain.  There may be a timing element here.

> What will this failure look like with the patch you've proposed?

On second thoughts, I think it depends on what is going wrong, and I'm
not entirely sure the result would necessarily be better.  It might be
that errors are raised and normal service resumes, or it might be that
we have tangled up our resources in a bad way.  I need to get to the
bottom of this.
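
For what it's worth, the shape I have in mind for the symptom fix is roughly
the following -- a sketch only, not a committed patch, and the function name
is made up for illustration:

#include "postgres.h"

#include "storage/lwlock.h"

/*
 * Sketch only, not the committed change: any error raised while
 * DynamicSharedMemoryControlLock is held must release the lock first,
 * because the error cleanup path (shmem_exit -> dsm_backend_shutdown ->
 * dsm_detach) will try to acquire it again and self-deadlock otherwise.
 */
static void
report_unpin_error_without_deadlocking(void)
{
    Assert(LWLockHeldByMe(DynamicSharedMemoryControlLock));

    /* Drop the lock before erroring out; dsm_detach() will reacquire it. */
    LWLockRelease(DynamicSharedMemoryControlLock);
    elog(ERROR, "cannot unpin a segment that is not pinned");
}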

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [sqlsmith] Unpinning error in parallel worker

2018-04-24 Thread Jonathan Rudenberg


On Tue, Apr 24, 2018, at 16:06, Thomas Munro wrote:
> On Wed, Apr 25, 2018 at 2:21 AM, Jonathan Rudenberg
>  wrote:
> > This issue happened again in production; here are the stack traces from
> > three backends that we grabbed before nuking the >400 hanging backends.
> >
> > [...]
> > #4  0x55fccb93b21c in LWLockAcquire+188() at /usr/lib/postgresql/10/bin/postgres at lwlock.c:1233
> > #5  0x55fccb925fa7 in dsm_create+151() at /usr/lib/postgresql/10/bin/postgres at dsm.c:493
> > #6  0x55fccb6f2a6f in InitializeParallelDSM+511() at /usr/lib/postgresql/10/bin/postgres at parallel.c:266
> > [...]
> 
> Thank you.  These stacks are all blocked trying to acquire
> DynamicSharedMemoryControlLock.  My theory is that they can't because
> one backend -- the one that emitted the error "FATAL:  cannot unpin a
> segment that is not pinned" -- is deadlocked against itself.  After
> emitting that error you can see from Andreas's "seabisquit" stack that
> shmem_exit() runs dsm_backend_shutdown(), which runs dsm_detach(),
> which tries to acquire DynamicSharedMemoryControlLock again, even
> though we already hold it at that point.
> 
> I'll write a patch to fix that unpleasant symptom.  While holding
> DynamicSharedMemoryControlLock we shouldn't raise any errors without
> releasing it first, because the error handling path will try to
> acquire it again.  That's a horrible failure mode as you have
> discovered.
> 
> But that isn't the root problem: we shouldn't be raising that error,
> and I'd love to see the stack of the one process that did that and
> then self-deadlocked.  I will have another go at trying to reproduce
> it here today.

Thanks for the update!

We have turned off parallel queries (using max_parallel_workers_per_gather = 0) 
for now, as the production impact of this bug is unfortunately quite 
problematic.

What will this failure look like with the patch you've proposed?

Thanks again,

Jonathan



Re: [sqlsmith] Unpinning error in parallel worker

2018-04-24 Thread Jonathan Rudenberg


On Fri, Apr 20, 2018, at 00:42, Thomas Munro wrote:
> On Wed, Apr 18, 2018 at 11:43 AM, Jonathan Rudenberg
>  wrote:
> > On Tue, Apr 17, 2018, at 19:31, Thomas Munro wrote:
> >> On Wed, Apr 18, 2018 at 11:01 AM, Jonathan Rudenberg
> >>  wrote:
> >> > Yep, I think I know approximately what it looked like, I've attached a 
> >> > lightly redacted plan. All of the hung queries were running some variant 
> >> > of this plan as far as I can tell.
> >>
> >> Hmm, that isn't a parallel query.  I was expecting to see "Gather" and
> >> "Parallel" in there.
> >
> > Oops, I'm really sorry about that. I only have the first part of the hung 
> > queries, and there are a few variants. Here's one that's parallel.
> 
> I spent some time trying to reproduce this failure without any luck,
> using query plans similar to your Gather plan fragment, and using some
> test harness code for the allocator stuff in isolation.
> 
> I had an idea that (1) freeing a large object that releases and unpins
> a segment in one backend and then (2) freeing it again in another
> backend (illegally) might produce this effect with sufficiently bad
> luck.  I'm still trying to reproduce that without any success, but I
> get other kinds of failures which I think you'd be seeing too if that
> hunch were right.  Still looking...


This issue happened again in production; here are the stack traces from three
backends that we grabbed before nuking the >400 hanging backends.

process: /proc/24194/mem
thread: 0x7f46263648c0, lwp: 24194, type: 1
#0  0x7f4625fdc827 in do_futex_wait+55() at /lib/x86_64-linux-gnu/libpthread.so.0 at futex-internal.h:205
#1  0x7f4625fdc8d4 in __new_sem_wait_slow+84() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_waitcommon.c:181
#2  0x7f4625fdc97a in __new_sem_wait+58() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_wait.c:29
#3  0x55fccb8c8832 in PGSemaphoreLock+34() at /usr/lib/postgresql/10/bin/postgres at pg_sema.c:310
#4  0x55fccb93b21c in LWLockAcquire+188() at /usr/lib/postgresql/10/bin/postgres at lwlock.c:1233
#5  0x55fccb925fa7 in dsm_create+151() at /usr/lib/postgresql/10/bin/postgres at dsm.c:493
#6  0x55fccb6f2a6f in InitializeParallelDSM+511() at /usr/lib/postgresql/10/bin/postgres at parallel.c:266
#7  0x55fccb815f27 in ExecInitParallelPlan+807() at /usr/lib/postgresql/10/bin/postgres at execParallel.c:470
#8  0x55fccb825a50 in ExecGather+1168() at /usr/lib/postgresql/10/bin/postgres at nodeGather.c:158
#9  0x55fccb81dbd3 in ExecAppend+83() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#10 0x55fccb81df2c in fetch_input_tuple+172() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#11 0x55fccb820293 in ExecAgg+1235() at /usr/lib/postgresql/10/bin/postgres at nodeAgg.c:2539
#12 0x55fccb835235 in CteScanNext+229() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#13 0x55fccb817f0d in ExecScan+557() at /usr/lib/postgresql/10/bin/postgres at execScan.c:97
#14 0x55fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#15 0x55fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#16 0x55fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#17 0x55fccb81df2c in fetch_input_tuple+172() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#18 0x55fccb8201eb in ExecAgg+1067() at /usr/lib/postgresql/10/bin/postgres at nodeAgg.c:2347
#19 0x55fccb811dfb in standard_ExecutorRun+379() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#20 0x55fccb94d19b in PortalRunSelect+507() at /usr/lib/postgresql/10/bin/postgres at pquery.c:932
#21 0x55fccb94e798 in PortalRun+824() at /usr/lib/postgresql/10/bin/postgres at pquery.c:773
#22 0x55fccb94b329 in PostgresMain+3721() at /usr/lib/postgresql/10/bin/postgres at postgres.c:1984
#23 0x55fccb6840d9 in ServerLoop+3412() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:4405
#24 0x55fccb8da78b in PostmasterMain+4235() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:1363
#25 0x55fccb6854d5 in main+2165() at /usr/lib/postgresql/10/bin/postgres at main.c:228
#26 0x7f4623c07830 in __libc_start_main+240() at /lib/x86_64-linux-gnu/libc.so.6 at libc-start.c:291
#27 0x55fccb685559 in _start!+41() at /usr/lib/postgresql/10/bin/postgres

process: /proc/24231/mem
thread: 0x7f46263648c0, lwp: 24231, type: 1
#0  0x7f4625fdc827 in do_futex_wait+55() at /lib/x86_64-linux-gnu/libpthread.so.0 at futex-internal.h:205
#1  0x7f4625fdc8d4 in __new_sem_wait_slow+84() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_waitcommon.c:181
#2  0x7f4625fdc97a in __new_sem_wait+58() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_wait.c:29
#3  0x55fccb8c8832 in PGSemaphoreLock+34() at /usr/lib/postgresql/10/bin/postgres at pg_sema.c:310
#4  0x55fccb93b21c in 

Re: [sqlsmith] Unpinning error in parallel worker

2018-04-20 Thread Jonathan Rudenberg


On Fri, Apr 20, 2018, at 00:42, Thomas Munro wrote:
> On Wed, Apr 18, 2018 at 11:43 AM, Jonathan Rudenberg
>  wrote:
> > On Tue, Apr 17, 2018, at 19:31, Thomas Munro wrote:
> >> On Wed, Apr 18, 2018 at 11:01 AM, Jonathan Rudenberg
> >>  wrote:
> >> > Yep, I think I know approximately what it looked like, I've attached a 
> >> > lightly redacted plan. All of the hung queries were running some variant 
> >> > of this plan as far as I can tell.
> >>
> >> Hmm, that isn't a parallel query.  I was expecting to see "Gather" and
> >> "Parallel" in there.
> >
> > Oops, I'm really sorry about that. I only have the first part of the hung 
> > queries, and there are a few variants. Here's one that's parallel.
> 
> I spent some time trying to reproduce this failure without any luck,
> using query plans similar to your Gather plan fragment, and using some
> test harness code for the allocator stuff in isolation.
> 
> I had an idea that (1) freeing a large object that releases and unpins
> a segment in one backend and then (2) freeing it again in another
> backend (illegally) might produce this effect with sufficiently bad
> luck.  I'm still trying to reproduce that without any success, but I
> get other kinds of failures which I think you'd be seeing too if that
> hunch were right.  Still looking...

Thanks for investigating! We have monitoring in place and a plan to grab stack
traces if this happens again. Is there anything else that would be useful for
us to collect, in addition to a stack trace from the process running the stuck
query?



Re: [sqlsmith] Unpinning error in parallel worker

2018-04-19 Thread Thomas Munro
On Wed, Apr 18, 2018 at 11:43 AM, Jonathan Rudenberg
 wrote:
> On Tue, Apr 17, 2018, at 19:31, Thomas Munro wrote:
>> On Wed, Apr 18, 2018 at 11:01 AM, Jonathan Rudenberg
>>  wrote:
>> > Yep, I think I know approximately what it looked like, I've attached a 
>> > lightly redacted plan. All of the hung queries were running some variant 
>> > of this plan as far as I can tell.
>>
>> Hmm, that isn't a parallel query.  I was expecting to see "Gather" and
>> "Parallel" in there.
>
> Oops, I'm really sorry about that. I only have the first part of the hung 
> queries, and there are a few variants. Here's one that's parallel.

I spent some time trying to reproduce this failure without any luck,
using query plans similar to your Gather plan fragment, and using some
test harness code for the allocator stuff in isolation.

I had an idea that (1) freeing a large object that releases and unpins
a segment in one backend and then (2) freeing it again in another
backend (illegally) might produce this effect with sufficiently bad
luck.  I'm still trying to reproduce that without any success, but I
get other kinds of failures which I think you'd be seeing too if that
hunch were right.  Still looking...

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [sqlsmith] Unpinning error in parallel worker

2018-04-17 Thread Thomas Munro
On Wed, Apr 18, 2018 at 11:01 AM, Jonathan Rudenberg
 wrote:
> On Tue, Apr 17, 2018, at 18:38, Thomas Munro wrote:
>> Thanks, that would be much appreciated, as would any clues about what
>> workload you're running.  Do you know what the query plan looks like
>> for the queries that crashed?
>
> Yep, I think I know approximately what it looked like, I've attached a 
> lightly redacted plan. All of the hung queries were running some variant of 
> this plan as far as I can tell.

Hmm, that isn't a parallel query.  I was expecting to see "Gather" and
"Parallel" in there.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [sqlsmith] Unpinning error in parallel worker

2018-04-17 Thread Jonathan Rudenberg


On Tue, Apr 17, 2018, at 18:38, Thomas Munro wrote:

> I don't have any theories about how that could be going wrong right
> now, but I'm looking into it. 

Thank you!

> > I don't have a backtrace yet, but I will provide one if/when the issue
> > happens again.
> 
> Thanks, that would be much appreciated, as would any clues about what
> workload you're running.  Do you know what the query plan looks like
> for the queries that crashed?

Yep, I think I know approximately what it looked like, I've attached a lightly 
redacted plan. All of the hung queries were running some variant of this plan 
as far as I can tell.

Thanks again!

Jonathan



                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1896.13..1896.13 rows=2 width=4336) (actual time=29.396..29.447 rows=410 loops=1)
   Sort Key: assets."position"
   Sort Method: quicksort  Memory: 1671kB
   CTE assets
     ->  HashAggregate  (cost=1861.88..1863.61 rows=173 width=864) (actual time=23.328..23.883 rows=1242 loops=1)
           Group Key: assets_7.id, assets_7.name, assets_7.description, assets_7.asset_key, assets_7.type, assets_7.created_at, assets_7.updated_at, assets_7.deleted_at, assets_7.redacted_id, assets_7.section_id, assets_7."position", assets_7.creator_id, assets_7.modifier_id, assets_7.data, assets_7.thumbnail_override, assets_7.availability_start, assets_7.availability_end, assets_7.availability
           ->  Append  (cost=853.88..1854.10 rows=173 width=864) (actual time=10.871..22.002 rows=1242 loops=1)
                 ->  CTE Scan on assets assets_7  (cost=853.88..861.60 rows=2 width=864) (actual time=10.870..12.339 rows=1242 loops=1)
                       Filter: (redacted_id = 32456)
                       CTE assets
                         ->  HashAggregate  (cost=850.45..853.88 rows=343 width=864) (actual time=10.867..11.392 rows=1242 loops=1)
                               Group Key: assets_1.id, assets_1.name, assets_1.description, assets_1.asset_key, assets_1.type, assets_1.created_at, assets_1.updated_at, assets_1.deleted_at, assets_1.redacted_id, assets_1.section_id, assets_1."position", assets_1.creator_id, assets_1.modifier_id, assets_1.data, assets_1.thumbnail_override, assets_1.availability_start, assets_1.availability_end, assets_1.availability
                               ->  Append  (cost=1.41..835.02 rows=343 width=864) (actual time=0.380..9.456 rows=1261 loops=1)
                                     ->  Nested Loop  (cost=1.41..174.78 rows=1 width=249) (actual time=0.046..0.046 rows=0 loops=1)
                                           ->  Nested Loop  (cost=0.98..5.71 rows=1 width=4) (actual time=0.046..0.046 rows=0 loops=1)
                                                 ->  Nested Loop  (cost=0.70..5.14 rows=1 width=8) (actual time=0.046..0.046 rows=0 loops=1)
                                                       ->  Index Scan using index_user_id_on_permissible_type on user_permissions  (cost=0.42..2.64 rows=1 width=4) (actual time=0.045..0.045 rows=0 loops=1)
                                                             Index Cond: ((user_id = 98789) AND ((permissible_type)::text = 'Organization'::text))
                                                             Filter: ((permission_level)::text = 'guest'::text)
                                                       ->  Index Only Scan using organizations_pkey on organizations  (cost=0.28..2.50 rows=1 width=4) (never executed)
                                                             Index Cond: (id = user_permissions.permissible_id)
                                                             Heap Fetches: 0
                                                 ->  Index Scan using index_redacteds_on_organization_id on redacteds redacteds_1  (cost=0.28..0.43 rows=1 width=8) (never executed)
                                                       Index Cond: (organization_id = organizations.id)

Re: [sqlsmith] Unpinning error in parallel worker

2018-04-17 Thread Thomas Munro
On Wed, Apr 18, 2018 at 8:52 AM, Jonathan Rudenberg
 wrote:
> Hundreds of queries were stuck with a wait_event of DynamicSharedMemoryControlLock,
> and pg_terminate_backend did not terminate them.
>
> In the log:
>
>> FATAL:  cannot unpin a segment that is not pinned

Thanks for the report.  That error is reachable via two paths:

1.  Cleanup of a DSA area at the end of a query, giving back all
segments.  This is how the bug originally reported in this thread
reached it: we tried to double-destroy the DSA area when its refcount
went down to zero, then back up again, and then back to zero (a
late-starting parallel worker attached in a narrow time window).  That
was fixed in fddf45b3: once the refcount reaches zero we recognise the
area as already destroyed and don't even let anyone attach.

2.  In destroy_superblock(), called by dsa_free(), where we've
determined that a 64kB superblock can be given back to the DSM
segment, and that the DSM segment is now entirely free and so can be
given back to the operating system.  To do that, after we put the
pages back into the free page manager, we test fpm_largest(segment_map->fpm) ==
segment_map->header->usable_pages to see whether the largest span of
free pages is now the same size as the whole segment.
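
In code terms, that decision point looks roughly like this -- paraphrased
from dsa.c with the surrounding bookkeeping elided, so please don't read it
as a literal excerpt:

/*
 * Once the superblock's pages are back in the segment's free page manager,
 * a segment that is now one big free span gets unpinned and handed back to
 * the OS.  dsm_unpin_segment() is where "cannot unpin a segment that is
 * not pinned" would be raised if the control state were already wrong at
 * this point.
 */
FreePageManagerPut(segment_map->fpm, first_page, DSA_PAGES_PER_SUPERBLOCK);
if (fpm_largest(segment_map->fpm) == segment_map->header->usable_pages)
{
    dsm_unpin_segment(dsm_segment_handle(segment_map->segment));
    /* ... plus bookkeeping to forget about the segment, elided ... */
}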

I don't have any theories about how that could be going wrong right
now, but I'm looking into it.  There could be a logic bug in dsa.c, or
a logic bug in client code running an invalid sequence of
dsa_allocate() and dsa_free() calls that corrupts state (I wonder if a
well-timed double dsa_free() could produce this effect), or a
common-or-garden overrun bug somewhere that trashes the control state.
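
To spell out the double-free idea (purely illustrative -- not a known
culprit -- and assuming "area" is a dsa_area that both backends are
attached to):

dsa_pointer p = dsa_allocate(area, 100 * 1024); /* big enough to get its own span of pages */

dsa_free(area, p);      /* legal: may unpin and give back a segment */
dsa_free(area, p);      /* illegal double free, perhaps from another backend */

If the first dsa_free() happens to return the containing segment, the second
one is scribbling on recycled control data, which could plausibly surface
later as the unpin complaint.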

> I don't have a backtrace yet, but I will provide one if/when the issue
> happens again.

Thanks, that would be much appreciated, as would any clues about what
workload you're running.  Do you know what the query plan looks like
for the queries that crashed?

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [sqlsmith] Unpinning error in parallel worker

2018-04-17 Thread Jonathan Rudenberg
On Wed, Mar 29, 2017, at 10:50, Robert Haas wrote:
> On Wed, Mar 29, 2017 at 1:31 AM, Thomas Munro
>  wrote:
> > I considered whether the error message could be improved but it
> > matches the message for an existing similar case (where you try to
> > attach to an unknown handle).
> 
> Ugh, OK.  I committed this, but I think this whole file needs a visit
> from the message style police.

I ran into an issue matching the symptoms of this bug twice in the past week,
on a production cluster running 10.3 (Ubuntu 10.3-1.pgdg16.04+1).

Hundreds of queries were stuck with a wait_event of DynamicSharedMemoryControlLock,
and pg_terminate_backend did not terminate them.

In the log:

> FATAL:  cannot unpin a segment that is not pinned

I don't have a backtrace yet, but I will provide one if/when the issue happens
again.

Jonathan