Re: [sqlsmith] Unpinning error in parallel worker
On Wed, Apr 25, 2018 at 8:15 AM, Jonathan Rudenberg wrote:
> On Tue, Apr 24, 2018, at 16:06, Thomas Munro wrote:
>> I'll write a patch to fix that unpleasant symptom.  While holding
>> DynamicSharedMemoryControlLock we shouldn't raise any errors without
>> releasing it first, because the error handling path will try to
>> acquire it again.  That's a horrible failure mode as you have
>> discovered.
>>
>> But that isn't the root problem: we shouldn't be raising that error,
>> and I'd love to see the stack of the one process that did that and
>> then self-deadlocked.  I will have another go at trying to reproduce
>> it here today.
>
> Thanks for the update!
>
> We have turned off parallel queries (using max_parallel_workers_per_gather =
> 0) for now, as the production impact of this bug is unfortunately quite
> problematic.

Apologies for the delay... I've tried a few times to reproduce the
problem, in vain.  There may be a timing element here.

> What will this failure look like with the patch you've proposed?

On second thoughts, I think it depends on what is going wrong, and I'm
not entirely sure the result would necessarily be better.  It might be
that errors are raised and normal service resumes, or it might be that
we have tangled up our resources in a bad way.  I need to get to the
bottom of this.

--
Thomas Munro
http://www.enterprisedb.com
Re: [sqlsmith] Unpinning error in parallel worker
On Tue, Apr 24, 2018, at 16:06, Thomas Munro wrote:
> On Wed, Apr 25, 2018 at 2:21 AM, Jonathan Rudenberg wrote:
> > This issue happened again in production, here are the stack traces from
> > three we grabbed before nuking the >400 hanging backends.
> >
> > [...]
> > #4 0x55fccb93b21c in LWLockAcquire+188() at
> > /usr/lib/postgresql/10/bin/postgres at lwlock.c:1233
> > #5 0x55fccb925fa7 in dsm_create+151() at
> > /usr/lib/postgresql/10/bin/postgres at dsm.c:493
> > #6 0x55fccb6f2a6f in InitializeParallelDSM+511() at
> > /usr/lib/postgresql/10/bin/postgres at parallel.c:266
> > [...]
>
> Thank you.  These stacks are all blocked trying to acquire
> DynamicSharedMemoryControlLock.  My theory is that they can't because
> one backend -- the one that emitted the error "FATAL: cannot unpin a
> segment that is not pinned" -- is deadlocked against itself.  After
> emitting that error you can see from Andreas's "seabisquit" stack that
> shmem_exit() runs dsm_backend_shutdown() which runs dsm_detach()
> which tries to acquire DynamicSharedMemoryControlLock again, even
> though we already hold it at that point.
>
> I'll write a patch to fix that unpleasant symptom.  While holding
> DynamicSharedMemoryControlLock we shouldn't raise any errors without
> releasing it first, because the error handling path will try to
> acquire it again.  That's a horrible failure mode as you have
> discovered.
>
> But that isn't the root problem: we shouldn't be raising that error,
> and I'd love to see the stack of the one process that did that and
> then self-deadlocked.  I will have another go at trying to reproduce
> it here today.

Thanks for the update!

We have turned off parallel queries (using max_parallel_workers_per_gather = 0)
for now, as the production impact of this bug is unfortunately quite
problematic.

What will this failure look like with the patch you've proposed?

Thanks again,

Jonathan
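The self-deadlock theory above follows a general pattern: an error is raised while a non-reentrant lock is still held, and the error-cleanup path needs that same lock. A minimal toy sketch of the pattern (all names invented for illustration; this is not PostgreSQL code, and a real LWLock would block forever rather than return false on re-acquisition):

```c
#include <stdbool.h>

/* Toy stand-in for DynamicSharedMemoryControlLock. */
static bool control_lock_held = false;

/* Toy stand-in for LWLockAcquire: a real non-reentrant lock would
 * block forever here; we return false to make the hang observable. */
static bool lock_acquire(void)
{
    if (control_lock_held)
        return false;           /* self-deadlock in the real system */
    control_lock_held = true;
    return true;
}

static void lock_release(void)
{
    control_lock_held = false;
}

/* Toy error-cleanup path: like dsm_detach() run from shmem_exit(),
 * it needs the control lock too. */
static bool error_cleanup(void)
{
    if (!lock_acquire())
        return false;           /* deadlocked against ourselves */
    lock_release();
    return true;
}

/* Buggy pattern: error out while still holding the lock. */
static bool unpin_buggy(bool pinned)
{
    lock_acquire();
    if (!pinned)
        return error_cleanup(); /* cleanup runs with lock still held */
    lock_release();
    return true;
}

/* Fixed pattern: release the lock before erroring out, so the
 * cleanup path can take it. */
static bool unpin_fixed(bool pinned)
{
    lock_acquire();
    if (!pinned)
    {
        lock_release();         /* release first, then raise */
        return error_cleanup();
    }
    lock_release();
    return true;
}
```

In the toy model the buggy path reports the deadlock (returns false) while the fixed path completes cleanup, which is the shape of the symptom fix described above.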
Re: [sqlsmith] Unpinning error in parallel worker
On Fri, Apr 20, 2018, at 00:42, Thomas Munro wrote:
> On Wed, Apr 18, 2018 at 11:43 AM, Jonathan Rudenberg wrote:
> > On Tue, Apr 17, 2018, at 19:31, Thomas Munro wrote:
> >> On Wed, Apr 18, 2018 at 11:01 AM, Jonathan Rudenberg wrote:
> >> > Yep, I think I know approximately what it looked like, I've attached a
> >> > lightly redacted plan.  All of the hung queries were running some variant
> >> > of this plan as far as I can tell.
> >>
> >> Hmm, that isn't a parallel query.  I was expecting to see "Gather" and
> >> "Parallel" in there.
> >
> > Oops, I'm really sorry about that.  I only have the first part of the hung
> > queries, and there are a few variants.  Here's one that's parallel.
>
> I spent some time trying to reproduce this failure without any luck,
> using query plans similar to your Gather plan fragment, and using some
> test harness code for the allocator stuff in isolation.
>
> I had an idea that (1) freeing a large object that releases and unpins
> a segment in one backend and then (2) freeing it again in another
> backend (illegally) might produce this effect with sufficiently bad
> luck.  I'm still trying to reproduce that without any success, but I
> get other kinds of failures which I think you'd be seeing too if that
> hunch were right.  Still looking...

This issue happened again in production, here are the stack traces from
three we grabbed before nuking the >400 hanging backends.
process: /proc/24194/mem thread: 0x7f46263648c0, lwp: 24194, type: 1
#0  0x7f4625fdc827 in do_futex_wait+55() at /lib/x86_64-linux-gnu/libpthread.so.0 at futex-internal.h:205
#1  0x7f4625fdc8d4 in __new_sem_wait_slow+84() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_waitcommon.c:181
#2  0x7f4625fdc97a in __new_sem_wait+58() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_wait.c:29
#3  0x55fccb8c8832 in PGSemaphoreLock+34() at /usr/lib/postgresql/10/bin/postgres at pg_sema.c:310
#4  0x55fccb93b21c in LWLockAcquire+188() at /usr/lib/postgresql/10/bin/postgres at lwlock.c:1233
#5  0x55fccb925fa7 in dsm_create+151() at /usr/lib/postgresql/10/bin/postgres at dsm.c:493
#6  0x55fccb6f2a6f in InitializeParallelDSM+511() at /usr/lib/postgresql/10/bin/postgres at parallel.c:266
#7  0x55fccb815f27 in ExecInitParallelPlan+807() at /usr/lib/postgresql/10/bin/postgres at execParallel.c:470
#8  0x55fccb825a50 in ExecGather+1168() at /usr/lib/postgresql/10/bin/postgres at nodeGather.c:158
#9  0x55fccb81dbd3 in ExecAppend+83() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#10 0x55fccb81df2c in fetch_input_tuple+172() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#11 0x55fccb820293 in ExecAgg+1235() at /usr/lib/postgresql/10/bin/postgres at nodeAgg.c:2539
#12 0x55fccb835235 in CteScanNext+229() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#13 0x55fccb817f0d in ExecScan+557() at /usr/lib/postgresql/10/bin/postgres at execScan.c:97
#14 0x55fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#15 0x55fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#16 0x55fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#17 0x55fccb81df2c in fetch_input_tuple+172() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#18 0x55fccb8201eb in ExecAgg+1067() at /usr/lib/postgresql/10/bin/postgres at nodeAgg.c:2347
#19 0x55fccb811dfb in standard_ExecutorRun+379() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#20 0x55fccb94d19b in PortalRunSelect+507() at /usr/lib/postgresql/10/bin/postgres at pquery.c:932
#21 0x55fccb94e798 in PortalRun+824() at /usr/lib/postgresql/10/bin/postgres at pquery.c:773
#22 0x55fccb94b329 in PostgresMain+3721() at /usr/lib/postgresql/10/bin/postgres at postgres.c:1984
#23 0x55fccb6840d9 in ServerLoop+3412() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:4405
#24 0x55fccb8da78b in PostmasterMain+4235() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:1363
#25 0x55fccb6854d5 in main+2165() at /usr/lib/postgresql/10/bin/postgres at main.c:228
#26 0x7f4623c07830 in __libc_start_main+240() at /lib/x86_64-linux-gnu/libc.so.6 at libc-start.c:291
#27 0x55fccb685559 in _start!+41() at /usr/lib/postgresql/10/bin/postgres

process: /proc/24231/mem thread: 0x7f46263648c0, lwp: 24231, type: 1
#0  0x7f4625fdc827 in do_futex_wait+55() at /lib/x86_64-linux-gnu/libpthread.so.0 at futex-internal.h:205
#1  0x7f4625fdc8d4 in __new_sem_wait_slow+84() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_waitcommon.c:181
#2  0x7f4625fdc97a in __new_sem_wait+58() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_wait.c:29
#3  0x55fccb8c8832 in PGSemaphoreLock+34() at /usr/lib/postgresql/10/bin/postgres at pg_sema.c:310
#4  0x55fccb93b21c in
Re: [sqlsmith] Unpinning error in parallel worker
On Fri, Apr 20, 2018, at 00:42, Thomas Munro wrote:
> On Wed, Apr 18, 2018 at 11:43 AM, Jonathan Rudenberg wrote:
> > On Tue, Apr 17, 2018, at 19:31, Thomas Munro wrote:
> >> On Wed, Apr 18, 2018 at 11:01 AM, Jonathan Rudenberg wrote:
> >> > Yep, I think I know approximately what it looked like, I've attached a
> >> > lightly redacted plan.  All of the hung queries were running some variant
> >> > of this plan as far as I can tell.
> >>
> >> Hmm, that isn't a parallel query.  I was expecting to see "Gather" and
> >> "Parallel" in there.
> >
> > Oops, I'm really sorry about that.  I only have the first part of the hung
> > queries, and there are a few variants.  Here's one that's parallel.
>
> I spent some time trying to reproduce this failure without any luck,
> using query plans similar to your Gather plan fragment, and using some
> test harness code for the allocator stuff in isolation.
>
> I had an idea that (1) freeing a large object that releases and unpins
> a segment in one backend and then (2) freeing it again in another
> backend (illegally) might produce this effect with sufficiently bad
> luck.  I'm still trying to reproduce that without any success, but I
> get other kinds of failures which I think you'd be seeing too if that
> hunch were right.  Still looking...

Thanks for investigating!  We have monitoring in place and a plan to
grab stack traces if this happens again.

Is there anything else that would be useful for us to try to get in
addition to a stack trace from the process processing the stuck query?
Re: [sqlsmith] Unpinning error in parallel worker
On Wed, Apr 18, 2018 at 11:43 AM, Jonathan Rudenberg wrote:
> On Tue, Apr 17, 2018, at 19:31, Thomas Munro wrote:
>> On Wed, Apr 18, 2018 at 11:01 AM, Jonathan Rudenberg wrote:
>> > Yep, I think I know approximately what it looked like, I've attached a
>> > lightly redacted plan.  All of the hung queries were running some variant
>> > of this plan as far as I can tell.
>>
>> Hmm, that isn't a parallel query.  I was expecting to see "Gather" and
>> "Parallel" in there.
>
> Oops, I'm really sorry about that.  I only have the first part of the hung
> queries, and there are a few variants.  Here's one that's parallel.

I spent some time trying to reproduce this failure without any luck,
using query plans similar to your Gather plan fragment, and using some
test harness code for the allocator stuff in isolation.

I had an idea that (1) freeing a large object that releases and unpins
a segment in one backend and then (2) freeing it again in another
backend (illegally) might produce this effect with sufficiently bad
luck.  I'm still trying to reproduce that without any success, but I
get other kinds of failures which I think you'd be seeing too if that
hunch were right.  Still looking...

--
Thomas Munro
http://www.enterprisedb.com
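The double-free hunch above can be sketched as a toy model (invented names; not the real dsa.c code): if legally freeing a large object hands back and unpins its segment, then an illegal second free of the same object from another backend would attempt a second unpin and land on exactly the "cannot unpin a segment that is not pinned" error.

```c
#include <stdbool.h>

/* Toy model of a DSM segment's pin state. */
typedef struct
{
    bool pinned;
} toy_dsm_segment;

/* Toy unpin: returns false in place of the real
 * elog(FATAL, "cannot unpin a segment that is not pinned"). */
static bool toy_unpin(toy_dsm_segment *seg)
{
    if (!seg->pinned)
        return false;           /* the FATAL from the server log */
    seg->pinned = false;
    return true;
}

/* Freeing a large object empties its segment, so the segment is
 * released back to the OS, which includes unpinning it. */
static bool toy_free_large_object(toy_dsm_segment *seg)
{
    return toy_unpin(seg);
}

/* Demonstrate the hunch: the first free is legal, the illegal
 * second free of the same object hits the unpin error. */
static bool double_free_triggers_error(void)
{
    toy_dsm_segment seg = { .pinned = true };

    if (!toy_free_large_object(&seg))
        return false;           /* first free must succeed */
    return !toy_free_large_object(&seg);
}
```

This only illustrates the shape of the failure being hypothesised; whether client code actually issues such a double dsa_free() is the open question in the message above.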
Re: [sqlsmith] Unpinning error in parallel worker
On Wed, Apr 18, 2018 at 11:01 AM, Jonathan Rudenberg wrote:
> On Tue, Apr 17, 2018, at 18:38, Thomas Munro wrote:
>> Thanks, that would be much appreciated, as would any clues about what
>> workload you're running.  Do you know what the query plan looks like
>> for the queries that crashed?
>
> Yep, I think I know approximately what it looked like, I've attached a
> lightly redacted plan.  All of the hung queries were running some variant of
> this plan as far as I can tell.

Hmm, that isn't a parallel query.  I was expecting to see "Gather" and
"Parallel" in there.

--
Thomas Munro
http://www.enterprisedb.com
Re: [sqlsmith] Unpinning error in parallel worker
On Tue, Apr 17, 2018, at 18:38, Thomas Munro wrote:
> I don't have any theories about how that could be going wrong right
> now, but I'm looking into it.

Thank you!

> > I don't have a backtrace yet, but I will provide them if/when the issue
> > happens again.
>
> Thanks, that would be much appreciated, as would any clues about what
> workload you're running.  Do you know what the query plan looks like
> for the queries that crashed?

Yep, I think I know approximately what it looked like, I've attached a
lightly redacted plan.  All of the hung queries were running some variant
of this plan as far as I can tell.

Thanks again!

Jonathan

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Sort  (cost=1896.13..1896.13 rows=2 width=4336) (actual time=29.396..29.447 rows=410 loops=1)
   Sort Key: assets."position"
   Sort Method: quicksort  Memory: 1671kB
   CTE assets
     ->  HashAggregate  (cost=1861.88..1863.61 rows=173 width=864) (actual time=23.328..23.883 rows=1242 loops=1)
           Group Key: assets_7.id, assets_7.name, assets_7.description, assets_7.asset_key, assets_7.type, assets_7.created_at, assets_7.updated_at, assets_7.deleted_at, assets_7.redacted_id, assets_7.section_id, assets_7."position", assets_7.creator_id, assets_7.modifier_id, assets_7.data, assets_7.thumbnail_override, assets_7.availability_start, assets_7.availability_end, assets_7.availability
           ->  Append  (cost=853.88..1854.10 rows=173 width=864) (actual time=10.871..22.002 rows=1242 loops=1)
                 ->  CTE Scan on assets assets_7  (cost=853.88..861.60 rows=2 width=864) (actual time=10.870..12.339 rows=1242 loops=1)
                       Filter: (redacted_id = 32456)
                       CTE assets
                         ->  HashAggregate  (cost=850.45..853.88 rows=343 width=864) (actual time=10.867..11.392 rows=1242 loops=1)
                               Group Key: assets_1.id, assets_1.name, assets_1.description, assets_1.asset_key, assets_1.type, assets_1.created_at, assets_1.updated_at, assets_1.deleted_at, assets_1.redacted_id, assets_1.section_id, assets_1."position", assets_1.creator_id, assets_1.modifier_id, assets_1.data, assets_1.thumbnail_override, assets_1.availability_start, assets_1.availability_end, assets_1.availability
                               ->  Append  (cost=1.41..835.02 rows=343 width=864) (actual time=0.380..9.456 rows=1261 loops=1)
                                     ->  Nested Loop  (cost=1.41..174.78 rows=1 width=249) (actual time=0.046..0.046 rows=0 loops=1)
                                           ->  Nested Loop  (cost=0.98..5.71 rows=1 width=4) (actual time=0.046..0.046 rows=0 loops=1)
                                                 ->  Nested Loop  (cost=0.70..5.14 rows=1 width=8) (actual time=0.046..0.046 rows=0 loops=1)
                                                       ->  Index Scan using index_user_id_on_permissible_type on user_permissions  (cost=0.42..2.64 rows=1 width=4) (actual time=0.045..0.045 rows=0 loops=1)
                                                             Index Cond: ((user_id = 98789) AND ((permissible_type)::text = 'Organization'::text))
                                                             Filter: ((permission_level)::text = 'guest'::text)
                                                       ->  Index Only Scan using organizations_pkey on organizations  (cost=0.28..2.50 rows=1 width=4) (never executed)
                                                             Index Cond: (id = user_permissions.permissible_id)
                                                             Heap Fetches: 0
                                                 ->  Index Scan using index_redacteds_on_organization_id on redacteds redacteds_1  (cost=0.28..0.43 rows=1 width=8) (never executed)
                                                       Index Cond: (organization_id = organizations.id)
Re: [sqlsmith] Unpinning error in parallel worker
On Wed, Apr 18, 2018 at 8:52 AM, Jonathan Rudenberg wrote:
> Hundreds of queries stuck with a wait_event of DynamicSharedMemoryControlLock
> and pg_terminate_backend did not terminate the queries.
>
> In the log:
>
>> FATAL: cannot unpin a segment that is not pinned

Thanks for the report.  That error is reachable via two paths:

1.  Cleanup of a DSA area at the end of a query, giving back all
segments.  This is how the bug originally reported in this thread
reached it, and that's because of a case where we tried to
double-destroy the DSA area when the refcount went down to zero, then
back up again, and then back to zero (a late-starting parallel worker
that attached in a narrow time window).  That was fixed in fddf45b3:
once it reaches zero we recognise it as already destroyed and don't
even let anyone attach.

2.  In destroy_superblock(), called by dsa_free(), when we've
determined that a 64kb superblock can be given back to the DSM
segment, and that the DSM segment is now entirely free so it can be
given back to the operating system.  To do that, after we put the
pages back into the free page manager we test

    fpm_largest(segment_map->fpm) == segment_map->header->usable_pages

to see if the largest span of free pages is now the same size as the
whole segment.

I don't have any theories about how that could be going wrong right
now, but I'm looking into it.  There could be a logic bug in dsa.c, or
a logic bug in client code running an invalid sequence of
dsa_allocate(), dsa_free() calls that corrupts state (I wonder if a
well timed double dsa_free() could produce this effect), or a
common-or-garden overrun bug somewhere that trashes control state.

> I don't have a backtrace yet, but I will provide them if/when the issue
> happens again.

Thanks, that would be much appreciated, as would any clues about what
workload you're running.  Do you know what the query plan looks like
for the queries that crashed?

--
Thomas Munro
http://www.enterprisedb.com
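The emptiness test in path 2 can be modelled in miniature: the segment is handed back to the OS only when the largest run of free pages covers every usable page. In this toy sketch a plain bitmap stands in for the real free page manager, and the segment size and function names are invented for illustration:

```c
#include <stdbool.h>
#include <stddef.h>

#define USABLE_PAGES 16         /* toy segment size, in pages */

typedef struct
{
    bool in_use[USABLE_PAGES];
} toy_segment;

/* Toy equivalent of fpm_largest(): length of the longest run of
 * contiguous free pages in the segment. */
static size_t largest_free_run(const toy_segment *seg)
{
    size_t best = 0, run = 0;

    for (size_t i = 0; i < USABLE_PAGES; i++)
    {
        run = seg->in_use[i] ? 0 : run + 1;
        if (run > best)
            best = run;
    }
    return best;
}

/* Mark one span of pages busy, release another span, then apply the
 * destroy_superblock()-style test: is the segment entirely free? */
static bool segment_free_after_release(size_t busy_first, size_t busy_n,
                                       size_t rel_first, size_t rel_n)
{
    toy_segment seg = {{false}};

    for (size_t i = busy_first; i < busy_first + busy_n; i++)
        seg.in_use[i] = true;
    for (size_t i = rel_first; i < rel_first + rel_n; i++)
        seg.in_use[i] = false;

    /* fpm_largest(...) == usable_pages, in miniature */
    return largest_free_run(&seg) == USABLE_PAGES;
}
```

The point of the model is that the check is purely a function of free-page bookkeeping, so any corruption of that bookkeeping (a double free, an overrun) could make the test fire when the segment is not really empty, or vice versa.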
Re: [sqlsmith] Unpinning error in parallel worker
On Wed, Mar 29, 2017, at 10:50, Robert Haas wrote: > On Wed, Mar 29, 2017 at 1:31 AM, Thomas Munro >wrote: > > I considered whether the error message could be improved but it > > matches the message for an existing similar case (where you try to > > attach to an unknown handle). > > Ugh, OK. I committed this, but I think this whole file needs a visit > from the message style police. I ran into an issue that matches the symptoms of this bug on a production cluster running 10.3 (Ubuntu 10.3-1.pgdg16.04+1) twice in the past week. Hundreds of queries stuck with a wait_event of DynamicSharedMemoryControlLock and pg_terminate_backend did not terminate the queries. In the log: > FATAL: cannot unpin a segment that is not pinned I don't have a backtrace yet, but I will provide them if/when the issue happens again. Jonathan