Re: dsa_allocate() failure
Hi, I just checked the dmesg. The segfault I wrote about is the only one I see, dated Nov 24 last year. Since then no other segfaults have happened, although dsa_allocate() failures happen daily. I'll report if anything occurs. I have the core-dump setup in place. -- regards, pozdrawiam, Jakub Glapa On Sun, Feb 17, 2019 at 11:21 PM Justin Pryzby wrote: > Hi, > > On Mon, Nov 26, 2018 at 09:52:07AM -0600, Justin Pryzby wrote: > > Hi, thanks for following through. > > > > On Mon, Nov 26, 2018 at 04:38:35PM +0100, Jakub Glapa wrote: > > > I had a look at dmesg and indeed I see something like: > > > > > > postgres[30667]: segfault at 0 ip 557834264b16 sp 7ffc2ce1e030 > > > error 4 in postgres[557833db7000+6d5000] > > > > That's useful, I think "at 0" means a null pointer was dereferenced. > > Thomas fixed several bugs in DSA, which will be in the next releases, postgres > 10.8 > and 11.3. > > However that doesn't explain the segfault you saw, and I don't see anything > which looks relevant that has changed since 10.5. > > If you still see that using the latest minor release (10.7), please try to > capture a core file and send a backtrace in a new thread on > pgsql-hackers. > > Thanks, > Justin >
Re: dsa_allocate() failure
Hi, On Mon, Nov 26, 2018 at 09:52:07AM -0600, Justin Pryzby wrote: > Hi, thanks for following through. > > On Mon, Nov 26, 2018 at 04:38:35PM +0100, Jakub Glapa wrote: > > I had a look at dmesg and indeed I see something like: > > > > postgres[30667]: segfault at 0 ip 557834264b16 sp 7ffc2ce1e030 > > error 4 in postgres[557833db7000+6d5000] > That's useful, I think "at 0" means a null pointer was dereferenced. Thomas fixed several bugs in DSA, which will be in the next releases, postgres 10.8 and 11.3. However that doesn't explain the segfault you saw, and I don't see anything which looks relevant that has changed since 10.5. If you still see that using the latest minor release (10.7), please try to capture a core file and send a backtrace in a new thread on pgsql-hackers. Thanks, Justin
Re: dsa_allocate() failure
On Mon, Feb 04, 2019 at 08:31:47PM +, Arne Roland wrote: > I could take a backup and restore the relevant tables on a throwaway system. > You are just suggesting to replace line 728 > elog(FATAL, > "dsa_allocate could not find %zu free > pages", npages); > by > elog(PANIC, > "dsa_allocate could not find %zu free > pages", npages); > correct? Just for my understanding: why would the shutdown of the whole > instance create more helpful logging? You'd also start with pg_ctl -c, which would allow it to dump core, which could be inspected with GDB to show a backtrace and other internals, which up to now nobody (including myself) has been able to provide. Justin
RE: dsa_allocate() failure
It's definitely a relatively complex pattern. The query I sent you last time was minimal with respect to predicates (removing any single one of the predicates turned it into a working query). > Huh. Ok well that's a lot more frequent than I thought. Is it always the > same query? Any chance you can get the plan? Are there more things going on > on the server, like perhaps concurrent parallel queries? I had this bug occur while I was the only one working on the server. I checked that there was just one transaction with a snapshot at all, and it was an autovacuum busy with a totally unrelated relation my colleague was working on. The bug is indeed behaving like a ghost. One child relation needed a few new rows to test a particular application a colleague of mine was working on. The insert triggered an autoanalyze and the explain changed slightly: besides row and cost estimates, the change is that the line Recheck Cond: (((COALESCE((fid)::bigint, fallback) ) >= 1) AND ((COALESCE((fid)::bigint, fallback) ) <= 1) AND (gid && '{853078,853080,853082}'::integer[])) is now Recheck Cond: ((gid && '{853078,853080,853082}'::integer[]) AND ((COALESCE((fid)::bigint, fallback) ) >= 1) AND ((COALESCE((fid)::bigint, fallback) ) <= 1)) and the error vanished. I could try to hunt down another query by assembling seemingly random queries. I don't see a very clear pattern among the queries aborting with this error on our production servers. I'm not surprised that the bug is hard to chase on production servers. They usually are quite lively. >If you're able to run a throwaway copy of your production database on another >system that you don't have to worry about crashing, you could just replace >ERROR with PANIC and run a high-speed loop of the query that crashed in >production, or something. This might at least tell us whether it's reaching that >condition via something dereferencing a dsa_pointer or something manipulating >the segment lists while allocating/freeing.
I could take a backup and restore the relevant tables on a throwaway system. You are just suggesting to replace line 728 elog(FATAL, "dsa_allocate could not find %zu free pages", npages); by elog(PANIC, "dsa_allocate could not find %zu free pages", npages); correct? Just for my understanding: why would the shutdown of the whole instance create more helpful logging? All the best Arne
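For reference, the change Arne describes amounts to a one-word edit in src/backend/utils/mmgr/dsa.c (the exact line number varies by minor release); sketched as a diff against the code Tom Lane quotes elsewhere in the thread:

```diff
 		if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
-			elog(FATAL,
+			elog(PANIC,
 				 "dsa_allocate could not find %zu free pages", npages);
```

The reason PANIC produces more helpful diagnostics is that it ends in abort(): with core dumps enabled, the crash leaves a core file whose backtrace shows how the bad state was reached, whereas FATAL performs an orderly backend exit and leaves no core.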
Re: dsa_allocate() failure
On Mon, Feb 4, 2019 at 6:52 PM Jakub Glapa wrote: > I see the error showing up every night on 2 different servers. But it's a bit > of a heisenbug because if I go there now it won't be reproducible. Huh. Ok well that's a lot more frequent than I thought. Is it always the same query? Any chance you can get the plan? Are there more things going on on the server, like perhaps concurrent parallel queries? > It was suggested by Justin Pryzby that I recompile pg src with his patch that > would cause a coredump. Small correction to Justin's suggestion: don't abort() after elog(ERROR, ...), it'll never be reached. > But I don't feel comfortable doing this especially if I would have to run > this with prod data. > My question is. Can I do anything like increasing logging level or enable > some additional options? > It's a production server but I'm willing to sacrifice a bit of its > performance if that would help. If you're able to run a throwaway copy of your production database on another system that you don't have to worry about crashing, you could just replace ERROR with PANIC and run a high-speed loop of the query that crashed in production, or something. This might at least tell us whether it's reaching that condition via something dereferencing a dsa_pointer or something manipulating the segment lists while allocating/freeing. In my own 100% unsuccessful attempts to reproduce this I was mostly running the same query (based on my guess at what ingredients are needed), but perhaps it requires a particular allocation pattern that will require more randomness to reach... hmm. -- Thomas Munro http://www.enterprisedb.com
Re: dsa_allocate() failure
Hi Thomas, I was one of the reporters in early December last year. I somehow dropped the ball and forgot about the issue. Anyhow, I upgraded the clusters to pg11.1 and nothing changed. I also have a rule to coredump, but a segfault does not happen while this is occurring. I see the error showing up every night on 2 different servers. But it's a bit of a heisenbug because if I go there now it won't be reproducible. It was suggested by Justin Pryzby that I recompile pg src with his patch that would cause a coredump. But I don't feel comfortable doing this, especially if I would have to run this with prod data. My question is: can I do anything like increasing the logging level or enabling some additional options? It's a production server but I'm willing to sacrifice a bit of its performance if that would help. -- regards, pozdrawiam, Jakub Glapa On Wed, Jan 30, 2019 at 4:13 AM Thomas Munro wrote: > On Tue, Jan 29, 2019 at 10:32 PM Fabio Isabettini > wrote: > > we are facing a similar issue on a Production system using a Postgresql > 10.6: > > > > org.postgresql.util.PSQLException: ERROR: EXCEPTION on getstatistics ; > ID: EXCEPTION on getstatistics_media ; ID: uidatareader. > run_query_media(2): [a1] REMOTE FATAL: dsa_allocate could not find 7 > free pages > > > We would like not to stop the Production system and upgrade it to PG11. > And even though would this guarantee a permanent fix? > > Any suggestion? > > Hi Fabio, > > Thanks for your report. Could you please also show the query plan > that runs on the "remote" node (where the error occurred)? > > There is no indication that upgrading to PG11 would help here. It > seems we have an undiagnosed bug (in 10 and 11), and so far no one has > been able to reproduce it at will. I personally have chewed a lot of > CPU time on several machines trying various plan shapes and not seen > this or the possibly related symptom from bug #15585 even once. But > we have about three reports of each of the two symptoms.
One reporter > wrote to me off-list to say that they'd seen #15585 twice, the second > time by running the same query in a tight loop for 8 hours, and then > not seen it again in the past 3 weeks. Clearly there is an issue needing > a fix here, but I don't yet know what it is. > > -- > Thomas Munro > http://www.enterprisedb.com > >
Re: dsa_allocate() failure
Hi Thomas, it is a Production system and we don't have permanent access to it. Also, having auto_explain always on is not an option in production. I will ask the customer to give us notice as soon as the error presents itself, so we can connect immediately and try to get a query plan. Regards Fabio Isabettini www.voipfuture.com > On 30. Jan 2019, at 04:13:14, Thomas Munro > wrote: > > On Tue, Jan 29, 2019 at 10:32 PM Fabio Isabettini > wrote: >> we are facing a similar issue on a Production system using a Postgresql 10.6: >> >> org.postgresql.util.PSQLException: ERROR: EXCEPTION on getstatistics ; ID: >> EXCEPTION on getstatistics_media ; ID: uidatareader. >> run_query_media(2): [a1] REMOTE FATAL: dsa_allocate could not find 7 free >> pages > >> We would like not to stop the Production system and upgrade it to PG11. And >> even though would this guarantee a permanent fix? >> Any suggestion? > > Hi Fabio, > > Thanks for your report. Could you please also show the query plan > that runs on the "remote" node (where the error occurred)? > > There is no indication that upgrading to PG11 would help here. It > seems we have an undiagnosed bug (in 10 and 11), and so far no one has > been able to reproduce it at will. I personally have chewed a lot of > CPU time on several machines trying various plan shapes and not seen > this or the possibly related symptom from bug #15585 even once. But > we have about three reports of each of the two symptoms. One reporter > wrote to me off-list to say that they'd seen #15585 twice, the second > time by running the same query in a tight loop for 8 hours, and then > not seen it again in the past 3 weeks. Clearly there is an issue needing > a fix here, but I don't yet know what it is. > > -- > Thomas Munro > http://www.enterprisedb.com >
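Regarding the overhead concern with auto_explain: the module can be confined to very slow statements so it is cheap enough to leave loaded. An illustrative postgresql.conf fragment (the threshold value is an assumption; note auto_explain only logs statements that complete, so it helps with slow plans rather than with the aborting query itself):

```ini
shared_preload_libraries = 'auto_explain'  # loading the module requires a restart
auto_explain.log_min_duration = '30s'      # only log plans of statements slower than this
auto_explain.log_analyze = off             # avoid per-node timing overhead in production
```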
Re: dsa_allocate() failure
On Tue, Jan 29, 2019 at 10:32 PM Fabio Isabettini wrote: > we are facing a similar issue on a Production system using a Postgresql 10.6: > > org.postgresql.util.PSQLException: ERROR: EXCEPTION on getstatistics ; ID: > EXCEPTION on getstatistics_media ; ID: uidatareader. > run_query_media(2): [a1] REMOTE FATAL: dsa_allocate could not find 7 free > pages > We would like not to stop the Production system and upgrade it to PG11. And > even though would this guarantee a permanent fix? > Any suggestion? Hi Fabio, Thanks for your report. Could you please also show the query plan that runs on the "remote" node (where the error occurred)? There is no indication that upgrading to PG11 would help here. It seems we have an undiagnosed bug (in 10 and 11), and so far no one has been able to reproduce it at will. I personally have chewed a lot of CPU time on several machines trying various plan shapes and not seen this or the possibly related symptom from bug #15585 even once. But we have about three reports of each of the two symptoms. One reporter wrote to me off-list to say that they'd seen #15585 twice, the second time by running the same query in a tight loop for 8 hours, and then not seen it again in the past 3 weeks. Clearly there is an issue needing a fix here, but I don't yet know what it is. -- Thomas Munro http://www.enterprisedb.com
Re: dsa_allocate() failure
On Tue, Jan 29, 2019 at 2:50 AM Arne Roland wrote: > does anybody have any idea what goes wrong here? Is there some additional > information that could be helpful? Hi Arne, This seems to be a bug; that error should not be reached. I wonder if it is a different manifestation of the bug reported as #15585 (ie some type of rare corruption). Are you able to reproduce this consistently? Can you please show the query plan? -- Thomas Munro http://www.enterprisedb.com
RE: dsa_allocate() failure
Hello, does anybody have any idea what goes wrong here? Is there some additional information that could be helpful? All the best Arne Roland
Re: dsa_allocate() failure
Justin, thanks for the information! I'm running Ubuntu 16.04. I'll try to prepare for the next crash. Couldn't find anything this time. -- regards, Jakub Glapa On Mon, Nov 26, 2018 at 4:52 PM Justin Pryzby wrote: > Hi, thanks for following through. > > On Mon, Nov 26, 2018 at 04:38:35PM +0100, Jakub Glapa wrote: > > I had a look at dmesg and indeed I see something like: > > > > postgres[30667]: segfault at 0 ip 557834264b16 sp 7ffc2ce1e030 > > error 4 in postgres[557833db7000+6d5000] > > That's useful, I think "at 0" means a null pointer was dereferenced. > > Can you check /var/log/messages (or ./syslog or similar) and verify the > timestamp matches the time of the last crash (and not an unrelated crash)? > > The logs might also indicate if the process dumped a core file anywhere. > > I don't know what distribution/OS you're using, but it might be good to > install > abrt (RHEL) or apport (ubuntu) or another mechanism to save coredumps, or to > manually configure /proc/sys/kernel/core_pattern. > > On centos, I usually set: > /etc/abrt/abrt-action-save-package-data.conf > OpenGPGCheck = no > > Also, it might be good to install debug symbols, in case you do find a core > dump now or get one later. > > On centos: yum install postgresql10-debuginfo or debuginfo-install > postgresql10-server > Make sure the debug symbols exactly match the server > version. > > Justin >
Re: dsa_allocate() failure
Hi, thanks for following through. On Mon, Nov 26, 2018 at 04:38:35PM +0100, Jakub Glapa wrote: > I had a look at dmesg and indeed I see something like: > > postgres[30667]: segfault at 0 ip 557834264b16 sp 7ffc2ce1e030 > error 4 in postgres[557833db7000+6d5000] That's useful, I think "at 0" means a null pointer was dereferenced. Can you check /var/log/messages (or ./syslog or similar) and verify the timestamp matches the time of the last crash (and not an unrelated crash)? The logs might also indicate if the process dumped a core file anywhere. I don't know what distribution/OS you're using, but it might be good to install abrt (RHEL) or apport (ubuntu) or another mechanism to save coredumps, or to manually configure /proc/sys/kernel/core_pattern. On centos, I usually set: /etc/abrt/abrt-action-save-package-data.conf OpenGPGCheck = no Also, it might be good to install debug symbols, in case you do find a core dump now or get one later. On centos: yum install postgresql10-debuginfo or debuginfo-install postgresql10-server Make sure the debug symbols exactly match the server version. Justin
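A minimal manual setup along the lines Justin describes, for systems without abrt/apport (the dump directory is an illustrative assumption, and the ulimit must apply to the postmaster's environment):

```shell
# Choose a directory for cores and make it writable by the postgres user
mkdir -p /var/coredumps && chown postgres: /var/coredumps

# Name cores by executable, pid and timestamp so multiple crashes don't collide
sysctl -w kernel.core_pattern='/var/coredumps/core.%e.%p.%t'

# Remove the core-size limit for the server process, e.g. start it with:
#   su - postgres -c 'ulimit -c unlimited && pg_ctl -D "$PGDATA" start'
```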
Re: dsa_allocate() failure
On Fri, Nov 23, 2018 at 03:31:41PM +0100, Jakub Glapa wrote: > Hi Justin, I've upgraded to 10.6 but the error still shows up: > > If I set it to max_parallel_workers=0 I also get the error and my connection is being > closed (but the server is alive): > > psql db@host as user => set max_parallel_workers=0; Can you show the plan (explain without analyze) for the nonparallel case? Also, it looks like the server crashed in that case (even if it restarted itself quickly). Can you confirm? For example: dmesg | tail might show "postmaster[8582]: segfault [...]" or similar. And other clients would've been disconnected. (For example, you'd get an error in another, previously-connected session the next time you run: SELECT 1). In any case, could you try to find a minimal way to reproduce the problem? I mean, is the dataset and query small and something you can publish, or can you reproduce with data generated from (for example) generate_series()? Thanks, Justin
Re: dsa_allocate() failure
Hi Justin, I've upgraded to 10.6 but the error still shows up: psql db@host as user => select version(); version ─ PostgreSQL 10.6 (Ubuntu 10.6-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609, 64-bit (1 row) Time: 110.512 ms psql db@host as user => select from fa where client_id in () and datetime >= '2018/01/01' and ((dims ? 'p' and dimensions ? 'mcp') or (datasource in ('FA', 'GA'))) and not datasource = 'M' GROUP BY datasource, dims ->'ct', dimensions ->'mct', dims -> 'p', dims -> 'sp'; ERROR: XX000: dsa_allocate could not find 7 free pages CONTEXT: parallel worker LOCATION: dsa_allocate_extended, dsa.c:729 Time: 131400.831 ms (02:11.401) the above is executed with max_parallel_workers=8 If I set it to max_parallel_workers=0 I also get the error and my connection is being closed (but the server is alive): psql db@host as user => set max_parallel_workers=0; SET Time: 89.542 ms psql db@host as user => SELECT ; FATAL: XX000: dsa_allocate could not find 7 free pages LOCATION: dsa_allocate_extended, dsa.c:729 SSL connection has been closed unexpectedly The connection to the server was lost. Attempting reset: Succeeded. Time: 200390.466 ms (03:20.390) -- regards, Jakub Glapa On Thu, Nov 22, 2018 at 5:10 PM Justin Pryzby wrote: > On Wed, Nov 21, 2018 at 03:26:42PM +0100, Jakub Glapa wrote: > > Looks like my email didn't match the right thread: > > https://www.postgresql.org/message-id/flat/CAMAYy4%2Bw3NTBM5JLWFi8twhWK4%3Dk_5L4nV5%2BbYDSPu8r4b97Zg%40mail.gmail.com > > Any chance to get some feedback on this? > > In the related thread, it looks like Thomas backpatched a fix to v10, and > so I > guess this should be resolved in 10.6, which was released a couple of weeks ago. > > https://www.postgresql.org/message-id/CAEepm%3D0QxoUSkFqYbvmxi2eNvvU6BkqH6fTOu4oOzc1MRAT4Dw%40mail.gmail.com > > Could you upgrade and check ?
> > 38763d67784c6563d08dbea5c9f913fa174779b8 in master > > |commit ba20d392584cdecc2808fe936448d127f43f2c07 > |Author: Thomas Munro > |Date: Thu Sep 20 15:52:39 2018 +1200 > | > |Fix segment_bins corruption in dsa.c. > > Justin >
Re: dsa_allocate() failure
On Wed, Nov 21, 2018 at 03:26:42PM +0100, Jakub Glapa wrote: > Looks like my email didn't match the right thread: > https://www.postgresql.org/message-id/flat/CAMAYy4%2Bw3NTBM5JLWFi8twhWK4%3Dk_5L4nV5%2BbYDSPu8r4b97Zg%40mail.gmail.com > Any chance to get some feedback on this? In the related thread, it looks like Thomas backpatched a fix to v10, and so I guess this should be resolved in 10.6, which was released a couple of weeks ago. https://www.postgresql.org/message-id/CAEepm%3D0QxoUSkFqYbvmxi2eNvvU6BkqH6fTOu4oOzc1MRAT4Dw%40mail.gmail.com Could you upgrade and check ? 38763d67784c6563d08dbea5c9f913fa174779b8 in master |commit ba20d392584cdecc2808fe936448d127f43f2c07 |Author: Thomas Munro |Date: Thu Sep 20 15:52:39 2018 +1200 | |Fix segment_bins corruption in dsa.c. Justin
Re: dsa_allocate() failure
Hi, I'm also experiencing the problem: dsa_allocate could not find 7 free pages CONTEXT: parallel worker I'm running: PostgreSQL 10.5 (Ubuntu 10.5-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609, 64-bit query plan: (select statement over a parent table with many partitions): select ... from fa where c_id in () and datetime >= '2018/01/01' and ((dims ? 'p' and dims ? 'mcp') or (datasource in ('FA', 'GA'))) and not datasource = 'm' GROUP BY datasource, dims ->'ct', dims ->'mcp', dims -> 'p', dims -> 'sp': Finalize GroupAggregate (cost=31514757.77..31519357.77 rows=4 width=223) Group Key: fa.datasource, ((fa.dims -> 'ct'::text)), ((fa.dims -> 'mcp'::text)), ((fa.dims -> 'p'::text)), ((fa.dims -> 'sp'::text)) -> Sort (cost=31514757.77..31515057.77 rows=12 width=223) Sort Key: fa.datasource, ((fa.dims -> 'ct'::text)), ((fa.dims -> 'mcp'::text)), ((fa.dims -> 'p'::text)), ((fa.dims -> 'sp'::text)) -> Gather (cost=31491634.17..31504634.17 rows=12 width=223) Workers Planned: 3 -> Partial HashAggregate (cost=31490634.17..31491634.17 rows=4 width=223) Group Key: fa.datasource, (fa.dims -> 'ct'::text), (fa.dims -> 'mcp'::text), (fa.dims -> 'p'::text), (fa.dims -> 'sp'::text) -> Result (cost=0.00..31364713.39 rows=5596479 width=175) -> Append (cost=0.00..31252783.81 rows=5596479 width=659) -> Parallel Seq Scan on fa (cost=0.00..0.00 rows=1 width=580) Filter: ((datetime >= '2018-01-01 00:00:00+01'::timestamp with time zone) AND ((datasource)::text <> 'M'::text) AND (((dims ? 'p'::text) AND (dims ? 'mcp'::text)) OR ((datasource)::text = ANY ('{"FA","GA"}'::text[]))) AND (c_id = ANY ('{}'::bigint[]))) -> Parallel Bitmap Heap Scan on fa_10 (cost=1226.36..53641.49 rows=1 width=1290) Recheck Cond: (datetime >= '2018-01-01 00:00:00+01'::timestamp with time zone) Filter: (((datasource)::text <> 'M'::text) AND (((dims ? 'p'::text) AND (dims ? 
'mcp'::text)) OR ((datasource)::text = ANY (''::bigint[]))) -> Bitmap Index Scan on fa_10_rangestart (cost=0.00..1226.36 rows=32259 width=0) Index Cond: (datetime >= '2018-01-01 00:00:00+01'::timestamp with time zone) -> Parallel Seq Scan on fa_105 (cost=0.00..11.99 rows=1 width=580) Filter: ((datetime >= '2018-01-01 00:00:00+01'::timestamp with time zone) AND ((datasource)::text <> 'M'::text) AND (((dims ? 'p'::text) AND (dims ? 'mcp'::text)) OR ((datasource)::text = ANY ('{"FA","GA"}'::text[]))) AND (c_id = ANY (''::bigint[]))) -> Parallel Seq Scan on fa_106 (cost=0.00..11.99 rows=1 width=580) Filter: ((datetime >= '2018-01-01 00:00:00+01'::timestamp with time zone) AND ((datasource)::text <> 'M'::text) AND (((dims ? 'p'::text) AND (dims ? 'mcp'::text)) OR ((datasource)::text = ANY ('{"FA","GA"}'::text[]))) AND (c_id = ANY ('.. -- regards, Jakub Glapa
Re: dsa_allocate() failure
On Wed, Aug 29, 2018 at 5:48 PM Sand Stone wrote: > I attached a query (and its query plan) that caused the crash: "dsa_allocate > could not find 13 free pages" on one of the worker nodes. I anonymised the > query text a bit. Interestingly, this time only one (same one) of the nodes > is crashing. Since this is a production environment, I cannot get the stack > trace. Once I turned off parallel execution for this node, the whole query > finished just fine. So the parallel query plan is from one of the nodes that did not > crash; hopefully the same plan would have been executed on the crashed > node. In theory, every worker node has the same bits, and very similar data. I wonder if this was a different symptom of the problem fixed here: https://www.postgresql.org/message-id/flat/194c0706-c65b-7d81-ab32-2c248c3e2344%402ndquadrant.com Can you still reproduce it on current master, REL_11_STABLE or REL_10_STABLE? -- Thomas Munro http://www.enterprisedb.com
Re: dsa_allocate() failure
I attached a query (and its query plan) that caused the crash: "dsa_allocate could not find 13 free pages" on one of the worker nodes. I anonymised the query text a bit. Interestingly, this time only one (same one) of the nodes is crashing. Since this is a production environment, I cannot get the stack trace. Once I turned off parallel execution for this node, the whole query finished just fine. So the parallel query plan is from one of the nodes that did not crash; hopefully the same plan would have been executed on the crashed node. In theory, every worker node has the same bits, and very similar data. === psql (10.4) \dx List of installed extensions Name | Version | Schema |Description +-++--- citus | 7.4-3 | pg_catalog | Citus distributed database hll| 2.10| public | type for storing hyperloglog data plpgsql| 1.0 | pg_catalog | PL/pgSQL procedural language On Sat, Aug 25, 2018 at 7:46 AM Sand Stone wrote: > >Can you still see the problem with Citus 7.4? > Hi, Thomas. I actually went back to the cluster with Citus7.4 and > PG10.4. And modified the parallel param. So far, I haven't seen any > server crash. > > The main difference between crashes observed and no crash, is the set > of Linux TCP time out parameters (to release the ports faster). > Unfortunately, I cannot "undo" the Linux params and run the stress > tests anymore, as this is a multi-million $ cluster and people are > doing more useful things on it. I will keep an eye on any parallel > execution issue. > > > On Wed, Aug 15, 2018 at 3:43 PM Thomas Munro > wrote: > > > > On Thu, Aug 16, 2018 at 8:32 AM, Sand Stone > wrote: > > > Just as a follow up. I tried the parallel execution again (in a stress > > > test environment). Now the crash seems gone. I will keep an eye on > > > this for the next few weeks. > > > > Thanks for the report. That's great news, but it'd be good to > > understand why it was happening. 
> > > > > My theory is that the Citus cluster created and shut down a lot of TCP > > > connections between coordinator and workers. If running on untuned > > > Linux machines, the TCP ports might run out. > > > > I'm not sure how that's relevant, unless perhaps it causes executor > > nodes to be invoked in a strange sequence that commit fd7c0fa7 didn't > > fix? I wonder if there could be something different about the control > > flow with custom scans, or something about the way Citus worker nodes > > invoke plan fragments, or some error path that I failed to consider... > > It's a clue that all of your worker nodes reliably crashed at the same > > time on the same/similar queries (presumably distributed query > > fragments for different shards), making it seem more like a > > common-or-garden bug rather than some kind of timing-based heisenbug. > > If you ever manage to reproduce it, an explain plan and a back trace > > would be very useful. > > > > > Of course, I am using "newer" PG10 bits and Citus7.5 this time. > > > > Hmm. There weren't any relevant commits to REL_10_STABLE that I can > > think of. And (with the proviso that I know next to nothing about > > Citus) I just cloned https://github.com/citusdata/citus.git and > > skimmed through "git diff origin/release-7.4..origin/release-7.5", and > > nothing is jumping out at me. Can you still see the problem with > > Citus 7.4? 
> > > > -- > > Thomas Munro > > http://www.enterprisedb.com > explain(analyze, verbose) select atimecol as atimecol, count(1) as count, sum(avg) as sum from ( select partitionid, atimecol as atimecol, avg(sum/count) as avg from atable where atimecol >= '7/31/2018' and atimecol <= '8/28/2018' and aText = 'Foo' and (jsoncol ->> 'X') = '-' and id=12345 and acol = 2 and btimecol='7/31/2018' and btimecol<'8/29/2018' group by partitionid, atimecol ) as subquery group by atimecol; QUERY PLAN --- --- HashAggregate (cost=0.00..0.00 rows=0 width=0) (actual time=535.553..535.553 rows=0 loops=1) Output: remote_scan.atimecol, COALESCE((pg_catalog.sum(remote_scan.count))::bigint, '0'::bigint), sum(remote_scan.sum) Group Key: remote_scan.atimecol -> Custom Scan (Citus Real-Time) (cost=0.00..0.00 rows=0 width=0) (actual time=535.551..535.551 rows=0 loops=1)
Re: dsa_allocate() failure
Just as a follow up. I tried the parallel execution again (in a stress test environment). Now the crash seems gone. I will keep an eye on this for the next few weeks. My theory is that the Citus cluster created and shut down a lot of TCP connections between coordinator and workers. If running on untuned Linux machines, the TCP ports might run out. Of course, I am using "newer" PG10 bits and Citus7.5 this time. On Wed, May 23, 2018 at 7:06 AM Sand Stone wrote: > > >> At which commit ID? > 83fcc615020647268bb129cbf86f7661feee6412 (5/6) > > >>do you mean that these were separate PostgreSQL clusters, and they were all > >>running the same query and they all crashed like this? > A few worker nodes, a table is hash partitioned by "aTable.did" by > Citus, and further partitioned by PG10 by time range on field "ts". As > far as I could tell, Citus just does a query rewrite, and executes the > same type of queries on all nodes. > > >>so this happened at the same time or at different times? > At the same time. The queries are simple count and sum queries, here > is the relevant part from one of the worker nodes: > 2018-05-23 01:24:01.492 UTC [130536] ERROR: dsa_allocate could not > find 7 free pages > 2018-05-23 01:24:01.492 UTC [130536] CONTEXT: parallel worker > STATEMENT: COPY (SELECT count(1) AS count, sum(worker_column_1) AS > sum FROM (SELECT subquery.avg AS worker_column_1 FROM (SELECT > aTable.did, avg((aTable.sum OPERATOR(pg_catalog./) > (aTable.count)::double precision)) AS avg FROM public.aTable_102117 > aTable WHERE ((aTable.ts OPERATOR(pg_catalog.>=) '2018-04-25 > 00:00:00+00'::timestamp with time zone) AND (aTable.ts > OPERATOR(pg_catalog.<=) '2018-04-30 00:00:00+00'::timestamp with time > zone) AND (aTable.v OPERATOR(pg_catalog.=) 12345)) GROUP BY > aTable.did) subquery) worker_subquery) TO STDOUT WITH (FORMAT binary) > > > >> a parallel worker process > I think this is more of a PG10 parallel bg worker issue. 
I don't think > Citus just lets each worker PG server do its own planning. > > I will try to do more experiments about this, and see if there is any > specific query that causes the parallel query execution to fail. As far > as I can tell, the level of concurrency triggered this issue. That is, > executing 10s of queries as shown on the worker nodes; depending on > the stats, the PG10 core may or may not spawn more bg workers. > > Thanks for your time! > > > > > > On Tue, May 22, 2018 at 9:44 PM, Thomas Munro > wrote: > > On Wed, May 23, 2018 at 4:10 PM, Sand Stone wrote: > dsa_allocate could not find 7 free pages > >> I just saw this error message again on all of my worker nodes (I am using > >> Citus 7.4 rel). The PG core is my own build of release_10_stable > >> (10.4) out of GitHub on Ubuntu. > > > > At which commit ID? > > > > All of your worker nodes... so this happened at the same time or at > > different times? I don't know much about Citus -- do you mean that > > these were separate PostgreSQL clusters, and they were all running the > > same query and they all crashed like this? > > > >> What's the best way to debug this? I am running pre-production tests > >> for the next few days, so I could gather info. if necessary (I cannot > >> pinpoint a query to repro this yet, as we have 10K queries running > >> concurrently). > > > > Any chance of an EXPLAIN plan for the query that crashed like this? > > Do you know if it's using multiple Gather[Merge] nodes and parallel > > bitmap heap scans? Was it a regular backend process or a parallel > > worker process (or a Citus worker process, if that is a thing?) that > > raised the error? > > > > -- > > Thomas Munro > > http://www.enterprisedb.com
Re: dsa_allocate() failure
>>dsa_allocate could not find 7 free pages I just saw this error message again on all of my worker nodes (I am using Citus 7.4 rel). The PG core is my own build of release_10_stable (10.4) out of GitHub on Ubuntu. What's the best way to debug this? I am running pre-production tests for the next few days, so I could gather info. if necessary (I cannot pinpoint a query to repro this yet, as we have 10K queries running concurrently). On Mon, Jan 29, 2018 at 1:35 PM, Rick Otten wrote: > If I do a "set max_parallel_workers_per_gather=0;" before I run the query in > that session, it runs just fine. > If I set it to 2, the query dies with the dsa_allocate error. > > I'll use that as a work around until 10.2 comes out. Thanks! I have > something that will help. > > > On Mon, Jan 29, 2018 at 3:52 PM, Thomas Munro > wrote: >> >> On Tue, Jan 30, 2018 at 5:37 AM, Tom Lane wrote: >> > Rick Otten writes: >> >> I'm wondering if there is anything I can tune in my PG 10.1 database to >> >> avoid these errors: >> > >> >> $ psql -f failing_query.sql >> >> psql:failing_query.sql:46: ERROR: dsa_allocate could not find 7 free >> >> pages >> >> CONTEXT: parallel worker >> > >> > Hmm. There's only one place in the source code that emits that message >> > text: >> > >> > /* >> > * Ask the free page manager for a run of pages. This should >> > always >> > * succeed, since both get_best_segment and make_new_segment >> > should >> > * only return a non-NULL pointer if it actually contains enough >> > * contiguous freespace. If it does fail, something in our >> > backend >> > * private state is out of whack, so use FATAL to kill the >> > process. >> > */ >> > if (!FreePageManagerGet(segment_map->fpm, npages, &first_page)) >> > elog(FATAL, >> > "dsa_allocate could not find %zu free pages", npages); >> > >> > Now maybe that comment is being unreasonably optimistic, but it sure >> > appears that this is supposed to be a can't-happen case, in which case >> > you've found a bug. 
>> >> This is probably the bug fixed here: >> >> >> https://www.postgresql.org/message-id/E1eQzIl-0004wM-K3%40gemulon.postgresql.org >> >> That was back patched, so 10.2 will contain the fix. The bug was not >> in dsa.c itself, but in the parallel query code that mixed up DSA >> areas, corrupting them. The problem comes up when the query plan has >> multiple Gather nodes (and a particular execution pattern) -- is that >> the case here, in the EXPLAIN output? That seems plausible given the >> description of a 50-branch UNION. The only workaround until 10.2 >> would be to reduce max_parallel_workers_per_gather to 0 to prevent >> parallelism completely for this query. >> >> -- >> Thomas Munro >> http://www.enterprisedb.com > >
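(Editor's note, not part of the original thread.) The session-level workaround quoted above is a pair of ordinary SQL commands; a minimal sketch:

```sql
-- Disable parallel query for this session only (workaround until 10.2),
-- run the affected query, then restore the configured default.
SET max_parallel_workers_per_gather = 0;
-- ... run the failing query here ...
RESET max_parallel_workers_per_gather;
```

SET without LOCAL affects only the current session, so other sessions keep whatever parallel settings they already have.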
Re: dsa_allocate() faliure
On Tue, Jan 30, 2018 at 5:37 AM, Tom Lane wrote:
> Rick Otten writes:
>> I'm wondering if there is anything I can tune in my PG 10.1 database
>> to avoid these errors:
>>
>> $ psql -f failing_query.sql
>> psql:failing_query.sql:46: ERROR: dsa_allocate could not find 7 free
>> pages
>> CONTEXT: parallel worker
>
> Hmm. There's only one place in the source code that emits that message
> text:
>
>     /*
>      * Ask the free page manager for a run of pages.  This should always
>      * succeed, since both get_best_segment and make_new_segment should
>      * only return a non-NULL pointer if it actually contains enough
>      * contiguous freespace.  If it does fail, something in our backend
>      * private state is out of whack, so use FATAL to kill the process.
>      */
>     if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
>         elog(FATAL,
>              "dsa_allocate could not find %zu free pages", npages);
>
> Now maybe that comment is being unreasonably optimistic, but it sure
> appears that this is supposed to be a can't-happen case, in which case
> you've found a bug.

This is probably the bug fixed here:

https://www.postgresql.org/message-id/E1eQzIl-0004wM-K3%40gemulon.postgresql.org

That was back-patched, so 10.2 will contain the fix. The bug was not
in dsa.c itself, but in the parallel query code that mixed up DSA
areas, corrupting them. The problem comes up when the query plan has
multiple Gather nodes (and a particular execution pattern) -- is that
the case here, in the EXPLAIN output? That seems plausible given the
description of a 50-branch UNION. The only workaround until 10.2
would be to reduce max_parallel_workers_per_gather to 0 to prevent
parallelism completely for this query.

--
Thomas Munro
http://www.enterprisedb.com
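(Editor's note, not part of the original thread.) Whether a plan contains multiple Gather nodes can be checked mechanically against saved EXPLAIN text output. A small illustrative Python sketch (the function name is my own; it relies only on the EXPLAIN text format, where each plan node appears either at the start of a line or after an "->" arrow):

```python
import re

def count_gather_nodes(plan_text: str) -> int:
    """Count Gather and Gather Merge nodes in EXPLAIN text output.

    A node name appears either at the start of a line (the root node)
    or after an "->" arrow, always followed by the "(cost=..." group,
    so match both positions.
    """
    return len(re.findall(r"(?:^|->)\s*Gather(?: Merge)?\s+\(",
                          plan_text, re.MULTILINE))
```

A count greater than 1 means the plan has multiple Gather nodes and so could be exposed to the bug described above.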
Re: dsa_allocate() faliure
Rick Otten writes:
> I'm wondering if there is anything I can tune in my PG 10.1 database to
> avoid these errors:

> $ psql -f failing_query.sql
> psql:failing_query.sql:46: ERROR: dsa_allocate could not find 7 free pages
> CONTEXT: parallel worker

Hmm.  There's only one place in the source code that emits that message
text:

    /*
     * Ask the free page manager for a run of pages.  This should always
     * succeed, since both get_best_segment and make_new_segment should
     * only return a non-NULL pointer if it actually contains enough
     * contiguous freespace.  If it does fail, something in our backend
     * private state is out of whack, so use FATAL to kill the process.
     */
    if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
        elog(FATAL,
             "dsa_allocate could not find %zu free pages", npages);

Now maybe that comment is being unreasonably optimistic, but it sure
appears that this is supposed to be a can't-happen case, in which case
you've found a bug.  cc'ing the DSA authors for comment.

			regards, tom lane