Re: dsa_allocate() failure

2019-02-18 Thread Jakub Glapa
Hi, I just checked dmesg.
The segfault I wrote about is the only one I see, dated Nov 24 last year.
Since then no other segfaults have happened, although the dsa_allocate failures
still happen daily.
I'll report if anything occurs.
I have the core-dump setup in place.



--
regards,
pozdrawiam,
Jakub Glapa


On Sun, Feb 17, 2019 at 11:21 PM Justin Pryzby  wrote:

> Hi,
>
> On Mon, Nov 26, 2018 at 09:52:07AM -0600, Justin Pryzby wrote:
> > Hi, thanks for following through.
> >
> > On Mon, Nov 26, 2018 at 04:38:35PM +0100, Jakub Glapa wrote:
> > > I had a look at dmesg and indeed I see something like:
> > >
> > > postgres[30667]: segfault at 0 ip 557834264b16 sp 7ffc2ce1e030
> > > error 4 in postgres[557833db7000+6d5000]
> >
> > That's useful, I think "at 0" means a null pointer dereferenced.
>
> Thomas fixed several bugs in DSA, which will be in the next minor releases,
> PostgreSQL 10.8 and 11.3.
>
> However, that doesn't explain the segfault you saw, and I don't see anything
> relevant-looking that has changed since 10.5.
>
> If you still see the error on the latest minor release (10.7), please try to
> capture a core file and send a backtrace in a new thread on pgsql-hackers.
>
> Thanks,
> Justin
>


Re: dsa_allocate() failure

2019-02-17 Thread Justin Pryzby
Hi,

On Mon, Nov 26, 2018 at 09:52:07AM -0600, Justin Pryzby wrote:
> Hi, thanks for following through.
> 
> On Mon, Nov 26, 2018 at 04:38:35PM +0100, Jakub Glapa wrote:
> > I had a look at dmesg and indeed I see something like:
> > 
> > postgres[30667]: segfault at 0 ip 557834264b16 sp 7ffc2ce1e030
> > error 4 in postgres[557833db7000+6d5000]
> 
> That's useful, I think "at 0" means a null pointer dereferenced.

Thomas fixed several bugs in DSA, which will be in the next minor releases,
PostgreSQL 10.8 and 11.3.

However, that doesn't explain the segfault you saw, and I don't see anything
relevant-looking that has changed since 10.5.

If you still see the error on the latest minor release (10.7), please try to
capture a core file and send a backtrace in a new thread on pgsql-hackers.

Thanks,
Justin



Re: dsa_allocate() failure

2019-02-04 Thread Justin Pryzby
On Mon, Feb 04, 2019 at 08:31:47PM +, Arne Roland wrote:
> I could take a backup and restore the relevant tables on a throwaway system. 
> You are just suggesting to replace line 728
> elog(FATAL,
>  "dsa_allocate could not find %zu free pages", npages);
> by
> elog(PANIC,
>  "dsa_allocate could not find %zu free pages", npages);
> correct? Just for my understanding: why would the shutdown of the whole 
> instance create more helpful logging?

You'd also need to start the server with pg_ctl -c, which allows it to dump core;
the core file can then be inspected with GDB to get a backtrace and other
internals, which up to now nobody (including myself) has been able to provide.
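
A rough sketch of that workflow might look like this (the paths are
placeholders, and gdb has to be pointed at the exact postgres binary that
crashed):

# start the server with core files enabled
pg_ctl -D /path/to/data -c -l /path/to/logfile start

# after the next crash, the core file typically lands in the data directory
# (depending on kernel.core_pattern)
gdb /usr/lib/postgresql/10/bin/postgres /path/to/data/core
(gdb) bt full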

Justin



RE: dsa_allocate() failure

2019-02-04 Thread Arne Roland
It's definitely a relatively complex pattern. The query I sent you last time
was minimal with respect to predicates (removing any single one of the
predicates turned it into a working query).
> Huh.  Ok well that's a lot more frequent than I thought.  Is it always the 
> same query?  Any chance you can get the plan?  Are there more things going on 
> on the server, like perhaps concurrent parallel queries?
I had this bug occur while I was the only one working on the server. I
checked: there was just one transaction with a snapshot at all, and it was an
autovacuum busy with a totally unrelated relation my colleague was working on.

The bug is indeed behaving like a ghost.
One child relation needed a few new rows to test a particular application a 
colleague of mine was working on. The insert triggered an autoanalyze and the 
explain changed slightly:
Besides row and cost estimates, the change is that the line
Recheck Cond: (((COALESCE((fid)::bigint, fallback) ) >= 1) AND ((COALESCE((fid)::bigint, fallback) ) <= 1) AND (gid && '{853078,853080,853082}'::integer[]))
is now
Recheck Cond: ((gid && '{853078,853080,853082}'::integer[]) AND ((COALESCE((fid)::bigint, fallback) ) >= 1) AND ((COALESCE((fid)::bigint, fallback) ) <= 1))
and the error vanished.

I could try to hunt down another query by assembling seemingly random queries.
I don't see a very clear pattern among the queries aborting with this error on
our production servers. I'm not surprised this bug is hard to chase on
production servers; they usually are quite lively.

>If you're able to run a throwaway copy of your production database on another 
>system that you don't have to worry about crashing, you could just replace 
>ERROR with PANIC and run a high-speed loop of the query that crashed in 
>production, or something.  This might at least tell us whether it reaches that 
>condition via something dereferencing a dsa_pointer or something manipulating 
>the segment lists while allocating/freeing.

I could take a backup and restore the relevant tables on a throwaway system. 
You are just suggesting to replace line 728
elog(FATAL,
 "dsa_allocate could not find %zu free pages", npages);
by
elog(PANIC,
 "dsa_allocate could not find %zu free pages", npages);
correct? Just for my understanding: why would the shutdown of the whole 
instance create more helpful logging?

All the best
Arne


Re: dsa_allocate() failure

2019-02-04 Thread Thomas Munro
On Mon, Feb 4, 2019 at 6:52 PM Jakub Glapa  wrote:
> I see the error showing up every night on 2 different servers. But it's a bit 
> of a heisenbug because if I go there now it won't be reproducible.

Huh.  Ok well that's a lot more frequent than I thought.  Is it always
the same query?  Any chance you can get the plan?  Are there more
things going on on the server, like perhaps concurrent parallel
queries?

> It was suggested by Justin Pryzby that I recompile pg src with his patch that 
> would cause a coredump.

Small correction to Justin's suggestion: don't abort() after
elog(ERROR, ...), it'll never be reached.

> But I don't feel comfortable doing this especially if I would have to run 
> this with prod data.
> My question is. Can I do anything like increasing logging level or enable 
> some additional options?
> It's a production server but I'm willing to sacrifice a bit of its 
> performance if that would help.

If you're able to run a throwaway copy of your production database on
another system that you don't have to worry about crashing, you could
just replace ERROR with PANIC and run a high-speed loop of the query
that crashed in production, or something.  This might at least tell us
whether it reaches that condition via something dereferencing a
dsa_pointer or something manipulating the segment lists while
allocating/freeing.
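
A minimal sketch of such a loop, assuming the crashing statement has been saved
to a file (database and file names are placeholders):

# rerun the query until the first failure, i.e. until the PANIC fires
while psql -d testdb -f crashing_query.sql > /dev/null; do :; done

# or drive several sessions at once, since shared parallel-query state is involved
pgbench -n -c 4 -T 3600 -f crashing_query.sql testdb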

In my own 100% unsuccessful attempts to reproduce this I was mostly
running the same query (based on my guess at what ingredients are
needed), but perhaps it requires a particular allocation pattern that
will require more randomness to reach... hmm.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: dsa_allocate() failure

2019-02-03 Thread Jakub Glapa
Hi Thomas,
I was one of the reporters in early December last year.
I somehow dropped the ball and forgot about the issue.
Anyhow, I upgraded the clusters to PG 11.1 and nothing changed. I also have a
rule to core dump, but a segfault does not happen while this is occurring.
I see the error showing up every night on 2 different servers, but it's a
bit of a heisenbug: if I go there now it won't be reproducible.
It was suggested by Justin Pryzby that I recompile the PG source with his patch
that would cause a core dump.
But I don't feel comfortable doing this, especially since I would have to run
it with prod data.
My question is: can I do anything like increasing the logging level or enabling
some additional options?
It's a production server, but I'm willing to sacrifice a bit of its
performance if that would help.


--
regards,
pozdrawiam,
Jakub Glapa


On Wed, Jan 30, 2019 at 4:13 AM Thomas Munro 
wrote:

> On Tue, Jan 29, 2019 at 10:32 PM Fabio Isabettini
>  wrote:
> >  we are facing a similar issue on a Production system using a Postgresql
> 10.6:
> >
> > org.postgresql.util.PSQLException: ERROR: EXCEPTION on getstatistics ;
> ID: EXCEPTION on getstatistics_media ; ID: uidatareader.
> > run_query_media(2): [a1] REMOTE FATAL: dsa_allocate could not find 7
> free pages
>
> > We would like not to stop the Production system and upgrade it to PG11.
> > And even then, would this guarantee a permanent fix?
> > Any suggestion?
>
> Hi Fabio,
>
> Thanks for your report.  Could you please also show the query plan
> that runs on the "remote" node (where the error occurred)?
>
> There is no indication that upgrading to PG11 would help here.  It
> seems we have an undiagnosed bug (in 10 and 11), and so far no one has
> been able to reproduce it at will.  I personally have chewed a lot of
> CPU time on several machines trying various plan shapes and not seen
> this or the possibly related symptom from bug #15585 even once.  But
> we have about three reports of each of the two symptoms.  One reporter
> wrote to me off-list to say that they'd seen #15585 twice, the second
> time by running the same query in a tight loop for 8 hours, and then
> not seen it again in the past 3 weeks.  Clearly there is an issue needing
> a fix here, but I don't yet know what it is.
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>
>


Re: dsa_allocate() failure

2019-01-30 Thread Fabio Isabettini
Hi Thomas,
it is a Production system and we don’t have permanent access to it.
Also, having auto_explain always on is not an option in production.
I will ask the customer to notify us as soon as the error presents itself, so
we can connect immediately and try to get a query plan.

Regards

Fabio Isabettini
www.voipfuture.com 

> On 30. Jan 2019, at 04:13:14, Thomas Munro  
> wrote:
> 
> On Tue, Jan 29, 2019 at 10:32 PM Fabio Isabettini
>  wrote:
>> we are facing a similar issue on a Production system using a Postgresql 10.6:
>> 
>> org.postgresql.util.PSQLException: ERROR: EXCEPTION on getstatistics ; ID: 
>> EXCEPTION on getstatistics_media ; ID: uidatareader.
>> run_query_media(2): [a1] REMOTE FATAL: dsa_allocate could not find 7 free 
>> pages
> 
>> We would like not to stop the Production system and upgrade it to PG11. And 
>> even then, would this guarantee a permanent fix?
>> Any suggestion?
> 
> Hi Fabio,
> 
> Thanks for your report.  Could you please also show the query plan
> that runs on the "remote" node (where the error occurred)?
> 
> There is no indication that upgrading to PG11 would help here.  It
> seems we have an undiagnosed bug (in 10 and 11), and so far no one has
> been able to reproduce it at will.  I personally have chewed a lot of
> CPU time on several machines trying various plan shapes and not seen
> this or the possibly related symptom from bug #15585 even once.  But
> we have about three reports of each of the two symptoms.  One reporter
> wrote to me off-list to say that they'd seen #15585 twice, the second
> time by running the same query in a tight loop for 8 hours, and then
> not seen it again in the past 3 weeks.  Clearly there is an issue needing
> a fix here, but I don't yet know what it is.
> 
> -- 
> Thomas Munro
> http://www.enterprisedb.com
> 





Re: dsa_allocate() failure

2019-01-29 Thread Thomas Munro
On Tue, Jan 29, 2019 at 10:32 PM Fabio Isabettini
 wrote:
>  we are facing a similar issue on a Production system using a Postgresql 10.6:
>
> org.postgresql.util.PSQLException: ERROR: EXCEPTION on getstatistics ; ID: 
> EXCEPTION on getstatistics_media ; ID: uidatareader.
> run_query_media(2): [a1] REMOTE FATAL: dsa_allocate could not find 7 free 
> pages

> We would like not to stop the Production system and upgrade it to PG11. And 
> even then, would this guarantee a permanent fix?
> Any suggestion?

Hi Fabio,

Thanks for your report.  Could you please also show the query plan
that runs on the "remote" node (where the error occurred)?

There is no indication that upgrading to PG11 would help here.  It
seems we have an undiagnosed bug (in 10 and 11), and so far no one has
been able to reproduce it at will.  I personally have chewed a lot of
CPU time on several machines trying various plan shapes and not seen
this or the possibly related symptom from bug #15585 even once.  But
we have about three reports of each of the two symptoms.  One reporter
wrote to me off-list to say that they'd seen #15585 twice, the second
time by running the same query in a tight loop for 8 hours, and then
not seen it again in the past 3 weeks.  Clearly there is an issue needing
a fix here, but I don't yet know what it is.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: dsa_allocate() failure

2019-01-28 Thread Thomas Munro
On Tue, Jan 29, 2019 at 2:50 AM Arne Roland  wrote:
> does anybody have any idea what goes wrong here? Is there some additional 
> information that could be helpful?

Hi Arne,

This seems to be a bug; that error should not be reached.  I wonder if
it is a different manifestation of the bug reported as #15585 (ie some
type of rare corruption).  Are you able to reproduce this
consistently?  Can you please show the query plan?

-- 
Thomas Munro
http://www.enterprisedb.com



RE: dsa_allocate() failure

2019-01-28 Thread Arne Roland
Hello,

does anybody have any idea what goes wrong here? Is there some additional 
information that could be helpful?

All the best
Arne Roland


Re: dsa_allocate() failure

2018-11-26 Thread Jakub Glapa
Justin, thanks for the information!
I'm running Ubuntu 16.04.
I'll try to prepare for the next crash.
Couldn't find anything this time.


--
regards,
Jakub Glapa


On Mon, Nov 26, 2018 at 4:52 PM Justin Pryzby  wrote:

> Hi, thanks for following through.
>
> On Mon, Nov 26, 2018 at 04:38:35PM +0100, Jakub Glapa wrote:
> > I had a look at dmesg and indeed I see something like:
> >
> > postgres[30667]: segfault at 0 ip 557834264b16 sp 7ffc2ce1e030
> > error 4 in postgres[557833db7000+6d5000]
>
> That's useful, I think "at 0" means a null pointer dereferenced.
>
> Can you check /var/log/messages (or ./syslog or similar) and verify the
> timestamp matches the time of the last crash (and not an unrelated crash) ?
>
> The logs might also indicate if the process dumped a core file anywhere.
>
> I don't know what distribution/OS you're using, but it might be good to
> install
> abrt (RHEL) or apport (ubuntu) or other mechanism to save coredumps, or to
> manually configure /proc/sys/kernel/core_pattern.
>
> On centos, I usually set:
> /etc/abrt/abrt-action-save-package-data.conf
> OpenGPGCheck = no
>
> Also, it might be good to install debug symbols, in case you do find a core
> dump now or get one later.
>
> On centos: yum install postgresql10-debuginfo or debuginfo-install
> postgresql10-server
> Make sure the debug symbols exactly match the server version.
>
> Justin
>


Re: dsa_allocate() failure

2018-11-26 Thread Justin Pryzby
Hi, thanks for following through.

On Mon, Nov 26, 2018 at 04:38:35PM +0100, Jakub Glapa wrote:
> I had a look at dmesg and indeed I see something like:
> 
> postgres[30667]: segfault at 0 ip 557834264b16 sp 7ffc2ce1e030
> error 4 in postgres[557833db7000+6d5000]

That's useful, I think "at 0" means a null pointer dereferenced.

Can you check /var/log/messages (or ./syslog or similar) and verify the
timestamp matches the time of the last crash (and not an unrelated crash) ?

The logs might also indicate if the process dumped a core file anywhere. 

I don't know what distribution/OS you're using, but it might be good to install
abrt (RHEL) or apport (ubuntu) or other mechanism to save coredumps, or to
manually configure /proc/sys/kernel/core_pattern.
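
As a purely illustrative example (not necessarily how any server in this thread
is configured), a bare-bones core_pattern setup without abrt/apport could look
like:

# collect cores in a dedicated directory, tagged with executable name and pid
sudo mkdir -p /var/tmp/cores && sudo chmod 1777 /var/tmp/cores
echo '/var/tmp/cores/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern

# the postmaster's core size limit must also be non-zero, e.g. start it with
# pg_ctl --core-files (or raise LimitCORE in the systemd unit)
pg_ctl -D /path/to/data -c start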

On centos, I usually set:
/etc/abrt/abrt-action-save-package-data.conf
OpenGPGCheck = no

Also, it might be good to install debug symbols, in case you do find a core
dump now or get one later.

On centos: yum install postgresql10-debuginfo or debuginfo-install 
postgresql10-server
Make sure the debug symbols exactly match the server version.
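
On Debian/Ubuntu with the PGDG packages the rough equivalent, assuming a
matching -dbgsym package is published for the installed minor version (package
names may vary), would be:

sudo apt-get install postgresql-10-dbgsym
# double-check that server and symbol package versions line up
dpkg -l 'postgresql-10*' | grep '^ii'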

Justin



Re: dsa_allocate() failure

2018-11-23 Thread Justin Pryzby
On Fri, Nov 23, 2018 at 03:31:41PM +0100, Jakub Glapa wrote:
> Hi Justin, I've upgrade to 10.6 but the error still shows up:
> 
> If I set it to max_parallel_workers=0 I also get the error, and my connection
> is being closed (but the server is alive):
> 
> psql db@host as user => set max_parallel_workers=0;

Can you show the plan (explain without analyze) for the nonparallel case?

Also, it looks like the server crashed in that case (even if it restarted
itself quickly).  Can you confirm ?

For example: dmesg |tail might show "postmaster[8582]: segfault [...]" or
similar.  And other clients would've been disconnected.  (For example, you'd
get an error in another, previously-connected session the next time you run:
SELECT 1).

In any case, could you try to find a minimal way to reproduce the problem ?  I
mean, are the dataset and query small and something you can publish, or can you
reproduce with data generated from (for example) generate_series() ?
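
A self-contained starting point for the generate_series() route might look
roughly like this; the table and column names are invented for illustration
(they are not the reporter's schema), the aim being several partitions plus
enough rows to get a parallel plan:

psql -d testdb <<'SQL'
-- toy partitioned table, purely illustrative
CREATE TABLE fa (datetime timestamptz, datasource text, dims jsonb)
    PARTITION BY RANGE (datetime);
CREATE TABLE fa_2018_01 PARTITION OF fa
    FOR VALUES FROM ('2018-01-01') TO ('2018-02-01');
CREATE TABLE fa_2018_02 PARTITION OF fa
    FOR VALUES FROM ('2018-02-01') TO ('2018-03-01');
INSERT INTO fa
SELECT '2018-01-01'::timestamptz + (i || ' minutes')::interval,
       (ARRAY['FA','GA','M'])[1 + i % 3],
       jsonb_build_object('p', i % 100)
FROM generate_series(0, 80000) AS i;
ANALYZE fa;
SQL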

Thanks,
Justin



Re: dsa_allocate() failure

2018-11-23 Thread Jakub Glapa
Hi Justin, I've upgraded to 10.6 but the error still shows up:

psql db@host as user => select version();
                                   version
────────────────────────────────────────────────────────────────────────────
 PostgreSQL 10.6 (Ubuntu 10.6-1.pgdg16.04+1) on x86_64-pc-linux-gnu,
compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609, 64-bit
(1 row)

Time: 110.512 ms

psql db@host as user  => select  from fa where client_id in
() and datetime >= '2018/01/01' and ((dims ? 'p' and dimensions ?
'mcp') or (datasource in ('FA', 'GA'))) and not datasource = 'M' GROUP BY
datasource, dims ->'ct', dimensions ->'mct', dims -> 'p', dims -> 'sp';
ERROR:  XX000: dsa_allocate could not find 7 free pages
CONTEXT:  parallel worker
LOCATION:  dsa_allocate_extended, dsa.c:729
Time: 131400.831 ms (02:11.401)

the above is executed with max_parallel_workers=8.
If I set it to max_parallel_workers=0 I also get the error, and my connection is
being closed (but the server is alive):

psql db@host as user => set max_parallel_workers=0;
SET
Time: 89.542 ms
psql db@host as user => SELECT ;
FATAL:  XX000: dsa_allocate could not find 7 free pages
LOCATION:  dsa_allocate_extended, dsa.c:729
SSL connection has been closed unexpectedly
The connection to the server was lost. Attempting reset: Succeeded.
Time: 200390.466 ms (03:20.390)



--
regards,
Jakub Glapa


On Thu, Nov 22, 2018 at 5:10 PM Justin Pryzby  wrote:

> On Wed, Nov 21, 2018 at 03:26:42PM +0100, Jakub Glapa wrote:
> > Looks like my email didn't match the right thread:
> >
> https://www.postgresql.org/message-id/flat/CAMAYy4%2Bw3NTBM5JLWFi8twhWK4%3Dk_5L4nV5%2BbYDSPu8r4b97Zg%40mail.gmail.com
> > Any chance to get some feedback on this?
>
> In the related thread, it looks like Thomas backpatched a fix to v10, and
> so I
> guess this should be resolved in 10.6, which was released a couple of weeks ago.
>
> https://www.postgresql.org/message-id/CAEepm%3D0QxoUSkFqYbvmxi2eNvvU6BkqH6fTOu4oOzc1MRAT4Dw%40mail.gmail.com
>
> Could you upgrade and check ?
>
> 38763d67784c6563d08dbea5c9f913fa174779b8 in master
>
> |commit ba20d392584cdecc2808fe936448d127f43f2c07
> |Author: Thomas Munro 
> |Date:   Thu Sep 20 15:52:39 2018 +1200
> |
> |Fix segment_bins corruption in dsa.c.
>
> Justin
>


Re: dsa_allocate() failure

2018-11-22 Thread Justin Pryzby
On Wed, Nov 21, 2018 at 03:26:42PM +0100, Jakub Glapa wrote:
> Looks like my email didn't match the right thread:
> https://www.postgresql.org/message-id/flat/CAMAYy4%2Bw3NTBM5JLWFi8twhWK4%3Dk_5L4nV5%2BbYDSPu8r4b97Zg%40mail.gmail.com
> Any chance to get some feedback on this?

In the related thread, it looks like Thomas backpatched a fix to v10, and so I
guess this should be resolved in 10.6, which was released a couple of weeks ago.
https://www.postgresql.org/message-id/CAEepm%3D0QxoUSkFqYbvmxi2eNvvU6BkqH6fTOu4oOzc1MRAT4Dw%40mail.gmail.com

Could you upgrade and check ?

38763d67784c6563d08dbea5c9f913fa174779b8 in master

|commit ba20d392584cdecc2808fe936448d127f43f2c07
|Author: Thomas Munro 
|Date:   Thu Sep 20 15:52:39 2018 +1200
|
|Fix segment_bins corruption in dsa.c.

Justin



Re: dsa_allocate() failure

2018-11-13 Thread Jakub Glapa
Hi, I'm also experiencing the problem: dsa_allocate could not find 7 free
pages CONTEXT: parallel worker

I'm running: PostgreSQL 10.5 (Ubuntu 10.5-1.pgdg16.04+1) on
x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0
20160609, 64-bit

query plan (select statement over a parent table with many partitions):
select ...
from fa
where c_id in () and
datetime >= '2018/01/01'
and ((dims ? 'p' and dims ? 'mcp')
or (datasource in ('FA', 'GA')))
and not datasource = 'm'
GROUP BY datasource, dims ->'ct', dims ->'mcp', dims -> 'p', dims -> 'sp';

Finalize GroupAggregate  (cost=31514757.77..31519357.77 rows=4 width=223)
   Group Key: fa.datasource, ((fa.dims -> 'ct'::text)), ((fa.dims ->
'mcp'::text)), ((fa.dims -> 'p'::text)), ((fa.dims -> 'sp'::text))
   ->  Sort  (cost=31514757.77..31515057.77 rows=12 width=223)
 Sort Key: fa.datasource, ((fa.dims -> 'ct'::text)), ((fa.dims
-> 'mcp'::text)), ((fa.dims -> 'p'::text)), ((fa.dims -> 'sp'::text))
 ->  Gather  (cost=31491634.17..31504634.17 rows=12 width=223)
   Workers Planned: 3
   ->  Partial HashAggregate
(cost=31490634.17..31491634.17 rows=4 width=223)
 Group Key: fa.datasource, (fa.dims ->
'ct'::text), (fa.dims -> 'mcp'::text), (fa.dims -> 'p'::text),
(fa.dims -> 'sp'::text)
 ->  Result  (cost=0.00..31364713.39 rows=5596479 width=175)
   ->  Append  (cost=0.00..31252783.81
rows=5596479 width=659)
 ->  Parallel Seq Scan on fa
(cost=0.00..0.00 rows=1 width=580)
   Filter: ((datetime >=
'2018-01-01 00:00:00+01'::timestamp with time zone) AND
((datasource)::text <> 'M'::text) AND (((dims ? 'p'::text) AND (dims ?
'mcp'::text)) OR ((datasource)::text =
ANY ('{"FA","GA"}'::text[]))) AND (c_id = ANY ('{}'::bigint[])))
 ->  Parallel Bitmap Heap Scan on
fa_10  (cost=1226.36..53641.49 rows=1 width=1290)
   Recheck Cond: (datetime >=
'2018-01-01 00:00:00+01'::timestamp with time zone)
   Filter: (((datasource)::text <>
'M'::text) AND (((dims ? 'p'::text) AND (dims ? 'mcp'::text)) OR
((datasource)::text = ANY (''::bigint[])))
   ->  Bitmap Index Scan on
fa_10_rangestart  (cost=0.00..1226.36 rows=32259 width=0)
 Index Cond: (datetime >=
'2018-01-01 00:00:00+01'::timestamp with time zone)
 ->  Parallel Seq Scan on fa_105
(cost=0.00..11.99 rows=1 width=580)
   Filter: ((datetime >=
'2018-01-01 00:00:00+01'::timestamp with time zone) AND
((datasource)::text <> 'M'::text) AND (((dims ? 'p'::text) AND (dims ?
'mcp'::text)) OR ((datasource)::text =
ANY ('{"FA","GA"}'::text[]))) AND (c_id = ANY (''::bigint[])))
 ->  Parallel Seq Scan on fa_106
(cost=0.00..11.99 rows=1 width=580)
   Filter: ((datetime >=
'2018-01-01 00:00:00+01'::timestamp with time zone) AND
((datasource)::text <> 'M'::text) AND (((dims ? 'p'::text) AND (dims ?
'mcp'::text)) OR ((datasource)::text =
ANY ('{"FA","GA"}'::text[]))) AND (c_id = ANY ('..



--
regards,
Jakub Glapa


Re: dsa_allocate() failure

2018-10-04 Thread Thomas Munro
On Wed, Aug 29, 2018 at 5:48 PM Sand Stone  wrote:
> I attached a query (and its query plan) that caused the crash: "dsa_allocate 
> could not find 13 free pages" on one of the worker nodes. I anonymised the 
> query text a bit.  Interestingly, this time only one (same one) of the nodes 
> is crashing. Since this is a production environment, I cannot get the stack 
> trace. Once turned off parallel execution for this node. The whole query 
> finished just fine. So the parallel query plan is from one of the nodes not 
> crashed, hopefully the same plan would have been executed on the crashed 
> node. In theory, every worker node has the same bits, and very similar data.

I wonder if this was a different symptom of the problem fixed here:

https://www.postgresql.org/message-id/flat/194c0706-c65b-7d81-ab32-2c248c3e2344%402ndquadrant.com

Can you still reproduce it on current master, REL_11_STABLE or REL_10_STABLE?

-- 
Thomas Munro
http://www.enterprisedb.com



Re: dsa_allocate() failure

2018-08-28 Thread Sand Stone
I attached a query (and its query plan) that caused the crash:
"dsa_allocate could not find 13 free pages" on one of the worker nodes. I
anonymised the query text a bit.  Interestingly, this time only one (the same
one) of the nodes is crashing. Since this is a production environment, I
cannot get the stack trace. Once I turned off parallel execution for this
node, the whole query finished just fine. So the parallel query plan is
from one of the nodes that did not crash; hopefully the same plan would have
been executed on the crashed node. In theory, every worker node has the same
bits, and very similar data.

===
psql (10.4)
\dx
                         List of installed extensions
  Name   | Version |   Schema   |            Description
---------+---------+------------+-----------------------------------
 citus   | 7.4-3   | pg_catalog | Citus distributed database
 hll     | 2.10    | public     | type for storing hyperloglog data
 plpgsql | 1.0     | pg_catalog | PL/pgSQL procedural language


On Sat, Aug 25, 2018 at 7:46 AM Sand Stone  wrote:

> >Can you still see the problem with Citus 7.4?
> Hi, Thomas. I actually went back to the cluster with Citus7.4 and
> PG10.4. And modified the parallel param. So far, I haven't seen any
> server crash.
>
> The main difference between crashes observed and no crash, is the set
> of Linux TCP time out parameters (to release the ports faster).
> Unfortunately, I cannot "undo" the Linux params and run the stress
> tests anymore, as this is a multi-million $ cluster and people are
> doing more useful things on it. I will keep an eye on any parallel
> execution issue.
>
>
> On Wed, Aug 15, 2018 at 3:43 PM Thomas Munro
>  wrote:
> >
> > On Thu, Aug 16, 2018 at 8:32 AM, Sand Stone 
> wrote:
> > > Just as a follow up. I tried the parallel execution again (in a stress
> > > test environment). Now the crash seems gone. I will keep an eye on
> > > this for the next few weeks.
> >
> > Thanks for the report.  That's great news, but it'd be good to
> > understand why it was happening.
> >
> > > My theory is that the Citus cluster created and shut down a lot of TCP
> > > connections between coordinator and workers. If running on untuned
> > > Linux machines, the TCP ports might run out.
> >
> > I'm not sure how that's relevant, unless perhaps it causes executor
> > nodes to be invoked in a strange sequence that commit fd7c0fa7 didn't
> > fix?  I wonder if there could be something different about the control
> > flow with custom scans, or something about the way Citus worker nodes
> > invoke plan fragments, or some error path that I failed to consider...
> > It's a clue that all of your worker nodes reliably crashed at the same
> > time on the same/similar queries (presumably distributed query
> > fragments for different shards), making it seem more like a
> > common-or-garden bug rather than some kind of timing-based heisenbug.
> > If you ever manage to reproduce it, an explain plan and a back trace
> > would be very useful.
> >
> > > Of course, I am using "newer" PG10 bits and Citus7.5 this time.
> >
> > Hmm.  There weren't any relevant commits to REL_10_STABLE that I can
> > think of.  And (with the proviso that I know next to nothing about
> > Citus) I just cloned https://github.com/citusdata/citus.git and
> > skimmed through "git diff origin/release-7.4..origin/release-7.5", and
> > nothing is jumping out at me.  Can you still see the problem with
> > Citus 7.4?
> >
> > --
> > Thomas Munro
> > http://www.enterprisedb.com
>
explain(analyze, verbose) select
atimecol as atimecol, count(1) as count, sum(avg) as sum
from
(
select
 partitionid, atimecol as atimecol, avg(sum/count) as avg
from atable
where atimecol >= '7/31/2018' and
atimecol <= '8/28/2018' and
aText = 'Foo' and
(jsoncol ->> 'X') = '-' and id=12345 and acol = 2 and btimecol='7/31/2018' and 
btimecol<'8/29/2018'
group by partitionid, atimecol
) as subquery
group by atimecol;


                                  QUERY PLAN
--------------------------------------------------------------------------------
 HashAggregate  (cost=0.00..0.00 rows=0 width=0) (actual time=535.553..535.553 
rows=0 loops=1)
   Output: remote_scan.atimecol, 
COALESCE((pg_catalog.sum(remote_scan.count))::bigint, '0'::bigint), 
sum(remote_scan.sum)
   Group Key: remote_scan.atimecol
   ->  Custom Scan (Citus Real-Time)  (cost=0.00..0.00 rows=0 width=0) (actual 
time=535.551..535.551 rows=0 loops=1)
   

Re: dsa_allocate() failure

2018-08-15 Thread Sand Stone
Just as a follow up. I tried the parallel execution again (in a stress
test environment). Now the crash seems gone. I will keep an eye on
this for the next few weeks.

My theory is that the Citus cluster created and shut down a lot of TCP
connections between coordinator and workers. If running on untuned
Linux machines, the TCP ports might run out.

Of course, I am using "newer" PG10 bits and Citus7.5 this time.
On Wed, May 23, 2018 at 7:06 AM Sand Stone  wrote:
>
> >> At which commit ID?
> 83fcc615020647268bb129cbf86f7661feee6412 (5/6)
>
> >>do you mean that these were separate PostgreSQL clusters, and they were all 
> >>running the same query and they all crashed like this?
> A few worker nodes, a table is hash partitioned by "aTable.did" by
> Citus, and further partitioned by PG10 by time range on field "ts". As
> far as I could tell, Citus just does a query rewrite, and execute the
> same type of queries to all nodes.
>
> >>so this happened at the same time or at different times?
> At the same time. The queries are simple count and sum queries, here
> is the relevant part from one of the worker nodes:
> 2018-05-23 01:24:01.492 UTC [130536] ERROR:  dsa_allocate could not
> find 7 free pages
> 2018-05-23 01:24:01.492 UTC [130536] CONTEXT:  parallel worker
> STATEMENT:  COPY (SELECT count(1) AS count, sum(worker_column_1) AS
> sum FROM (SELECT subquery.avg AS worker_column_1 FROM (SELECT
> aTable.did, avg((aTable.sum OPERATOR(pg_catalog./)
> (aTable.count)::double precision)) AS avg FROM public.aTable_102117
> aTable WHERE ((aTable.ts OPERATOR(pg_catalog.>=) '2018-04-25
> 00:00:00+00'::timestamp with time zone) AND (aTable.ts
> OPERATOR(pg_catalog.<=) '2018-04-30 00:00:00+00'::timestamp with time
> zone) AND (aTable.v OPERATOR(pg_catalog.=) 12345)) GROUP BY
> aTable.did) subquery) worker_subquery) TO STDOUT WITH (FORMAT binary)
>
>
> >> a parallel worker process
> I think this is more of PG10 parallel bg worker issue. I don't think
> Citus just lets each worker PG server do its own planning.
>
> I will try to do more experiments about this, and see if there is any
> specific query to cause the parallel query execution to fail. As far
> as I can tell, the level of concurrency triggered this issue. That is
> executing 10s of queries as shown on the worker nodes, depending on
> the stats, the PG10 core may or may not spawn more bg workers.
>
> Thanks for your time!
>
>
>
>
>
> On Tue, May 22, 2018 at 9:44 PM, Thomas Munro
>  wrote:
> > On Wed, May 23, 2018 at 4:10 PM, Sand Stone  wrote:
> dsa_allocate could not find 7 free pages
> >> I just saw this error message again on all of my worker nodes (I am using
> >> Citus 7.4 rel). The PG core is my own build of release_10_stable
> >> (10.4) out of GitHub on Ubuntu.
> >
> > At which commit ID?
> >
> > All of your worker nodes... so this happened at the same time or at
> > different times?  I don't know much about Citus -- do you mean that
> > these were separate PostgreSQL clusters, and they were all running the
> > same query and they all crashed like this?
> >
> >> What's the best way to debug this? I am running pre-production tests
> >> for the next few days, so I could gather info. if necessary (I cannot
> >> pinpoint a query to repro this yet, as we have 10K queries running
> >> concurrently).
> >
> > Any chance of an EXPLAIN plan for the query that crashed like this?
> > Do you know if it's using multiple Gather[Merge] nodes and parallel
> > bitmap heap scans?  Was it a regular backend process or a parallel
> > worker process (or a Citus worker process, if that is a thing?) that
> > raised the error?
> >
> > --
> > Thomas Munro
> > http://www.enterprisedb.com



Re: dsa_allocate() failure

2018-05-22 Thread Sand Stone
>>dsa_allocate could not find 7 free pages
I just saw this error message again on all of my worker nodes (I am using
Citus 7.4 rel). The PG core is my own build of release_10_stable
(10.4) out of GitHub on Ubuntu.

What's the best way to debug this? I am running pre-production tests
for the next few days, so I could gather info. if necessary (I cannot
pinpoint a query to repro this yet, as we have 10K queries running
concurrently).




On Mon, Jan 29, 2018 at 1:35 PM, Rick Otten  wrote:
> If I do a "set max_parallel_workers_per_gather=0;" before I run the query in
> that session, it runs just fine.
> If I set it to 2, the query dies with the dsa_allocate error.
>
> I'll use that as a work around until 10.2 comes out.  Thanks!  I have
> something that will help.
>
>
> On Mon, Jan 29, 2018 at 3:52 PM, Thomas Munro
>  wrote:
>>
>> On Tue, Jan 30, 2018 at 5:37 AM, Tom Lane  wrote:
>> > Rick Otten  writes:
>> >> I'm wondering if there is anything I can tune in my PG 10.1 database to
>> >> avoid these errors:
>> >
>> >> $  psql -f failing_query.sql
>> >> psql:failing_query.sql:46: ERROR:  dsa_allocate could not find 7 free
>> >> pages
>> >> CONTEXT:  parallel worker
>> >
>> > Hmm.  There's only one place in the source code that emits that message
>> > text:
>> >
>> > /*
>> >  * Ask the free page manager for a run of pages.  This should always
>> >  * succeed, since both get_best_segment and make_new_segment should
>> >  * only return a non-NULL pointer if it actually contains enough
>> >  * contiguous freespace.  If it does fail, something in our backend
>> >  * private state is out of whack, so use FATAL to kill the process.
>> >  */
>> > if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
>> > elog(FATAL,
>> >  "dsa_allocate could not find %zu free pages", npages);
>> >
>> > Now maybe that comment is being unreasonably optimistic, but it sure
>> > appears that this is supposed to be a can't-happen case, in which case
>> > you've found a bug.
>>
>> This is probably the bug fixed here:
>>
>>
>> https://www.postgresql.org/message-id/E1eQzIl-0004wM-K3%40gemulon.postgresql.org
>>
>> That was back patched, so 10.2 will contain the fix.  The bug was not
>> in dsa.c itself, but in the parallel query code that mixed up DSA
>> areas, corrupting them.  The problem comes up when the query plan has
>> multiple Gather nodes (and a particular execution pattern) -- is that
>> the case here, in the EXPLAIN output?  That seems plausible given the
>> description of a 50-branch UNION.  The only workaround until 10.2
>> would be to reduce max_parallel_workers_per_gather to 0 to prevent
>> parallelism completely for this query.
>>
>> --
>> Thomas Munro
>> http://www.enterprisedb.com
>
>



Re: dsa_allocate() failure

2018-01-29 Thread Thomas Munro
On Tue, Jan 30, 2018 at 5:37 AM, Tom Lane  wrote:
> Rick Otten  writes:
>> I'm wondering if there is anything I can tune in my PG 10.1 database to
>> avoid these errors:
>
>> $  psql -f failing_query.sql
>> psql:failing_query.sql:46: ERROR:  dsa_allocate could not find 7 free pages
>> CONTEXT:  parallel worker
>
> Hmm.  There's only one place in the source code that emits that message
> text:
>
> /*
>  * Ask the free page manager for a run of pages.  This should always
>  * succeed, since both get_best_segment and make_new_segment should
>  * only return a non-NULL pointer if it actually contains enough
>  * contiguous freespace.  If it does fail, something in our backend
>  * private state is out of whack, so use FATAL to kill the process.
>  */
> if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
> elog(FATAL,
>  "dsa_allocate could not find %zu free pages", npages);
>
> Now maybe that comment is being unreasonably optimistic, but it sure
> appears that this is supposed to be a can't-happen case, in which case
> you've found a bug.

This is probably the bug fixed here:

https://www.postgresql.org/message-id/E1eQzIl-0004wM-K3%40gemulon.postgresql.org

That was back patched, so 10.2 will contain the fix.  The bug was not
in dsa.c itself, but in the parallel query code that mixed up DSA
areas, corrupting them.  The problem comes up when the query plan has
multiple Gather nodes (and a particular execution pattern) -- is that
the case here, in the EXPLAIN output?  That seems plausible given the
description of a 50-branch UNION.  The only workaround until 10.2
would be to reduce max_parallel_workers_per_gather to 0 to prevent
parallelism completely for this query.
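
For anyone hitting this before 10.2 ships, the workaround can also be scoped to
a single session rather than the whole cluster; a minimal sketch (database and
file names are placeholders):

# disable parallelism for just this query's session instead of the whole server
PGOPTIONS='-c max_parallel_workers_per_gather=0' psql -d testdb -f failing_query.sql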

-- 
Thomas Munro
http://www.enterprisedb.com



Re: dsa_allocate() failure

2018-01-29 Thread Tom Lane
Rick Otten  writes:
> I'm wondering if there is anything I can tune in my PG 10.1 database to
> avoid these errors:

> $  psql -f failing_query.sql
> psql:failing_query.sql:46: ERROR:  dsa_allocate could not find 7 free pages
> CONTEXT:  parallel worker

Hmm.  There's only one place in the source code that emits that message
text:

/*
 * Ask the free page manager for a run of pages.  This should always
 * succeed, since both get_best_segment and make_new_segment should
 * only return a non-NULL pointer if it actually contains enough
 * contiguous freespace.  If it does fail, something in our backend
 * private state is out of whack, so use FATAL to kill the process.
 */
if (!FreePageManagerGet(segment_map->fpm, npages, &first_page))
elog(FATAL,
 "dsa_allocate could not find %zu free pages", npages);

Now maybe that comment is being unreasonably optimistic, but it sure
appears that this is supposed to be a can't-happen case, in which case
you've found a bug.

cc'ing the DSA authors for comment.

regards, tom lane