RE: [Incident report]Backend process crashed when executing 2pc transaction

2019-11-28 Thread Ranier Vilela
Marco wrote:
> One interesting thing is the prepared transaction name generated by
> the coordinator, which follows the form:
> citus_<node id>_<process id>_<server-wide transaction number>_<transaction number in session>.
> The server-wide transaction number is a 64-bit counter that is kept in
> shared memory and starts at 1. That means that over 4 billion
> (4207001212) transactions happened on the coordinator since the server
> started, which quite possibly resulted in 4 billion prepared
> transactions on this particular server. I'm wondering if some counter
> is overflowing.

Amit wrote:
> Interesting.  This does kind of get us closer to figuring out what
> might have gone wrong, but it's hard to tell without the core dump at hand.

If something is corrupting memory, even if only rarely, it would be
interesting to consider all the possibilities. The MemSet call at line
785 of twophase.c is flagged with a warning here: the size of the buffer
passed to the macro (its "_vstart" variable) is not a multiple of
sizeof(long), so it may be filling more memory than it should.
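As a rough illustration of the concern (a hypothetical sketch, not
PostgreSQL's actual MemSet macro, which falls back to plain memset()
unless its alignment and length checks pass), a zeroing loop that writes
sizeof(long) bytes at a time can run past the end of a buffer whose size
is not a multiple of sizeof(long):

    #include <stddef.h>

    /*
     * Hypothetical long-at-a-time zeroing loop, only to illustrate the
     * concern above.  With len = 10 and sizeof(long) = 8, the second
     * iteration stores 8 bytes starting at offset 8 and so writes 6
     * bytes past the end of the buffer.
     */
    static void
    zero_longwise(void *start, size_t len)
    {
        long   *p = (long *) start;
        long   *stop = (long *) ((char *) start + len);

        while (p < stop)
            *p++ = 0;       /* overruns when len % sizeof(long) != 0 */
    }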

Ranier Vilela




Re: [Incident report]Backend process crashed when executing 2pc transaction

2019-11-28 Thread Amit Langote
Hi Marco,

On Thu, Nov 28, 2019 at 5:02 PM Marco Slot  wrote:
>
> On Thu, Nov 28, 2019 at 6:18 AM Amit Langote  wrote:
> > Interesting.  Still, I think you'd be in a better position than anyone
> > else to come up with reproduction steps for vanilla PostgreSQL by
> > analyzing the stack trace if and when the crash next occurs (or using
> > the existing core dump).  It's hard to tell by only guessing what may
> > have gone wrong when there is external code involved, especially
> > something like Citus that hooks into many points within vanilla
> > PostgreSQL.
>
> To clarify: In a Citus cluster you typically have a coordinator which
> contains the "distributed tables" and one or more workers which
> contain the data. All are PostgreSQL servers with the citus extension.
> The coordinator uses every available hook in PostgreSQL to make the
> distributed tables behave like regular tables. Any crash on the
> coordinator is likely to be attributable to Citus, because most of the
> code that is exercised is Citus code. The workers are used as regular
> PostgreSQL servers with the coordinator acting as a regular client. On
> the worker, the ProcessUtility hook will just pass on the arguments to
> standard_ProcessUtility without any processing. The crash happened on
> a worker.

Thanks for clarifying.

> One interesting thing is the prepared transaction name generated by
> the coordinator, which follows the form:
> citus_<node id>_<process id>_<server-wide transaction number>_<transaction number in session>.
> The server-wide transaction number is a 64-bit counter that is kept in
> shared memory and starts at 1. That means that over 4 billion
> (4207001212) transactions happened on the coordinator since the server
> started, which quite possibly resulted in 4 billion prepared
> transactions on this particular server. I'm wondering if some counter
> is overflowing.

Interesting.  This does kind of get us closer to figuring out what
might have gone wrong, but it's hard to tell without the core dump at hand.

Thanks,
Amit




Re: [Incident report]Backend process crashed when executing 2pc transaction

2019-11-28 Thread Marco Slot
On Thu, Nov 28, 2019 at 6:18 AM Amit Langote  wrote:
> Interesting.  Still, I think you'd be in a better position than anyone
> else to come up with reproduction steps for vanilla PostgreSQL by
> analyzing the stack trace if and when the crash next occurs (or using
> the existing core dump).  It's hard to tell by only guessing what may
> have gone wrong when there is external code involved, especially
> something like Citus that hooks into many points within vanilla
> PostgreSQL.

To clarify: In a Citus cluster you typically have a coordinator which
contains the "distributed tables" and one or more workers which
contain the data. All are PostgreSQL servers with the citus extension.
The coordinator uses every available hook in PostgreSQL to make the
distributed tables behave like regular tables. Any crash on the
coordinator is likely to be attributable to Citus, because most of the
code that is exercised is Citus code. The workers are used as regular
PostgreSQL servers with the coordinator acting as a regular client. On
the worker, the ProcessUtility hook will just pass on the arguments to
standard_ProcessUtility without any processing. The crash happened on
a worker.
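As a minimal sketch of that worker-side pass-through (hypothetical code
using the PostgreSQL 10 hook signature visible in the backtrace, not the
actual Citus implementation):

    #include "postgres.h"
    #include "fmgr.h"
    #include "tcop/utility.h"

    PG_MODULE_MAGIC;

    /* previously installed hook, if any (e.g. pg_stat_statements) */
    static ProcessUtility_hook_type prev_ProcessUtility = NULL;

    /*
     * Hypothetical pass-through ProcessUtility hook: there is nothing
     * distributed to do on a worker, so the statement is handed straight
     * to the next hook or to standard_ProcessUtility.
     */
    static void
    passthrough_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
                               ProcessUtilityContext context,
                               ParamListInfo params,
                               QueryEnvironment *queryEnv,
                               DestReceiver *dest, char *completionTag)
    {
        if (prev_ProcessUtility)
            prev_ProcessUtility(pstmt, queryString, context, params,
                                queryEnv, dest, completionTag);
        else
            standard_ProcessUtility(pstmt, queryString, context, params,
                                    queryEnv, dest, completionTag);
    }

    void
    _PG_init(void)
    {
        prev_ProcessUtility = ProcessUtility_hook;
        ProcessUtility_hook = passthrough_ProcessUtility;
    }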

One interesting thing is the prepared transaction name generated by
the coordinator, which follows the form:
citus_<node id>_<process id>_<server-wide transaction number>_<transaction number in session>.
The server-wide transaction number is a 64-bit counter that is kept in
shared memory and starts at 1. That means that over 4 billion
(4207001212) transactions happened on the coordinator since the server
started, which quite possibly resulted in 4 billion prepared
transactions on this particular server. I'm wondering if some counter
is overflowing.
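For instance (a hypothetical illustration of the kind of wrap-around
being suspected, with invented names, not Citus code), a 64-bit counter
that is anywhere stored in or formatted as a 32-bit integer silently
wraps once it passes UINT32_MAX:

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* value taken from the gid in the backtrace */
        uint64_t server_wide_counter = UINT64_C(4207001212);

        /* still fine: below UINT32_MAX (4294967295) */
        printf("as 32-bit now:   %u\n", (uint32_t) server_wide_counter);

        /* roughly 88 million transactions later, the 32-bit view wraps */
        server_wide_counter += UINT64_C(100000000);
        printf("as 32-bit later: %u\n", (uint32_t) server_wide_counter);

        return 0;
    }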

cheers,
Marco




Re: [Incident report]Backend process crashed when executing 2pc transaction

2019-11-27 Thread Amit Langote
On Thu, Nov 28, 2019 at 2:00 PM LIANGBO  wrote:
>
> Hello:
>
> > Have you considered *also* reporting this to the Citus developers? While
> > the crash seems to have occurred in the core PostgreSQL code, they may
> > have a better chance of reproducing it, if it can be reproduced at all.
>
> I've reported this issue to the Citus community and received the reply
> "Just a note that this appears to be a bug in Postgres 2PC code."
> https://github.com/citusdata/citus/issues/3228

Interesting.  Still, I think you'd be in a better position than anyone
else to come up with reproduction steps for vanilla PostgreSQL by
analyzing the stack trace if and when the crash next occurs (or using
the existing core dump).  It's hard to tell by only guessing what may
have gone wrong when there is external code involved, especially
something like Citus that hooks into many points within vanilla
PostgreSQL.

Thanks,
Amit




Re: [Incident report]Backend process crashed when executing 2pc transaction

2019-11-27 Thread Michael Paquier
On Thu, Nov 28, 2019 at 01:24:00PM +0900, Amit Langote wrote:
> Have you considered *also* reporting this to the Citus developers? While
> the crash seems to have occurred in the core PostgreSQL code, they may
> have a better chance of reproducing it, if it can be reproduced at all.

Hard to fully conclude with the information at hand.  Still, if you
look at the backtrace, it complains about readRecordBuf having already
been freed, and that buffer is freed only if it is not NULL and only
when the reader itself is freed.  The thing is that this area is used
only as a temporary buffer for the record being read, which may
optionally get extended.  Please note as well that the stack trace
mentions multi_ProcessUtility(), which is not Postgres code.  So my gut
actually tells me that this is a Citus-only bug, and that there is an
issue with some memory context cleanup in a xact callback or something
similar.  Just a guess, but this could explain why the memory area of
readRecordBuf just went magically away.
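To make that guess concrete (a hypothetical, self-contained sketch using
plain malloc/free; none of this is actual Postgres or Citus code), if a
cleanup path frees the buffer behind the reader's back without clearing
the pointer, the reader's own teardown then frees the same pointer again,
which is exactly the kind of double free glibc's malloc_printerr aborts on:

    #include <stdlib.h>

    typedef struct FakeXLogReader
    {
        char   *readRecordBuf;  /* stands in for the reader's record buffer */
    } FakeXLogReader;

    /* premature free, e.g. from an overeager xact-callback cleanup */
    static void
    overeager_cleanup(FakeXLogReader *reader)
    {
        free(reader->readRecordBuf);
        /* pointer is not reset to NULL, so the reader still thinks it owns it */
    }

    /* mirrors the shape of XLogReaderFree: free the buffer if set, then the reader */
    static void
    fake_reader_free(FakeXLogReader *reader)
    {
        if (reader->readRecordBuf)
            free(reader->readRecordBuf);    /* second free -> glibc abort */
        free(reader);
    }

    int
    main(void)
    {
        FakeXLogReader *reader = malloc(sizeof(FakeXLogReader));

        reader->readRecordBuf = malloc(8192);
        overeager_cleanup(reader);
        fake_reader_free(reader);
        return 0;
    }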

If you can produce a test case with just Postgres, that's another
story of course, and if it were a bug in Postgres, I would imagine
that a simple pgbench test running a lot of 2PC transactions in
parallel may be able to reproduce it after some time.
--
Michael




Re: [Incident report]Backend process crashed when executing 2pc transaction

2019-11-27 Thread Amit Langote
Hello,

On Wed, Nov 27, 2019 at 8:59 PM LIANGBO  wrote:
> I've encountered the following problem in our production environment. We
> tried to reproduce it, but because of its low probability of occurrence we
> have not been able to.
> 1. Phenomenon
> A backend process crashed when executing a 2PC transaction in Citus.
>
> 2. Occurrence condition
> Distributed transaction in business SQL
>
> PostgreSQL:10.7
> citus:7.4.1
> OS:RHEL6.3
>
> 3.2 stacktrace
>
> (gdb) bt
> #0  0x00369e232625 in raise () from /lib64/libc.so.6
> #1  0x00369e233e05 in abort () from /lib64/libc.so.6
> #2  0x00369e270537 in __libc_message () from /lib64/libc.so.6
> #3  0x00369e275f4e in malloc_printerr () from /lib64/libc.so.6
> #4  0x00369e278cf0 in _int_free () from /lib64/libc.so.6
> #5  0x004ff947 in XLogReaderFree (state=0x1a403a8) at 
> xlogreader.c:141
> #6  0x004e4387 in XlogReadTwoPhaseData (lsn=32886947137584, 
> buf=0x7fff4a40ec38, len=0x0) at twophase.c:1341
> #7  0x004e5849 in FinishPreparedTransaction (gid=0x19d7830 
> "citus_0_35791_4207001212_1287199", isCommit=1 '\001') at twophase.c:1411
> #8  0x0072d601 in standard_ProcessUtility (pstmt=0x19d7ba0, 
> queryString=0x19d6e48 "COMMIT PREPARED 'citus_0_35791_4207001212_1287199'", 
> context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
> queryEnv=0x0, dest=0x19d7c80, completionTag=0x7fff4a40f260 "") at 
> utility.c:460
> #9  0x7f63a1ae97e1 in multi_ProcessUtility (pstmt=0x19d7ba0, 
> queryString=0x19d6e48 "COMMIT PREPARED 'citus_0_35791_4207001212_1287199'", 
> context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
> queryEnv=0x0, dest=0x19d7c80, completionTag=0x7fff4a40f260 "") at 
> executor/multi_utility.c:254
> #10 0x7f63a11e1178 in pgss_ProcessUtility (pstmt=0x19d7ba0, 
> queryString=0x19d6e48 "COMMIT PREPARED 'citus_0_35791_4207001212_1287199'", 
> context=PROCESS_UTILITY_TOPLEVEL, params=0x0,

Have you considered *also* reporting this to the Citus developers? While
the crash seems to have occurred in the core PostgreSQL code, they may
have a better chance of reproducing it, if it can be reproduced at all.

Thanks,
Amit