RE: [Incident report] Backend process crashed when executing 2pc transaction

Marco wrote:
> One interesting thing is the prepared transaction name generated by
> the coordinator, which follows the form:
> citus_<group id>_<process id>_<transaction number>_<connection number in session>.
> The server-wide transaction number is a 64-bit counter that is kept in
> shared memory and starts at 1. That means that over 4 billion
> (4207001212) transactions happened on the coordinator since the
> server started, which quite possibly resulted in 4 billion prepared
> transactions on this particular server. I'm wondering if some counter
> is overflowing.

Amit wrote:
> Interesting. This does kind of get us closer to figuring out what
> might have gone wrong, but hard to tell without the core dump at hand.

If something is corrupting memory, even if only rarely, it would be interesting to consider all the possibilities. The MemSet call at line 785 of twophase.c is flagged with an alert: the size of the "_vstart" buffer is not a multiple of the size of long, so it may be filling more bytes than it should.

Ranier Vilela
Re: [Incident report] Backend process crashed when executing 2pc transaction

Hi Marco,

On Thu, Nov 28, 2019 at 5:02 PM Marco Slot wrote:
> On Thu, Nov 28, 2019 at 6:18 AM Amit Langote wrote:
> > Interesting. Still, I think you'd be in a better position than
> > anyone else to come up with reproduction steps for vanilla
> > PostgreSQL by analyzing the stack trace if and when the crash next
> > occurs (or using the existing core dump). It's hard to tell by only
> > guessing what may have gone wrong when there is external code
> > involved, especially something like Citus that hooks into many
> > points within vanilla PostgreSQL.
>
> To clarify: In a Citus cluster you typically have a coordinator which
> contains the "distributed tables" and one or more workers which
> contain the data. All are PostgreSQL servers with the citus extension.
> The coordinator uses every available hook in PostgreSQL to make the
> distributed tables behave like regular tables. Any crash on the
> coordinator is likely to be attributable to Citus, because most of the
> code that is exercised is Citus code. The workers are used as regular
> PostgreSQL servers with the coordinator acting as a regular client. On
> the worker, the ProcessUtility hook will just pass on the arguments to
> standard_ProcessUtility without any processing. The crash happened on
> a worker.

Thanks for clarifying.

> One interesting thing is the prepared transaction name generated by
> the coordinator, which follows the form:
> citus_<group id>_<process id>_<transaction number>_<connection number in session>.
> The server-wide transaction number is a 64-bit counter that is kept in
> shared memory and starts at 1. That means that over 4 billion
> (4207001212) transactions happened on the coordinator since the
> server started, which quite possibly resulted in 4 billion prepared
> transactions on this particular server. I'm wondering if some counter
> is overflowing.

Interesting. This does kind of get us closer to figuring out what might have gone wrong, but hard to tell without the core dump at hand.

Thanks,
Amit
Re: [Incident report] Backend process crashed when executing 2pc transaction

On Thu, Nov 28, 2019 at 6:18 AM Amit Langote wrote:
> Interesting. Still, I think you'd be in a better position than anyone
> else to come up with reproduction steps for vanilla PostgreSQL by
> analyzing the stack trace if and when the crash next occurs (or using
> the existing core dump). It's hard to tell by only guessing what may
> have gone wrong when there is external code involved, especially
> something like Citus that hooks into many points within vanilla
> PostgreSQL.

To clarify: In a Citus cluster you typically have a coordinator, which contains the "distributed tables", and one or more workers, which contain the data. All are PostgreSQL servers with the citus extension. The coordinator uses every available hook in PostgreSQL to make the distributed tables behave like regular tables. Any crash on the coordinator is likely to be attributable to Citus, because most of the code that is exercised is Citus code. The workers are used as regular PostgreSQL servers, with the coordinator acting as a regular client. On the worker, the ProcessUtility hook will just pass on the arguments to standard_ProcessUtility without any processing. The crash happened on a worker.

One interesting thing is the prepared transaction name generated by the coordinator, which follows the form: citus_<group id>_<process id>_<transaction number>_<connection number in session>. The server-wide transaction number is a 64-bit counter that is kept in shared memory and starts at 1. That means that over 4 billion (4207001212) transactions happened on the coordinator since the server started, which quite possibly resulted in 4 billion prepared transactions on this particular server. I'm wondering if some counter is overflowing.

cheers,
Marco
Re: [Incident report] Backend process crashed when executing 2pc transaction

On Thu, Nov 28, 2019 at 2:00 PM LIANGBO wrote:
> Hello:
>
> > Have you considered *also* reporting this to Citus developers,
> > because while the crash seems to have occurred in the core
> > PostgreSQL code they may have a better chance of reproducing this
> > if at all.
>
> I've sent this issue to the citus community, and then received the
> reply "Just a note that this appears to be a bug in Postgres 2PC
> code.".
> https://github.com/citusdata/citus/issues/3228

Interesting. Still, I think you'd be in a better position than anyone else to come up with reproduction steps for vanilla PostgreSQL by analyzing the stack trace if and when the crash next occurs (or using the existing core dump). It's hard to tell by only guessing what may have gone wrong when there is external code involved, especially something like Citus that hooks into many points within vanilla PostgreSQL.

Thanks,
Amit
Re: [Incident report] Backend process crashed when executing 2pc transaction

On Thu, Nov 28, 2019 at 01:24:00PM +0900, Amit Langote wrote:
> Have you considered *also* reporting this to Citus developers, because
> while the crash seems to have occurred in the core PostgreSQL code
> they may have a better chance of reproducing this if at all.

Hard to fully conclude with the information at hand. Still, if you look at the backtrace, it complains about readRecordBuf being already free'd, which is something that happens only if it is not NULL and only when freeing the reader. The thing is that this area is used only as a temporary buffer for a record being read, which may optionally get extended. Please note as well that the stack trace mentions multi_ProcessUtility(), which is not Postgres code. So my gut actually tells me that this is a Citus-only bug, and that there is an issue with some memory context cleanup in a xact callback or such. Just a guess, but this could explain why the memory area of readRecordBuf just went magically away.

If you can produce a test case with just Postgres, that's another story of course, and if it were a bug in Postgres, I would imagine that a simple pgbench test running a lot of 2PC transactions in parallel may be able to reproduce it after some time.
--
Michael
Re: [Incident report] Backend process crashed when executing 2pc transaction

Hello,

On Wed, Nov 27, 2019 at 8:59 PM LIANGBO wrote:
> I've met the following problem in our product environment. We tried to
> reproduce the problem, but because of the low probability of
> occurrence, we could not reproduce it.
>
> 1. phenomenon
> Backend process crashed when executing 2pc transaction in citus.
>
> 2. Occurrence condition
> Distributed transaction in business SQL
>
> PostgreSQL: 10.7
> citus: 7.4.1
> OS: RHEL6.3
>
> 3.2 stacktrace
>
> (gdb) bt
> #0  0x00369e232625 in raise () from /lib64/libc.so.6
> #1  0x00369e233e05 in abort () from /lib64/libc.so.6
> #2  0x00369e270537 in __libc_message () from /lib64/libc.so.6
> #3  0x00369e275f4e in malloc_printerr () from /lib64/libc.so.6
> #4  0x00369e278cf0 in _int_free () from /lib64/libc.so.6
> #5  0x004ff947 in XLogReaderFree (state=0x1a403a8) at xlogreader.c:141
> #6  0x004e4387 in XlogReadTwoPhaseData (lsn=32886947137584,
>     buf=0x7fff4a40ec38, len=0x0) at twophase.c:1341
> #7  0x004e5849 in FinishPreparedTransaction (gid=0x19d7830
>     "citus_0_35791_4207001212_1287199", isCommit=1 '\001') at
>     twophase.c:1411
> #8  0x0072d601 in standard_ProcessUtility (pstmt=0x19d7ba0,
>     queryString=0x19d6e48 "COMMIT PREPARED
>     'citus_0_35791_4207001212_1287199'",
>     context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
>     dest=0x19d7c80, completionTag=0x7fff4a40f260 "") at utility.c:460
> #9  0x7f63a1ae97e1 in multi_ProcessUtility (pstmt=0x19d7ba0,
>     queryString=0x19d6e48 "COMMIT PREPARED
>     'citus_0_35791_4207001212_1287199'",
>     context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
>     dest=0x19d7c80, completionTag=0x7fff4a40f260 "") at
>     executor/multi_utility.c:254
> #10 0x7f63a11e1178 in pgss_ProcessUtility (pstmt=0x19d7ba0,
>     queryString=0x19d6e48 "COMMIT PREPARED
>     'citus_0_35791_4207001212_1287199'",
>     context=PROCESS_UTILITY_TOPLEVEL, params=0x0,

Have you considered *also* reporting this to Citus developers, because while the crash seems to have occurred in the core PostgreSQL code they may have a better chance of reproducing this if at all.

Thanks,
Amit