Re: [HACKERS] Possible issue with expanded object infrastructure on Postgres 9.6.1
On Tue, Aug 8, 2017 at 8:17 PM, Tom Lanewrote: > Peter Geoghegan writes: > > On Thu, Jan 19, 2017 at 5:45 PM, Peter Geoghegan wrote: > >> A customer is on 9.6.1, and complains of a segfault observed at least > >> 3 times. > > ... > > For the sake of the archives: this now looks very much like the issue > > that Tom just fixed with commit > > 9bf4068cc321a4d44ac54089ab651a49d89bb567. > > Yeah, particularly seeing that $customer noted that some of the > columns involved were UUIDs: > > https://www.postgresql.org/message-id/CAOxz3fqK9Y0GntL8MDoeZ > jy2ot_6lx1yvhay6bd1vykup-i...@mail.gmail.com > > Good to have gotten to the bottom of that one. Too bad it just > missed the train for 9.6.4. > > regards, tom lane > $customer, here. I just want to thank everyone involved for getting to the bottom of this and for your support. Even if it missed the 9.6.4 release, I'm very grateful for your help. We haven't had much of an issue since disabling parallel workers so nothing is harmed by waiting a little longer. Thanks, Justin
Re: [HACKERS] Possible issue with expanded object infrastructure on Postgres 9.6.1
Peter Geogheganwrites: > On Thu, Jan 19, 2017 at 5:45 PM, Peter Geoghegan wrote: >> A customer is on 9.6.1, and complains of a segfault observed at least >> 3 times. > ... > For the sake of the archives: this now looks very much like the issue > that Tom just fixed with commit > 9bf4068cc321a4d44ac54089ab651a49d89bb567. Yeah, particularly seeing that $customer noted that some of the columns involved were UUIDs: https://www.postgresql.org/message-id/caoxz3fqk9y0gntl8mdoezjy2ot_6lx1yvhay6bd1vykup-i...@mail.gmail.com Good to have gotten to the bottom of that one. Too bad it just missed the train for 9.6.4. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Possible issue with expanded object infrastructure on Postgres 9.6.1
On Thu, Jan 19, 2017 at 5:45 PM, Peter Geogheganwrote: > A customer is on 9.6.1, and complains of a segfault observed at least > 3 times. > I can use GDB to get details of the instruction pointer that appeared > in the kernel trap error, which shows a function from the expanded > value representation infrastructure: > > (gdb) info symbol 0x55fcf08b0491 > EOH_get_flat_size + 1 in section .text of /usr/lib/postgresql/9.6/bin/postgres > (gdb) info symbol 0x55fcf08b0490 > EOH_get_flat_size in section .text of /usr/lib/postgresql/9.6/bin/postgres > (gdb) disassemble 0x55fcf08b0490 > Dump of assembler code for function EOH_get_flat_size: >0x55fcf08b0490 <+0>: push %rbp >0x55fcf08b0491 <+1>: mov0x8(%rdi),%rax >0x55fcf08b0495 <+5>: mov%rsp,%rbp >0x55fcf08b0498 <+8>: pop%rbp >0x55fcf08b0499 <+9>: mov(%rax),%rax >0x55fcf08b049c <+12>: jmpq *%rax > End of assembler dump. For the sake of the archives: this now looks very much like the issue that Tom just fixed with commit 9bf4068cc321a4d44ac54089ab651a49d89bb567. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Possible issue with expanded object infrastructure on Postgres 9.6.1
On Thu, Feb 16, 2017 at 11:45 AM, Justin Workmanwrote: >> It would help to know the data types of the columns involved in this >> query; but just eyeballing it, it doesn't look like it involves any >> array operations, so it's pretty hard to believe that the expanded-object >> code could have gotten invoked intentionally. (The mere presence of >> an array column somewhere in the vicinity would not do that; you'd >> need to invoke an array-ish operation, or at least pass the array into >> a plpgsql function.) >> If I had to bet on the basis of this much info, I would bet that the >> parallel-query infrastructure is dropping the ball somewhere and >> transmitting a corrupted datum that accidentally looks like it is >> an expanded-object reference. >> If $customer wants a quick fix, I'd suggest seeing whether disabling >> parallel query makes the problem go away. That might be a good first >> step anyway, just to narrow down where the problem lies. >> regards, tom lane > > > Hi Tom, > > My name is Justin, and I am $customer as it were. As Peter explained, we > haven't seen the segfaults anymore since disabling parallel queries. This > works as a quick fix and is much appreciated! If you would still like to get > to the bottom of this, I am willing to help out with more information as > needed. My knowledge of PG internals is extremely limited so I don't know > how much help I can be, but we'd like to see this resolved beyond the quick > fix, or at least understand why it happened. > Thanks Justin for the confirmation. Is it possible for you to get a standalone test case with which we can reproduce the problem? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Possible issue with expanded object infrastructure on Postgres 9.6.1
> > It would help to know the data types of the columns involved in this > query; but just eyeballing it, it doesn't look like it involves any > array operations, so it's pretty hard to believe that the expanded-object > code could have gotten invoked intentionally. (The mere presence of > an array column somewhere in the vicinity would not do that; you'd > need to invoke an array-ish operation, or at least pass the array into > a plpgsql function.) > If I had to bet on the basis of this much info, I would bet that the > parallel-query infrastructure is dropping the ball somewhere and > transmitting a corrupted datum that accidentally looks like it is > an expanded-object reference. > If $customer wants a quick fix, I'd suggest seeing whether disabling > parallel query makes the problem go away. That might be a good first > step anyway, just to narrow down where the problem lies. > regards, tom lane Hi Tom, My name is Justin, and I am $customer as it were. As Peter explained, we haven't seen the segfaults anymore since disabling parallel queries. This works as a quick fix and is much appreciated! If you would still like to get to the bottom of this, I am willing to help out with more information as needed. My knowledge of PG internals is extremely limited so I don't know how much help I can be, but we'd like to see this resolved beyond the quick fix, or at least understand why it happened. album_photo_assignments.id, album_photo_assignments.album_id, album_photo_assignments.photo_id and albums.id are all UUID columns. albums.deleted_at is a timestamp. Thanks so much for your time, Justin Workman
Re: [HACKERS] Possible issue with expanded object infrastructure on Postgres 9.6.1
On Thu, Jan 19, 2017 at 6:39 PM, Tom Lanewrote: > If $customer wants a quick fix, I'd suggest seeing whether disabling > parallel query makes the problem go away. That might be a good first > step anyway, just to narrow down where the problem lies. I no longer work at Heroku, and so can do very little about it now, but I found out privately that disabling parallel query made the problem go away. It hasn't returned as of today. Parallel query was disabled this whole time. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Possible issue with expanded object infrastructure on Postgres 9.6.1
On Thu, Jan 19, 2017 at 9:39 PM, Tom Lanewrote: > If I had to bet on the basis of this much info, I would bet that the > parallel-query infrastructure is dropping the ball somewhere and > transmitting a corrupted datum that accidentally looks like it is > an expanded-object reference. There's code in datumSerialize() that is intended to do the right thing with expanded-object references. But it could have a bug, of course. That function gets called from SerializeParamList(), which might be another place to look for problems if the crash can only be reproduced when there are statement parameters. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Possible issue with expanded object infrastructure on Postgres 9.6.1
Peter Geogheganwrites: > A customer is on 9.6.1, and complains of a segfault observed at least > 3 times. In all cases, the logs look like this: > ... > I can use GDB to get details of the instruction pointer that appeared > in the kernel trap error, which shows a function from the expanded > value representation infrastructure: It would help to know the data types of the columns involved in this query; but just eyeballing it, it doesn't look like it involves any array operations, so it's pretty hard to believe that the expanded-object code could have gotten invoked intentionally. (The mere presence of an array column somewhere in the vicinity would not do that; you'd need to invoke an array-ish operation, or at least pass the array into a plpgsql function.) If I had to bet on the basis of this much info, I would bet that the parallel-query infrastructure is dropping the ball somewhere and transmitting a corrupted datum that accidentally looks like it is an expanded-object reference. If $customer wants a quick fix, I'd suggest seeing whether disabling parallel query makes the problem go away. That might be a good first step anyway, just to narrow down where the problem lies. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Possible issue with expanded object infrastructure on Postgres 9.6.1
A customer is on 9.6.1, and complains of a segfault observed at least 3 times. In all cases, the logs look like this: Jan 11 16:11:07 ip-10-0-118-82 kernel: [41913.530453] traps: postgres[6561] general protection ip:55fcf08b0491 sp:7ffc17dfa650 error:0 in postgres[55fcf0557000+638000] Jan 11 16:11:07 ip-10-0-118-82 postgresql[7]: [4-1] LOG: server process (PID 643) was terminated by signal 11: Segmentation fault Jan 11 16:11:07 ip-10-0-118-82 postgresql[7]: [4-2] DETAIL: Failed process was running: SELECT COUNT(count_column) FROM (SELECT 1 AS count_column FROM "albums" INNER JOIN "album_photo_assignments" ON "albums"."id" = "album_photo_assignments"."album_id" WHERE "albums"."deleted_at" IS NULL AND "album_photo_assignments"."photo_id" = $1 LIMIT 1000 OFFSET 0) subquery_for_count Jan 11 16:11:07 ip-10-0-118-82 postgresql[7]: [5-1] LOG: terminating any other active server processes Jan 11 16:11:07 ip-10-0-118-82 postgresql[80]: [6-1] WARNING: terminating connection because of crash of another server process Jan 11 16:11:07 ip-10-0-118-82 postgresql[80]: [6-2] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. Jan 11 16:11:07 ip-10-0-118-82 postgresql[80]: [6-3] HINT: In a moment you should be able to reconnect to the database and repeat your command. The "general protection ip:55fcf08b0491 sp:7ffc17dfa650 error:0 in postgres[55fcf0557000+638000]" details are identical in all 3 segfaults observed. Note that the kernel in question is "4.4.0-57-generic #78~14.04.1-Ubuntu". These crashes were days apart, but I have yet to isolate the problem. The customer tells me that this is a representative query plan: explain SELECT COUNT(count_column) FROM (SELECT 1 AS count_column FROM "albums" INNER JOIN "album_photo_assignments" ON "albums"."id" = "album_photo_assignments"."album_id" WHERE "albums"."deleted_at" IS NULL AND "album_photo_assignments"."photo_id" = 'fd0bcbb3-9c4c-4036-954a-cf01935d004c' LIMIT 1000 OFFSET 0) subquery_for_count ; QUERY PLAN ─ Aggregate (cost=16419.94..16419.94 rows=1 width=8) -> Limit (cost=1000.06..16419.94 rows=1 width=4) -> Nested Loop (cost=1000.06..16419.94 rows=1 width=4) -> Gather (cost=1000.00..16415.87 rows=1 width=16) Workers Planned: 1 -> Parallel Seq Scan on album_photo_assignments (cost=0.00..15415.77 rows=1 width=16) Filter: (photo_id = 'fd0bcbb3-9c4c-4036-954a-cf01935d004c'::uuid) -> Index Scan using albums_pkey on albums (cost=0.06..4.06 rows=1 width=16) Index Cond: (id = album_photo_assignments.album_id) Filter: (deleted_at IS NULL) (10 rows) Note that this isn't a prepared statement, unlike the problematic query we see from the logs. And, I've not yet managed to verify that the problem case involved execution of this particular "$1 substitute constant", since that isn't displayed in the logs. So, while this query plan might well be representative of the problem, it may also not be. I can use GDB to get details of the instruction pointer that appeared in the kernel trap error, which shows a function from the expanded value representation infrastructure: (gdb) info symbol 0x55fcf08b0491 EOH_get_flat_size + 1 in section .text of /usr/lib/postgresql/9.6/bin/postgres (gdb) info symbol 0x55fcf08b0490 EOH_get_flat_size in section .text of /usr/lib/postgresql/9.6/bin/postgres (gdb) disassemble 0x55fcf08b0490 Dump of assembler code for function EOH_get_flat_size: 0x55fcf08b0490 <+0>: push %rbp 0x55fcf08b0491 <+1>: mov0x8(%rdi),%rax 0x55fcf08b0495 <+5>: mov%rsp,%rbp 0x55fcf08b0498 <+8>: pop%rbp 0x55fcf08b0499 <+9>: mov(%rax),%rax 0x55fcf08b049c <+12>: jmpq *%rax End of assembler dump. (I believe that Linux outputs VMA-wise addresses in the kernel trap error quoted above, an address from the userland program (Postgres) address space, so GDB does in fact display interesting disassembly -- the documentation on this seems extremely limited.) Note that there is a text array column in the table "albums". I note that there have been fixes in this area already, that are not in 9.6.1 (commit 77cd0dc, for one). I know next to nothing about this code, and would appreciate some guidance. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers