Re: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core

2017-07-18 Thread K S, Sandhya (Nokia - IN/Bangalore)
Hi Craig,

While testing another scenario (continuous restarts of the postgres server), 
we got many sh-QUIT cores and, along with them, rm-QUIT cores. The rm cores 
point to the rm of the archive file, but we were not able to get the bt because 
the stack is corrupted.

We got below info from gdb:
Core was generated by `rm ./Archive_00020118'.

And also we were able to get this info:
4518 12490  0.0  0.0  11484  1356 ?  Ss  10:59  0:00 postgres: archiver process   archiving 00020118.0028.backup
4518 12704  2.0  0.0   7672  2932 ?  S   11:00  0:00   \_ sh -c rm ./Archive_*; touch ./Archive_"00020118.0028.backup"; exit 0
4518 12707  0.0  0.0    344     4 ?  S   11:00  0:00       \_ rm ./Archive_00020118

In the Postgres configuration file, we have this setting:
archive_command = 'rm ./Archive_*; touch ./Archive_"%f"; exit 0'

So the core was generated while executing this archive command.
You pointed out earlier that the issue might be happening during the archive 
command, and all the evidence for this crash points to the same command.
Are there any suggestions on how to recover from this situation, or on ways to 
debug the issue?
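
For reference, the effect of this archive_command can be written out as a small function. The sketch below is only an illustration in Python: the name `mark_archived` and its `directory` parameter are made up here and are not anything PostgreSQL provides. It deletes every existing `Archive_*` marker and creates an empty marker for the current file, all within one process instead of the separate `sh`, `rm` and `touch` processes that the shell command spawns. Like the original command, it copies nothing anywhere, while PostgreSQL treats a zero exit status from archive_command as "this file is safely archived".

```python
import glob
import os


def mark_archived(wal_name, directory="."):
    """Mimic: rm ./Archive_*; touch ./Archive_"<wal_name>" (hypothetical helper)."""
    # Remove markers left by earlier invocations, like `rm ./Archive_*`.
    for old_marker in glob.glob(os.path.join(directory, "Archive_*")):
        os.remove(old_marker)
    # Create an empty marker for the current file, like `touch`.
    marker = os.path.join(directory, "Archive_" + wal_name)
    with open(marker, "a"):
        pass
    return marker
```

Calling `mark_archived("00020118.0028.backup")` in the archive directory would leave exactly one marker file there, which is the state the shell command aims for.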

Regards,
Sandhya


Re: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core

2017-07-12 Thread K S, Sandhya (Nokia - IN/Bangalore)
Hi Craig,

Here is bt after installing all the missing debuginfo packages.

(gdb) bt
#0  0x00fff7682f18 in do_lookup_x (undef_name=undef_name@entry=0xfff75cece5 
"_Jv_RegisterClasses", new_hash=new_hash@entry=2681263574,
old_hash=old_hash@entry=0xa159b8, ref=0xfff75ceac8, 
result=result@entry=0xa159a0, scope=, i=1, 
version=version@entry=0x0,
flags=flags@entry=1, skip=skip@entry=0x0, type_class=type_class@entry=0, 
undef_map=undef_map@entry=0xfff76a9478) at dl-lookup.c:444
#1  0x00fff76839a0 in _dl_lookup_symbol_x (undef_name=0xfff75cece5 
"_Jv_RegisterClasses", undef_map=0xfff76a9478, ref=0xa15a90,
symbol_scope=0xfff76a9980, version=0x0, type_class=, 
flags=, skip_map=0x0) at dl-lookup.c:833
#2  0x00fff7685730 in elf_machine_got_rel (lazy=1, map=0xfff76a9478) at 
../sysdeps/mips/dl-machine.h:870
#3  elf_machine_runtime_setup (profile=, lazy=1, l=0xfff76a9478) 
at ../sysdeps/mips/dl-machine.h:916
#4  _dl_relocate_object (scope=0xfff76a9980, reloc_mode=, 
consider_profiling=0) at dl-reloc.c:259
#5  0x00fff767ba10 in dl_main (phdr=, 
phdr@entry=0x12040, phnum=, phnum@entry=8,
user_entry=user_entry@entry=0xa15cf0, auxv=) at 
rtld.c:2070
#6  0x00fff7692e3c in _dl_sysdep_start (start_argptr=, 
dl_main=0xfff7679a98 ) at ../elf/dl-sysdep.c:249
#7  0x00fff767d0d8 in _dl_start_final (arg=arg@entry=0xa16410, 
info=info@entry=0xa15d80) at rtld.c:307
#8  0x00fff767d3d8 in _dl_start (arg=0xa16410) at rtld.c:415
#9  0x00fff7679380 in __start () from /lib64/ld.so.1

Please see if this could help in analysing the issue.

Regards,
Sandhya

From: Craig Ringer [mailto:cr...@2ndquadrant.com]
Sent: Friday, July 07, 2017 1:53 PM
To: K S, Sandhya (Nokia - IN/Bangalore) <sandhya@nokia.com>
Cc: pgsql-bugs <pgsql-b...@postgresql.org>; PostgreSQL Hackers 
<pgsql-hackers@postgresql.org>; T, Rasna (Nokia - IN/Bangalore) 
<rasn...@nokia.com>; Itnal, Prakash (Nokia - IN/Bangalore) 
<prakash.it...@nokia.com>
Subject: Re: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core

On 7 July 2017 at 15:41, K S, Sandhya (Nokia - IN/Bangalore) 
<sandhya@nokia.com<mailto:sandhya@nokia.com>> wrote:
Hi Craig,

The scenario is locking and unlocking the system 30 times. During this 
scenario, 5 sh-QUIT cores are generated. GDB points to a different location 
for each of the 5 cores.
I have attached the output for 2 such instances.


You seem to be missing debug symbols. Install appropriate debuginfo packages.


--
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core

2017-07-07 Thread K S, Sandhya (Nokia - IN/Bangalore)
Hi Craig,

The scenario is locking and unlocking the system 30 times. During this 
scenario, 5 sh-QUIT cores are generated. GDB points to a different location 
for each of the 5 cores.
I have attached the output for 2 such instances.

Regards,
Sandhya

From: Craig Ringer [mailto:cr...@2ndquadrant.com]
Sent: Friday, July 07, 2017 12:55 PM
To: K S, Sandhya (Nokia - IN/Bangalore) <sandhya@nokia.com>
Cc: pgsql-bugs <pgsql-b...@postgresql.org>; PostgreSQL Hackers 
<pgsql-hackers@postgresql.org>; T, Rasna (Nokia - IN/Bangalore) 
<rasn...@nokia.com>; Itnal, Prakash (Nokia - IN/Bangalore) 
<prakash.it...@nokia.com>
Subject: Re: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core

On 7 July 2017 at 15:10, K S, Sandhya (Nokia - IN/Bangalore) 
<sandhya@nokia.com<mailto:sandhya@nokia.com>> wrote:
Hi Craig,

You were right about the restore_command.

This all makes sense then.

PostgreSQL sends SIGQUIT for immediate shutdown to its children. So the 
restore_command would get signalled too.

Can't immediately explain the exit code, and SIGQUIT should _not_ generate a 
core file. Can you show the result of attaching 'gdb' to the core file and 
running 'bt full' ?
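
For context, signal(7) on Linux lists the default action of SIGQUIT as "terminate and dump core", so a command started through `sh -c` that has not changed the disposition would be expected to leave a core file whenever the core size limit allows one. The following is a minimal sketch for reproducing the delivery outside PostgreSQL. It rests on two assumptions: that the signal reaches the whole process group of the shell, and that a placeholder command (`sleep`) stands in for the real restore_command or archive_command. The function name `run_and_quit` is made up for this sketch. The core size limit is forced to 0 in the child, so the experiment itself leaves no core file behind.

```python
import os
import resource
import signal
import subprocess
import time


def run_and_quit(command, delay=0.5):
    """Run `sh -c command`, send SIGQUIT to its process group, return the exit code."""

    def child_setup():
        # Runs in the child just before exec.
        os.setsid()                                     # own session and process group
        signal.signal(signal.SIGQUIT, signal.SIG_DFL)   # do not inherit an ignored SIGQUIT
        # Limit 0 means no core file; raise it to reproduce a real core.
        resource.setrlimit(resource.RLIMIT_CORE, (0, 0))

    proc = subprocess.Popen(["sh", "-c", command], preexec_fn=child_setup)
    time.sleep(delay)                        # let the shell start its child
    os.killpg(proc.pid, signal.SIGQUIT)      # signal sh and everything it spawned
    return proc.wait()                       # negative value -N: killed by signal N


if __name__ == "__main__":
    print("shell ended with", run_and_quit("sleep 5; exit 0"))
```

A return code equal to `-signal.SIGQUIT` means the shell was killed by the signal rather than reaching its own `exit`, which is the situation the `sh-QUIT` core names suggest.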

--
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
GDB of first instance of corefile.
# gdb /bin/bash CFPU-1-7919-595e59a9-sh-QUIT.core
GNU gdb (Wind River Linux G++ 4.4a-470) 7.2.50.20100908-cvs
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "mips64-wrs-linux-gnu".
For bug reporting instructions, please see:
<supp...@windriver.com>...
Reading symbols from /bin/bash...(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 7919]
Reading symbols from /lib64/libreadline.so.5...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libreadline.so.5
Reading symbols from /lib64/libhistory.so.5...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libhistory.so.5
Reading symbols from /lib64/libncurses.so.5...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libncurses.so.5
Reading symbols from /lib64/libdl.so.2...Reading symbols from 
/mnt/sysimg/usr/lib/debug/lib64/libdl-2.11.1.so.debug...(no debugging symbols 
found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libc.so.6...Reading symbols from 
/mnt/sysimg/usr/lib/debug/lib64/libc-2.11.1.so.debug...done.
done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/libtinfo.so.5...(no debugging symbols found)...done.
Loaded symbols for /lib64/libtinfo.so.5
Reading symbols from /lib64/ld.so.1...Reading symbols from 
/mnt/sysimg/usr/lib/debug/lib64/ld-2.11.1.so.debug...(no debugging symbols 
found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/ld.so.1
Core was generated by `sh -c exit 1'.
Program terminated with signal 3, Quit.
#0  0x005558246a80 in _dl_lookup_symbol_x () from /lib64/ld.so.1
(gdb) bt full
#0  0x005558246a80 in _dl_lookup_symbol_x () from /lib64/ld.so.1
No symbol table info available.
#1  0x00555824816c in _dl_relocate_object () from /lib64/ld.so.1
No symbol table info available.
#2  0x00555823fb6c in dl_main () from /lib64/ld.so.1
No symbol table info available.
#3  0x005558254214 in _dl_sysdep_start () from /lib64/ld.so.1
No symbol table info available.
#4  0x00555823d1b0 in _dl_start_final () from /lib64/ld.so.1
No symbol table info available.
#5  0x00555823d3f0 in _dl_start () from /lib64/ld.so.1
No symbol table info available.
#6  0x00555823cc10 in __start () from /lib64/ld.so.1
No symbol table info available.
Backtrace stopped: frame did not save the PC
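
The core file names seem to carry useful metadata. The sketch below assumes the layout `<host>-<pid>-<hex epoch seconds>-<command>-<signal>.core`; this is inferred only from the fact that 7919 in the file name matches `[New LWP 7919]` in the gdb output, and is not documented anywhere in this thread. The names `CoreName` and `parse_core_name` are hypothetical.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple


class CoreName(NamedTuple):
    host: str
    pid: int
    when: datetime
    command: str
    signal_name: str


# Assumed layout: <host>-<pid>-<8 hex digits of epoch seconds>-<command>-<signal>.core
_CORE_RE = re.compile(
    r"^(?P<host>.+)-(?P<pid>\d+)-(?P<ts>[0-9a-f]{8})-(?P<cmd>[^-]+)-(?P<sig>[A-Z]+)\.core$"
)


def parse_core_name(name):
    """Split a core file name into its assumed components."""
    match = _CORE_RE.match(name)
    if match is None:
        raise ValueError("unrecognised core file name: %r" % name)
    return CoreName(
        host=match.group("host"),
        pid=int(match.group("pid")),
        when=datetime.fromtimestamp(int(match.group("ts"), 16), tz=timezone.utc),
        command=match.group("cmd"),
        signal_name=match.group("sig"),
    )


if __name__ == "__main__":
    print(parse_core_name("CFPU-1-7919-595e59a9-sh-QUIT.core"))
```

If the assumption holds, the decoded time can be lined up against the postmaster log to see which shutdown or restart each core belongs to.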






GDB of second instance of corefile.
# gdb /bin/bash CFPU-1-15638-595e5efb-sh-QUIT.core   
GNU gdb (Wind River Linux G++ 4.4a-470) 7.2.50.20100908-cvs
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "mips64-wrs-linux-gnu".
For bug reporting instructions, please see:
<supp...@windriver.com>...
Reading symbols from /bin/bash...(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 15638]
Reading symbols from /lib64/libreadline.so.5...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libreadline.so.5
Rea

Re: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core

2017-07-03 Thread K S, Sandhya (Nokia - IN/Bangalore)
Hi Craig,

Thanks for the response.

The scenario tried here is restarting the system multiple times. The sh-QUIT 
core is generated when Postgres invokes the shell to exit, so it may not be due 
to kernel or file system issues. I will try to reproduce the issue while 
capturing dmesg output.

However, is there any place in Postgres where 'sh -c exit 1' is invoked?

Regards,
Sandhya

-----Original Message-----
From: Craig Ringer [mailto:cr...@2ndquadrant.com] 
Sent: Friday, June 30, 2017 5:40 PM
To: K S, Sandhya (Nokia - IN/Bangalore) <sandhya@nokia.com>
Cc: pgsql-hackers@postgresql.org; pgsql-b...@postgresql.org; T, Rasna (Nokia - 
IN/Bangalore) <rasn...@nokia.com>; Itnal, Prakash (Nokia - IN/Bangalore) 
<prakash.it...@nokia.com>
Subject: Re: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core

On 30 June 2017 at 17:41, K S, Sandhya (Nokia - IN/Bangalore)
<sandhya@nokia.com> wrote:

> When we checked the process listing during the time of core generation, we
> found Postgres startup process is invoking “sh -c exit 1”:
> 4518  9249  0.1  0.0 155964  2036 ?  Ss  15:05  0:00 postgres: startup process   waiting for 000102EB

Looks like an archive_command or restore_command .

If 'sh' is dumping core, you probably have issues at a low level in
the kernel, file system, etc. Check dmesg.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Postgres process invoking exit resulting in sh-QUIT core

2017-06-30 Thread K S, Sandhya (Nokia - IN/Bangalore)
Hi,

We are using Postgres version 9.3.14 on a Linux-based OS, and we observe 
sh-QUIT core files at random when restarting the system (seen about once in 
30 restarts).
The backtrace is shown below:

Loaded symbols for /lib64/ld.so.1
Core was generated by `sh -c exit 1'.
Program terminated with signal 3, Quit.
#0  0x005559ed78f0 in do_lookup_x () from /lib64/ld.so.1
(gdb) bt
#0  0x005559ed78f0 in do_lookup_x () from /lib64/ld.so.1
#1  0x005559ed7b88 in _dl_lookup_symbol_x () from /lib64/ld.so.1
#2  0x005559ed916c in _dl_relocate_object () from /lib64/ld.so.1
#3  0x005559ed0b6c in dl_main () from /lib64/ld.so.1
#4  0x005559ee5214 in _dl_sysdep_start () from /lib64/ld.so.1
#5  0x005559ece1b0 in _dl_start_final () from /lib64/ld.so.1
#6  0x005559ece3f0 in _dl_start () from /lib64/ld.so.1
#7  0x005559ecdc10 in __start () from /lib64/ld.so.1
Backtrace stopped: frame did not save the PC
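
Every frame in this backtrace comes from `/lib64/ld.so.1`, which suggests the signal arrived while the dynamic loader was still relocating the freshly exec'd `sh`, before its `main()` ran. When several cores need triage, that check can be automated. The sketch below is a hypothetical helper (the names `frames` and `died_in_loader` are invented here) that parses `bt` output in the simple form shown above, without debug symbols, and reports whether all frames sit in the loader.

```python
import re

# Matches lines such as: "#0  0x005559ed78f0 in do_lookup_x () from /lib64/ld.so.1"
_FRAME_RE = re.compile(
    r"^#(\d+)\s+(?:0x[0-9a-f]+ in )?(\S+) \(.*?\)(?: from (\S+))?"
)


def frames(bt_text):
    """Return (frame number, function, library or None) for each frame line."""
    result = []
    for line in bt_text.splitlines():
        match = _FRAME_RE.match(line.strip())
        if match:
            result.append((int(match.group(1)), match.group(2), match.group(3)))
    return result


def died_in_loader(bt_text, loader="ld.so"):
    """True if there is at least one frame and every frame is in the loader."""
    parsed = frames(bt_text)
    return bool(parsed) and all(
        lib is not None and loader in lib for _, _, lib in parsed
    )
```

A backtrace that contains frames from libc or from the program itself would return False and deserves a closer look than one that never left the loader.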

When we checked the process listing during the time of core generation, we 
found Postgres startup process is invoking "sh -c exit 1":
4518  9249  0.1  0.0 155964  2036 ?  Ss  15:05  0:00 postgres: startup process   waiting for 000102EB
4518 10288  0.0  0.0   3600   508 ?  S   15:11  0:00  \_ sh -c exit 1

We tried disabling the DB and running the same test case, and no core was 
generated.
Also, we are using immediate shutdown mode, which uses SIGQUIT.
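
As a reminder of how the shutdown modes map to signals, the PostgreSQL documentation describes smart shutdown as SIGTERM, fast shutdown as SIGINT and immediate shutdown as SIGQUIT sent to the postmaster. A small lookup, written as an illustrative sketch (the names `SHUTDOWN_SIGNALS` and `shutdown_signal` are invented here), makes the relationship explicit:

```python
import signal

# Mode names as accepted by `pg_ctl stop -m <mode>`.
SHUTDOWN_SIGNALS = {
    "smart": signal.SIGTERM,
    "fast": signal.SIGINT,
    "immediate": signal.SIGQUIT,
}


def shutdown_signal(mode):
    """Return the signal the postmaster receives for a given shutdown mode."""
    try:
        return SHUTDOWN_SIGNALS[mode]
    except KeyError:
        raise ValueError("unknown shutdown mode: %r" % mode)
```

Only the immediate mode involves SIGQUIT, so repeating the test with a fast shutdown would be one way to check whether the sh-QUIT cores depend on the shutdown mode.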

Can you please help us debug the issue?

Regards,
Sandhya





Re: [HACKERS] [BUGS] Crash observed during the start of the Postgres process

2017-06-08 Thread K S, Sandhya (Nokia - IN/Bangalore)
Hi Tom Lane,

After removing our patch that changed FATAL to LOG, we no longer observe the 
crash.

Thank you for your support. We were stuck with this issue for a while.

Regards,
Sandhya

-----Original Message-----
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Friday, May 12, 2017 10:08 PM
To: K S, Sandhya (Nokia - IN/Bangalore) <sandhya@nokia.com>
Cc: 'Merlin Moncure' <mmonc...@gmail.com>; 'pgsql-hackers@postgresql.org' 
<pgsql-hackers@postgresql.org>; 'pgsql-b...@postgresql.org' 
<pgsql-b...@postgresql.org>; Itnal, Prakash (Nokia - IN/Bangalore) 
<prakash.it...@nokia.com>; T, Rasna (Nokia - IN/Bangalore) <rasn...@nokia.com>
Subject: Re: [BUGS] Crash observed during the start of the Postgres process

"K S, Sandhya (Nokia - IN/Bangalore)" <sandhya@nokia.com> writes:
> I have filtered the logs based on PID (19825) to see if this helps to
> debug the issue further.

Is this really a stock Postgres build?

The proximate cause of the PANIC is that the startup process is seeing
other processes active even though it hasn't reachedConsistency.  This
is bad on any number of levels, quite aside from that particular PANIC,
because those other processes are presumably seeing non-consistent
database state.  Looking elsewhere in the log, we see that indeed there
seem to be several backend processes happily executing commands.
For instance, here's the trace of one of them starting up:

[19810-58f473ff.4d62-187] 2017-04-17 07:51:28.783 GMT < > DEBUG:  0: forked 
new backend, pid=19850 socket=10
[19810-58f473ff.4d62-188] 2017-04-17 07:51:28.783 GMT < > LOCATION:  
BackendStartup, postmaster.c:3884
[19850-58f47400.4d8a-1] 2017-04-17 07:51:28.783 GMT <  > LOG:  57P03: the 
database system is starting up
[19850-58f47400.4d8a-2] 2017-04-17 07:51:28.783 GMT <  > LOCATION:  
ProcessStartupPacket, postmaster.c:2143
[19850-58f47400.4d8a-3] 2017-04-17 07:51:28.784 GMT <  authentication> DEBUG:  
0: postgres child[19850]: starting with (

Now, that LOG message proves that this backend has observed that the
database is not ready to allow connections.  So why did it only emit the
message as LOG and keep going?  The code for this in 9.3 looks like

    /*
     * If we're going to reject the connection due to database state, say so
     * now instead of wasting cycles on an authentication exchange. (This also
     * allows a pg_ping utility to be written.)
     */
    switch (port->canAcceptConnections)
    {
        case CAC_STARTUP:
            ereport(FATAL,
                    (errcode(ERRCODE_CANNOT_CONNECT_NOW),
                     errmsg("the database system is starting up")));
            break;
        ...

I can't draw any other conclusion but that you've hacked something
to make FATAL act like LOG.  Which is a fatal mistake.  Errors that
are marked FATAL are generally ones where allowing the process to
keep going is not an acceptable outcome.
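
To make the consequence concrete, here is a toy model of the control flow. It is not PostgreSQL code: `Level`, `BackendExit`, `ereport` and `process_startup_packet` are simplified stand-ins written for this illustration, and the numeric level values are arbitrary. The only property modelled is that a FATAL report ends the process while a LOG report returns to the caller.

```python
import enum


class Level(enum.IntEnum):
    # Illustrative ordering only; not the real elevel constants.
    LOG = 1
    FATAL = 2


class BackendExit(Exception):
    """Stands in for the backend process exiting."""


def ereport(level, message, log):
    """Record the message; anything at FATAL or above never returns."""
    log.append("%s: %s" % (level.name, message))
    if level >= Level.FATAL:
        raise BackendExit(message)


def process_startup_packet(can_accept, rejection_level, log):
    """Reject the connection if the database is not ready, else authenticate."""
    if not can_accept:
        ereport(rejection_level, "the database system is starting up", log)
    # Reached only if the rejection above returned, or was not needed.
    log.append("authentication exchange")
    return "authenticated"
```

With `rejection_level=Level.FATAL` the function never reaches the authentication step while the database is starting up. With `Level.LOG` it logs the same message and carries on, which matches the trace above where a backend reports "the database system is starting up" and then proceeds to authentication.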

regards, tom lane




Re: [HACKERS] Crash observed during the start of the Postgres process

2017-04-25 Thread K S, Sandhya (Nokia - IN/Bangalore)
Hello,

Did you get a chance to take a look into the issue?

Please consider it with high priority. We will be awaiting your inputs.

Regards,
Sandhya

_
From: K S, Sandhya (Nokia - IN/Bangalore)
Sent: Thursday, April 20, 2017 1:36 PM
To: pgsql-hackers@postgresql.org; 't...@sss.pgh.pa.us' <t...@sss.pgh.pa.us>; 
'pgsql-b...@postgresql.org' <pgsql-b...@postgresql.org>
Cc: Itnal, Prakash (Nokia - IN/Bangalore) <prakash.it...@nokia.com>; T, Rasna 
(Nokia - IN/Bangalore) <rasn...@nokia.com>
Subject: Crash observed during the start of the Postgres process


Hi ,

In Postgres 9.3.14, we are encountering a crash during startup of the Postgres 
process. The logs state that it is trying to reap a dead process.
The postgres debug logs and the bt of the crash are attached.


 << File: postgresql_crash_debug.log >>  << File: bt of Postgres crash.txt >>

Please help in analysing and debugging the issue.

Regards,
Sandhya





Re: [HACKERS] [BUGS] Crash observed during the start of the Postgres process

2017-04-25 Thread K S, Sandhya (Nokia - IN/Bangalore)
Hi Merlin,

Below is the log captured when the crash was encountered. 

STATEMENT:  select count(1) from pg_ls_dir(current_setting('data_directory')) 
where pg_ls_dir = 'backup_label'
LOG:  0: duration: 4.083 ms
LOCATION:  exec_simple_query, postgres.c:1145
DEBUG:  0: shmem_exit(0): 7 callbacks to make
LOCATION:  shmem_exit, ipc.c:212
DEBUG:  0: proc_exit(0): 3 callbacks to make
LOCATION:  proc_exit_prepare, ipc.c:184
DEBUG:  0: exit(0)
LOCATION:  proc_exit, ipc.c:135
DEBUG:  0: shmem_exit(-1): 0 callbacks to make
LOCATION:  shmem_exit, ipc.c:212
DEBUG:  0: proc_exit(-1): 0 callbacks to make
LOCATION:  proc_exit_prepare, ipc.c:184
DEBUG:  0: reaping dead processes
LOCATION:  reaper, postmaster.c:2669
DEBUG:  0: server process (PID 11104) exited with exit code 0
LOCATION:  LogChildExit, postmaster.c:3385
DEBUG:  0: reaping dead processes
LOCATION:  reaper, postmaster.c:2669
LOG:  0: startup process (PID 10265) was terminated by signal 6: Aborted
LOCATION:  LogChildExit, postmaster.c:3407
LOG:  0: terminating any other active server processes
LOCATION:  HandleChildCrash, postmaster.c:3134
DEBUG:  0: sending SIGQUIT to process 10994
LOCATION:  HandleChildCrash, postmaster.c:3233
DEBUG:  0: sending SIGQUIT to process 10269
LOCATION:  HandleChildCrash, postmaster.c:3263
DEBUG:  0: sending SIGQUIT to process 10268
LOCATION:  HandleChildCrash, postmaster.c:3275
WARNING:  57P02: terminating connection because of crash of another server 
process
DETAIL:  The postmaster has commanded this server process to roll back the 
current transaction and exit, because another server process exited abnormally 
and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat 
your command.
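
When the debug log is long, the line that matters is the one reporting a child terminated by a signal. The sketch below is a hypothetical helper (the name `find_crashes` is invented here) that scans log text for the "was terminated by signal N" wording seen in the excerpt above and returns which process died and from which signal.

```python
import re
import signal

# Matches e.g. "startup process (PID 10265) was terminated by signal 6"
_CRASH_RE = re.compile(
    r"(?P<who>[a-z ]+?) \(PID (?P<pid>\d+)\) was terminated by signal (?P<sig>\d+)"
)


def find_crashes(log_text):
    """Return (process description, pid, signal name) for each crashed child."""
    crashes = []
    for match in _CRASH_RE.finditer(log_text):
        number = int(match.group("sig"))
        crashes.append(
            (
                match.group("who").strip(),
                int(match.group("pid")),
                signal.Signals(number).name,
            )
        )
    return crashes
```

Applied to the excerpt above, this singles out the startup process (PID 10265) and signal 6, which is consistent with the `abort()` frame in the backtrace that follows.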


Backtrace of the core generated:
(gdb) bt
#0  0x005563bcf9c0 in raise () from /lib64/libc.so.6
#1  0x005563bd42bc in abort () from /lib64/libc.so.6
#2  0x00012039e228 in errfinish ()
#3  0x00012039ef08 in elog_finish ()
#4  0x00012009eb08 in btree_redo ()
#5  0x0001200caff8 in StartupXLOG ()
#6  0x000120259958 in StartupProcessMain ()
#7  0x0001200d590c in AuxiliaryProcessMain ()
#8  0x000120253434 in ?? ()

Let me know if any further clarification/information is needed.

Regards,
Sandhya

-----Original Message-----
From: Merlin Moncure [mailto:mmonc...@gmail.com] 
Sent: Tuesday, April 25, 2017 8:01 PM
To: K S, Sandhya (Nokia - IN/Bangalore) <sandhya@nokia.com>
Cc: pgsql-hackers@postgresql.org; t...@sss.pgh.pa.us; 
pgsql-b...@postgresql.org; Itnal, Prakash (Nokia - IN/Bangalore) 
<prakash.it...@nokia.com>; T, Rasna (Nokia - IN/Bangalore) <rasn...@nokia.com>
Subject: Re: [BUGS] Crash observed during the start of the Postgres process

On Tue, Apr 25, 2017 at 8:44 AM, K S, Sandhya (Nokia - IN/Bangalore)
<sandhya@nokia.com> wrote:
> Hello,
>
> Did you get a chance to take a look into the issue?
>
> Please consider it with high priority. We will be awaiting your inputs.

This email is heavily cross posted, which is obnoxious.  Can you paste
the relevant log snippet?

merlin



Re: [HACKERS] Postgres abort found in 9.3.11

2016-11-22 Thread K S, Sandhya (Nokia - IN/Bangalore)
Hello,

The setup uses a hot-standby architecture, and the issue is seen during a 
normal run under a normal load of 50% insert and 50% delete operations.
During startup of the standby node, we copy the data directory from the active 
postgres using pg_basebackup.

Meanwhile, we are trying to create a test bed for people to try.

Regards,
Sandhya

-----Original Message-----
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Tuesday, November 22, 2016 1:47 AM
To: K S, Sandhya (Nokia - IN/Bangalore) <sandhya@nokia.com>
Cc: pgsql-hackers@postgresql.org; Itnal, Prakash (Nokia - IN/Bangalore) 
<prakash.it...@nokia.com>
Subject: Re: [HACKERS] Postgres abort found in 9.3.11

"K S, Sandhya (Nokia - IN/Bangalore)" <sandhya@nokia.com> writes:
> As suggested by you, we upgraded the postgres to version 9.3.14. Also we 
> removed all the patches we had applied before. But the issue is still 
> observed in the latest version as well.

Can you make a test case for other people to try?

regards, tom lane




Re: [HACKERS] Postgres abort found in 9.3.11

2016-11-21 Thread K S, Sandhya (Nokia - IN/Bangalore)
Hello,

As you suggested, we upgraded postgres to version 9.3.14 and removed all the 
patches we had applied before. The issue is still observed in the latest 
version as well.

The issue is seen during a normal run and is observed only on the standby node.

This time as well, the same error log is observed:
node-1 postgres[8743]: [18-1] PANIC:  btree_xlog_delete_get_latestRemovedXid: 
cannot operate with inconsistent data

Can you please share your inputs which would help us proceed further?

Regards,
Sandhya
 
-----Original Message-----
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Friday, September 16, 2016 1:29 AM
To: K S, Sandhya (Nokia - IN/Bangalore) <sandhya@nokia.com>
Cc: pgsql-hackers@postgresql.org; Itnal, Prakash (Nokia - IN/Bangalore) 
<prakash.it...@nokia.com>
Subject: Re: [HACKERS] Postgres abort found in 9.3.11

"K S, Sandhya (Nokia - IN/Bangalore)" <sandhya@nokia.com> writes:
> We tried to replicate the scenario without our patch(exiting postmaster) and 
> still we were able to see the issue.

> Same error was seen this time as well.
> node-0 postgres[8243]: [1-2] HINT:  Is another postmaster already running on 
> port 5433? If not, wait a few seconds and retry.  
> node-1 postgres[8650]: [18-1] PANIC:  btree_xlog_delete_get_latestRemovedXid: 
> cannot operate with inconsistent data

> Crash was not seen in 9.3.9 without the patch but it was reproduced in 9.3.11.
> So something specifically changed between 9.3.9 and 9.3.11 is causing the 
> issue.

Well, I looked through the git history from 9.3.9 to 9.3.11 and I don't
see anything that seems likely to explain a problem here.

If you can reproduce this, which it sounds like you can, maybe you could
create a self-contained test case for other people to try?

Also worth noting is that the current 9.3.x release is 9.3.14.  You
might save yourself some time by updating and seeing if it still
reproduces in 9.3.14.

regards, tom lane




Re: [HACKERS] Postgres abort found in 9.3.11

2016-09-15 Thread K S, Sandhya (Nokia - IN/Bangalore)
Hello,

We tried to replicate the scenario without our patch (the one exiting the 
postmaster), and we were still able to see the issue.

Same error was seen this time as well.
node-0 postgres[8243]: [1-2] HINT:  Is another postmaster already running on 
port 5433? If not, wait a few seconds and retry.  
node-1 postgres[8650]: [18-1] PANIC:  btree_xlog_delete_get_latestRemovedXid: 
cannot operate with inconsistent data

The crash was not seen in 9.3.9 without the patch, but it was reproduced in 
9.3.11. So something that changed between 9.3.9 and 9.3.11 is causing the issue.

Thanks in advance!!!

Sandhya

-----Original Message-----
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Tuesday, September 06, 2016 5:04 PM
To: K S, Sandhya (Nokia - IN/Bangalore) <sandhya@nokia.com>
Cc: pgsql-hackers@postgresql.org; Itnal, Prakash (Nokia - IN/Bangalore) 
<prakash.it...@nokia.com>
Subject: Re: [HACKERS] Postgres abort found in 9.3.11

"K S, Sandhya (Nokia - IN/Bangalore)" <sandhya@nokia.com> writes:
> I was able to find a patch file where there is a call to ExitPostmaster() in 
> postmaster.c . 

> @@ -3081,6 +3081,11 @@
> shmem_exit(1);
> reset_shared(PostPortNumber);
 
> +   /* recovery termination */
> +   ereport(FATAL,
> +   (errmsg("recovery termination due to process crash")));
> +   ExitPostmaster(99);
> +
> StartupPID = StartupDataBase();
> Assert(StartupPID != 0); 
> pmState = PM_STARTUP;

There's no such code in the community sources, and I can't say that
such a patch looks like a bright idea to me.  It would disable any
restart after a crash (not only during recovery).

If you're running a version with assorted random non-community patches,
we can't really offer much support for that.

regards, tom lane




Re: [HACKERS] Postgres abort found in 9.3.11

2016-09-06 Thread K S, Sandhya (Nokia - IN/Bangalore)
Hello,

I was able to find a patch file where there is a call to ExitPostmaster() in 
postmaster.c . 

@@ -3081,6 +3081,11 @@
shmem_exit(1);
reset_shared(PostPortNumber);
 
+   /* recovery termination */
+   ereport(FATAL,
+   (errmsg("recovery termination due to process crash")));
+   ExitPostmaster(99);
+
StartupPID = StartupDataBase();
Assert(StartupPID != 0); 
pmState = PM_STARTUP;

But this patch has been there since 2009, when Postgres was upgraded to 9.0. 
I am checking why it was introduced in the first place.
The question remains why the issue is not seen in version 9.3.9 but exists in 
9.3.11.

Also, the case of standalone recovery is taken care of by this patch file.

"err-3" is part of the postgres source code (nbtxlog.c). Two different lines 
were combined, which probably led to the confusion.
Aug 22 11:44:52.065760 crit node-1 postgres[8629]: [18-1] err-3:  
btree_xlog_delete_get_latestRemovedXid: cannot operate with inconsistent data
Aug 22 11:44:52.065971 crit node-1 postgres[8629]: [18-2] CONTEXT:  xlog redo 
delete: index 1663/16386/17378; iblk 1, heap 1663/16386/16518;

Thanks in advance!!!
Sandhya


-----Original Message-----
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Thursday, September 01, 2016 7:19 PM
To: K S, Sandhya (Nokia - IN/Bangalore) <sandhya@nokia.com>
Cc: pgsql-hackers@postgresql.org; Itnal, Prakash (Nokia - IN/Bangalore) 
<prakash.it...@nokia.com>
Subject: Re: [HACKERS] Postgres abort found in 9.3.11

"K S, Sandhya (Nokia - IN/Bangalore)" <sandhya@nokia.com> writes:
> Our setup is a hot-standby architecture. This crash is occurring only on 
> stand-by node. Postgres continues to run without any issues on active node.
> Postmaster is waiting for a start and is throwing this message.

> Aug 22 11:44:21.462555 info node-0 postgres[8222]: [1-2] HINT:  Is another 
> postmaster already running on port 5433? If not, wait a few seconds and 
> retry.  
> Aug 22 11:44:52.065760 crit node-1 postgres[8629]: [18-1] err-3:  
> btree_xlog_delete_get_latestRemovedXid: cannot operate with inconsistent 
> dataAug 22 11:44:52.065971 crit CFPU-1 postgres[8629]: [18-2] CONTEXT:  xlog 
> redo delete: index 1663/16386/17378; iblk 1, heap 1663/16386/16518;

Hmm, that HINT seems to be the tail end of a message indicating that the
postmaster is refusing to start because of an existing postmaster.  Why
is that appearing?  If you've got some script that's overeagerly launching
and killing postmasters, maybe that's the ultimate cause of problems.

The only method I've heard of for getting that get_latestRemovedXid
error is to try to launch a standalone backend (postgres --single)
in a standby server directory.  We don't support that, cf
https://www.postgresql.org/message-id/flat/00F0B2CEF6D0CEF8A90119D4%40eje.credativ.lan
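
A script that launches standalone backends could guard against this case. The sketch below is only an assumption-based illustration: it treats the presence of `recovery.conf` in the data directory as the sign that the cluster will start in recovery, which holds for the 9.x releases discussed here, and the helper names `is_standby_directory` and `single_user_command` are invented for this example.

```python
import os


def is_standby_directory(datadir):
    """Heuristic for 9.x clusters: recovery.conf present means recovery on startup."""
    return os.path.exists(os.path.join(datadir, "recovery.conf"))


def single_user_command(datadir, dbname="postgres"):
    """Build the argv for a standalone backend, refusing standby directories."""
    if is_standby_directory(datadir):
        raise RuntimeError(
            "refusing to run postgres --single in %s: recovery.conf found" % datadir
        )
    return ["postgres", "--single", "-D", datadir, dbname]
```

The function only builds the command line; running it is left to the caller, so the check can be dropped into an existing start script without changing how the server is launched.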

BTW, I'm curious about the "err-3:" part.  That would not be expected
in any standard build of Postgres ... is this something custom modified?

regards, tom lane




Re: [HACKERS] Postgres abort found in 9.3.11

2016-09-02 Thread K S, Sandhya (Nokia - IN/Bangalore)
Hello Tom,

Apologies for delayed reply.

Our setup is a hot-standby architecture. This crash is occurring only on 
stand-by node. Postgres continues to run without any issues on active node.
Postmaster is waiting for a start and is throwing this message.

Aug 22 11:44:21.462555 info node-0 postgres[8222]: [1-2] HINT:  Is another 
postmaster already running on port 5433? If not, wait a few seconds and retry.  
Aug 22 11:44:52.065760 crit node-1 postgres[8629]: [18-1] err-3:  
btree_xlog_delete_get_latestRemovedXid: cannot operate with inconsistent 
dataAug 22 11:44:52.065971 crit CFPU-1 postgres[8629]: [18-2] CONTEXT:  xlog 
redo delete: index 1663/16386/17378; iblk 1, heap 1663/16386/16518;
Aug 22 11:44:52.085486 info node-1 coredumper: Generating core file 

The standby postgres recovers automatically on the next restart, because we 
always copy the DB afresh from the active node on restart.

We implemented a patch to force-kill the walsender on the active side. This 
avoids a prolonged wait if the standby node is not reachable (e.g. forced power 
off or LAN cable removal). This implementation has existed for a long time; 
however, the issue was observed only recently, after upgrading to 9.3.11. Do 
you think this force kill of the walsender might lead to such issues in the 
latest postgres?


Regards,
Sandhya

-----Original Message-----
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Tuesday, August 30, 2016 5:09 PM
To: K S, Sandhya (Nokia - IN/Bangalore) <sandhya@nokia.com>
Cc: pgsql-hackers@postgresql.org; Itnal, Prakash (Nokia - IN/Bangalore) 
<prakash.it...@nokia.com>
Subject: Re: [HACKERS] Postgres abort found in 9.3.11

"K S, Sandhya (Nokia - IN/Bangalore)" <sandhya@nokia.com> writes:
> During the server restart, we are getting postgres crash with sigabrt. No 
> other operation being performed.
> Attached the backtrace.

What shows up in the postmaster log?

> The occurrence is occasional. The issue is seen once in 30~50 times.

Does it successfully restart if you try again?  If not, what are you
doing to recover?

regards, tom lane




[HACKERS] Postgres abort found in 9.3.11

2016-08-30 Thread K S, Sandhya (Nokia - IN/Bangalore)
Hello,

During server restart, postgres crashes with SIGABRT. No other operation is 
being performed.
The backtrace is attached.


The occurrence is occasional: the issue is seen about once in 30 to 50 restarts.
We recently upgraded postgres from 9.3.9 to 9.3.11. The issue is not seen in 
9.3.9.

Postgres server version: 9.3.11
Target architecture: mips-64

We checked the differences between 9.3.9 and 9.3.11, and nothing relevant seems 
to be causing the crash.

Please help in resolving the issue, as we are not familiar with the postgres 
code.
Also, if you see any relevant difference between 9.3.9 and 9.3.11 that might 
point to this issue, please let us know.

Thanks in advance!!!
Sandhya



#0  0x005558e909c0 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x005558e909c0 in raise () from /lib64/libc.so.6
#1  0x005558e952bc in abort () from /lib64/libc.so.6
#2  0x00012039db88 in errfinish ()
#3  0x00012039e868 in elog_finish ()
#4  0x00012009ea08 in btree_redo ()
#5  0x0001200cb178 in StartupXLOG ()
#6  0x000120259838 in StartupProcessMain ()
#7  0x0001200d5b3c in AuxiliaryProcessMain ()
#8  0x000120253314 in ?? ()

