Re: PATCH: backtraces for error messages

2018-06-25 Thread Kyotaro HORIGUCHI
Hello. Thanks for discussing.

At Mon, 25 Jun 2018 17:08:41 +0800, Craig Ringer wrote:
> On 25 June 2018 at 14:21, Kyotaro HORIGUCHI wrote:
> > > I think it's pretty strongly desirable for PANIC.
> >
> > Ah, I forgot about that. I agree. The cost of collecting the
> > information is not a problem for PANIC. However, I still don't
> > think a stack trace is valuable for every PANIC message. I can
> > accept a GUC to control it, but it would be preferable if this
> > worked without such extra setup.
> >
> 
> Places such as?

Sorry but I didn't get this.

> > Agreed, that is the reality.  Instead, can't we make new error
> > classes PANIC_STACKDUMP and ERROR_STACKDUMP to explicitly
> > restrict stack dumps in elog()?  Or an elog_stackdump() alongside
> > elog() would also be fine with me. Using it is easier than proper
> > annotation. It would be perfect if we could invent an automated
> > way, but I don't think it is realistic.
> >
> 
> That needlessly complicates error severity levels with information not
> really related to the severity. -1 from me.

Ok.

> > Mmm. If I understand you correctly, my point was that perf doesn't
> > dump a backtrace at a probe point, but tracepoints can be used to
> > take a symbolic backtrace. (I'm sorry that I cannot provide an
> > example, since stap doesn't work on my box.)
> >
> 
> perf record --call-graph dwarf -e sdt_postgresql:checkpoint__start -u
> postgres
> perf report -g

Good to know, but unfortunately it doesn't work for me. USDT
support is still on the perf wiki's TODO list. Some googling told
me that it is available from Linux 4.4, and CentOS 7's kernel is
3.10, so it seems to be a cutting-edge feature. In any case, I
suppose that what it emits is different from what is wanted here.

But regardless of that, I agree that it is a bit hard to use
on-site. (stap doesn't work for me, for reasons I haven't
determined.)

> > If your intention is to take backtraces without any extra setup
> > (which I think is preferable), it should be restricted to the
> > points that need it. That can be achieved with the additional
> > error classes or with substitute error output functions.
> >
> 
> Who's classifying all the possible points?

I didn't mean classifying all points. I meant that covering only
the points where it is needed would be enough for a first step.

> Which PANICs or assertion failures do you want to exempt?

I didn't intend to cover all required cases, just a few obviously
valuable ones where, for whatever reason, we gave up on adding
adequate context information and a backtrace gives valuable
information. Most xlog stuff wouldn't need it. Lock stuff might.
Heapam doesn't seem to need it, since those errors seem to be
caused by calling order, etc. But I know such control is out of
scope for this patch.

> I definitely do not want to emit stacks for everything, like my patch
> currently does. It's just a proof of concept. Later on I'll want control on
> a fine grained level at runtime of when that happens, but that's out of
> scope for this. For now the goal is to emit stacks at times it's obviously
> pretty sensible to have a stack, and do it in a way that doesn't require
> per-error-site maintenance/changes or create backport hassle.

OK. The PoC focuses on the means of emitting backtraces, and
controlling the logging is a separate issue, as Andres said
upthread.

I think backtrace() generates somewhat strange-looking output and
we can get cleaner output using libunwind, so I vote for libunwind
to implement this. The cost doesn't matter as long as we only emit
this for Asserts or PANICs. Choosing some ERRORs via a GUC is also
fine with me.

> > Just an idea, but can't we use a definition file listing the
> > LOCATIONs of error messages that should dump a backtrace? That
> > list would usually be empty and should be very short if not. The
> > LOCATION information is easily obtained from a verbose error
> > message once it has been shown, but is a bit hard to find
> > otherwise.
> >
> 
> That's again veering into selective logging control territory. Rather than
> doing it for stack dump control only, it should be part of a broader
> control over dynamic and scoped verbosity, selective logging, and log
> options, like Pavan raised. I see stacks as just one knob that can be
> turned on/off here.
> 
> > > (That reminds me, I need to chat with Devrim about creating a longer lived
> > > debuginfo + old versions rpms repo for Pg itself, if not the accessory
> > > bits and pieces. I'm so constantly frustrated by not being able to get
> > > needed debuginfo packages to investigate some core or running system
> > > problem because they've been purged from the PGDG yum repo as soon as a new
> > > version comes out.)
> >
> > In our department we take care to preserve them ourselves, since
> > we need them to support older systems. I sometimes feel it would
> > be very helpful if they were available on the official
> 
> 
> Maybe if I can get some interest in that, you might be willing to
> contribute your archives as a starter so we have them for back-versions?

Re: PATCH: backtraces for error messages

2018-06-25 Thread Korry Douglas
On Sat, Jun 23, 2018 at 8:22 AM, Craig Ringer  wrote:

> On 22 June 2018 at 23:26, Korry Douglas 
> wrote:
>
>> A few misc comments:
>>
>> In general, +1.
>>
>> It might be nice to move the stack trace code into a separate function
>> (not static to elog.c) - I have often found it useful to attach backtraces
>> to objects so I can debug complex allocation/deallocation problems.
>>
>
> Good suggestion.
>
>
>>
>> The major expense in capturing a stack trace is in the symbolization of
>> the stack addresses, not the stack walk itself.  You typically want to
>> symbolize the addresses at the time you generate the trace, but that's not
>> always the case.  For example, if you want to record the stack trace of
>> every call to AllocSetAlloc() (and attach the trace to the AllocChunk)
>> performance gets unbearable if you symbolize every trace.  So a flag that
>> specifies whether to symbolize would be nice too.
>>
>
> libunwind has some nifty caching that makes that a _lot_ cheaper; that's
> part of why I went with it despite the extra lib requirement.
>

I've not used libunwind before - looking through the docs now.  It does
seem much more flexible than backtrace(), thanks.
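
For anyone else who hasn't used it: the local-unwind loop is roughly the
following - just a sketch of the libunwind API, not the code from Craig's
patch (link with -lunwind):

#define UNW_LOCAL_ONLY
#include <libunwind.h>
#include <stdio.h>

/* Walk our own stack and print one line per frame. */
static void
print_backtrace(void)
{
    unw_context_t ctx;
    unw_cursor_t cursor;

    unw_getcontext(&ctx);
    unw_init_local(&cursor, &ctx);

    while (unw_step(&cursor) > 0)
    {
        char        name[256];
        unw_word_t  ip;
        unw_word_t  off;

        unw_get_reg(&cursor, UNW_REG_IP, &ip);
        if (unw_get_proc_name(&cursor, name, sizeof(name), &off) != 0)
            snprintf(name, sizeof(name), "??");
        fprintf(stderr, "%s +0x%lx [ip=0x%lx]\n",
                name, (unsigned long) off, (unsigned long) ip);
    }
}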


>
>
>> If you don't symbolize, you need a way to capture the address range of
>> each dynamically loaded shared object (so that you can later symbolize
>> using something like addr2line).
>>
>
> Yeah. libunwind doesn't expose any interface to get that information, so
> you'd have to use platform APIs, like glibc's dl_iterate_phdr or dladdr, or
> capture /proc/self/maps. You have to make sure that info makes it to the
> log, and is re-output often enough not to be lost to log rotation, and is
> invalidated and re-output if mappings change due to new lib loads etc. But
> you don't want to print it all the time either.
>
> Then you have to walk the stack and print the instruction pointers and
> stack pointers and spit out raw traces for the user to reassemble.
>
> I'd argue that if you're doing the sort of thing where you want a stack of
> every AllocSetAlloc, you shouldn't be trying to do that in-process. That's
> where tools like perf, systemtap, etc really come into their own. I'm
> focused on making additional diagnostic info for errors and key log
> messages collectable for systems that aren't easily accessed, like users
> who have 12-hour read/response latencies and security setups as strict as
> they are insane and nonsensical.
>

Understood - I realized after I replied that instrumenting AllocSetAlloc()
is a bit more complex than I had suggested.  When I need to capture the
call stack of each allocation, I store the stack trace in malloc'ed
buffers; otherwise it becomes recursive (back into AllocSetAlloc()).  I suspect that's
unreasonable for a general-purpose solution.

> I'd have no objection to the option to disable symbolic back traces and
> print the raw call stack. It's trivial to do; in fact, I only removed the
> ip/sp info to keep the output more concise.
>
> I'm not interested in writing anything to handle the library load address
> mapping collection etc though. I don't see a really sensible way to do that
> in a log-based system.
>

Agreed - I tend to use (saved, not logged) stack traces to find memory
leaks that valgrind can't see (or more precisely, excessive resource
consumption that is not actually leaked).

Thanks for your consideration.

 -- Korry


Re: PATCH: backtraces for error messages

2018-06-25 Thread Craig Ringer
On 25 June 2018 at 14:21, Kyotaro HORIGUCHI  wrote:

> Hi.
>
> At Mon, 25 Jun 2018 09:32:36 +0800, Craig Ringer wrote:
> > On 21 June 2018 at 19:09, Kyotaro HORIGUCHI
> > <horiguchi.kyot...@lab.ntt.co.jp> wrote:
> >
> > > I think this is fine for assertion failures, but I'm not sure
> > > about other cases.
> >
> >
> > I think it's pretty strongly desirable for PANIC.
>
> Ah, I forgot about that. I agree. The cost of collecting the
> information is not a problem for PANIC. However, I still don't
> think a stack trace is valuable for every PANIC message. I can
> accept a GUC to control it, but it would be preferable if this
> worked without such extra setup.
>

Places such as?


>
> > > For known cases, we could set a proper context description or
> > > other additional information in the error messages before just
> > > dumping a trace.
> > >
> >
> > Yeah. The trouble there is that there are a _lot_ of places to touch for
> > such things, and inevitably the one you really want to see will be
> > something that didn't get suitably annotated.
>
> Agreed, that is the reality.  Instead, can't we make new error
> classes PANIC_STACKDUMP and ERROR_STACKDUMP to explicitly
> restrict stack dumps in elog()?  Or an elog_stackdump() alongside
> elog() would also be fine with me. Using it is easier than proper
> annotation. It would be perfect if we could invent an automated
> way, but I don't think it is realistic.
>

That needlessly complicates error severity levels with information not
really related to the severity. -1 from me.


> Mmm. If I understand you correctly, my point was that perf doesn't
> dump a backtrace at a probe point, but tracepoints can be used to
> take a symbolic backtrace. (I'm sorry that I cannot provide an
> example, since stap doesn't work on my box.)
>

perf record --call-graph dwarf -e sdt_postgresql:checkpoint__start -u
postgres
perf report -g


> If your intention is to take backtraces without any extra setup
> (which I think is preferable), it should be restricted to the
> points that need it. That can be achieved with the additional
> error classes or with substitute error output functions.
>

Who's classifying all the possible points?

Which PANICs or assertion failures do you want to exempt?

I definitely do not want to emit stacks for everything, like my patch
currently does. It's just a proof of concept. Later on I'll want control on
a fine grained level at runtime of when that happens, but that's out of
scope for this. For now the goal is to emit stacks at times it's obviously
pretty sensible to have a stack, and do it in a way that doesn't require
per-error-site maintenance/changes or create backport hassle.


> Just an idea, but can't we use a definition file listing the
> LOCATIONs of error messages that should dump a backtrace? That
> list would usually be empty and should be very short if not. The
> LOCATION information is easily obtained from a verbose error
> message once it has been shown, but is a bit hard to find
> otherwise.
>

That's again veering into selective logging control territory. Rather than
doing it for stack dump control only, it should be part of a broader
control over dynamic and scoped verbosity, selective logging, and log
options, like Pavan raised. I see stacks as just one knob that can be
turned on/off here.

> > (That reminds me, I need to chat with Devrim about creating a longer lived
> > debuginfo + old versions rpms repo for Pg itself, if not the accessory
> > bits and pieces. I'm so constantly frustrated by not being able to get
> > needed debuginfo packages to investigate some core or running system
> > problem because they've been purged from the PGDG yum repo as soon as a new
> > version comes out.)
>
> In our department we take care to preserve them ourselves, since
> we need them to support older systems. I sometimes feel it would
> be very helpful if they were available on the official


Maybe if I can get some interest in that, you might be willing to
contribute your archives as a starter so we have them for back-versions?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PATCH: backtraces for error messages

2018-06-25 Thread Kyotaro HORIGUCHI
Hi.

At Mon, 25 Jun 2018 09:32:36 +0800, Craig Ringer wrote:
> On 21 June 2018 at 19:09, Kyotaro HORIGUCHI wrote:
> 
> > I think this is fine for assertion failures, but I'm not sure
> > about other cases.
> 
> 
> I think it's pretty strongly desirable for PANIC.

Ah, I forgot about that. I agree. The cost of collecting the
information is not a problem for PANIC. However, I still don't
think a stack trace is valuable for every PANIC message. I can
accept a GUC to control it, but it would be preferable if this
worked without such extra setup.

> > For known cases, we could set a proper context description or
> > other additional information in the error messages before just
> > dumping a trace.
> >
> 
> Yeah. The trouble there is that there are a _lot_ of places to touch for
> such things, and inevitably the one you really want to see will be
> something that didn't get suitably annotated.

Agreed, that is the reality.  Instead, can't we make new error
classes PANIC_STACKDUMP and ERROR_STACKDUMP to explicitly
restrict stack dumps in elog()?  Or an elog_stackdump() alongside
elog() would also be fine with me. Using it is easier than proper
annotation. It would be perfect if we could invent an automated
way, but I don't think it is realistic.

> > Since PG 9.5, the RPMs are compiled with --enable-dtrace, so we
> > could use SystemTap to insert the stack-trace stuff. An additional
> > built-in probe in the error reporting system or in assertions
> > would make this pluggable.
> >
> 
> I don't bother with SystemTAP anymore; perf does the job for most purposes.
> 
> You can use --enable-dtrace SDTs with Perf fine. I wrote it up for Pg at
> https://wiki.postgresql.org/wiki/Profiling_with_perf#PostgreSQL_pre-defined_tracepoint_events
> .
> 
> Dynamic tracepoints are also very useful.
> 
> Both require debuginfo, but so does outputting of symbolic stacks per my
> patch.

Mmm. If I understand you correctly, my point was that perf doesn't
dump a backtrace at a probe point, but tracepoints can be used to
take a symbolic backtrace. (I'm sorry that I cannot provide an
example, since stap doesn't work on my box.)

If your intention is to take backtraces without any extra setup
(which I think is preferable), it should be restricted to the
points that need it. That can be achieved with the additional
error classes or with substitute error output functions.

Just an idea, but can't we use a definition file listing the
LOCATIONs of error messages that should dump a backtrace? That
list would usually be empty and should be very short if not. The
LOCATION information is easily obtained from a verbose error
message once it has been shown, but is a bit hard to find
otherwise.

> (That reminds me, I need to chat with Devrim about creating a longer lived
> debuginfo + old versions rpms repo for Pg itself, if not the accessory
> bits and pieces. I'm so constantly frustrated by not being able to get
> needed debuginfo packages to investigate some core or running system
> problem because they've been purged from the PGDG yum repo as soon as a new
> version comes out.)

In our department we take care to preserve them ourselves, since
we need them to support older systems. I sometimes feel it would
be very helpful if they were available on the official site.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: PATCH: backtraces for error messages

2018-06-24 Thread Craig Ringer
On 21 June 2018 at 19:09, Kyotaro HORIGUCHI  wrote:

> I think this is fine for assertion failures, but I'm not sure
> about other cases.


I think it's pretty strongly desirable for PANIC.


> For known cases, we could set a proper context description or
> other additional information in the error messages before just
> dumping a trace.
>

Yeah. The trouble there is that there are a _lot_ of places to touch for
such things, and inevitably the one you really want to see will be
something that didn't get suitably annotated.


> Since PG 9.5, the RPMs are compiled with --enable-dtrace, so we
> could use SystemTap to insert the stack-trace stuff. An additional
> built-in probe in the error reporting system or in assertions
> would make this pluggable.
>

I don't bother with SystemTAP anymore; perf does the job for most purposes.

You can use --enable-dtrace SDTs with Perf fine. I wrote it up for Pg at
https://wiki.postgresql.org/wiki/Profiling_with_perf#PostgreSQL_pre-defined_tracepoint_events
.

Dynamic tracepoints are also very useful.

Both require debuginfo, but so does outputting of symbolic stacks per my
patch.

(That reminds me, I need to chat with Devrim about creating a longer lived
debuginfo + old versions rpms repo for Pg itself, if not the accessory
bits and pieces. I'm so constantly frustrated by not being able to get
needed debuginfo packages to investigate some core or running system
problem because they've been purged from the PGDG yum repo as soon as a new
version comes out.)

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PATCH: backtraces for error messages

2018-06-23 Thread Craig Ringer
On 23 June 2018 at 20:22, Craig Ringer  wrote:


> Yeah. libunwind doesn't expose any interface to get that information, so
> you'd have to use platform APIs, like glibc's dl_iterate_phdr or dladdr, or
> capture /proc/self/maps.
>

Ahem, I take that part back.

https://stackoverflow.com/a/21584356/398670

see /usr/include/link.h
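
For illustration, a dl_iterate_phdr() callback that logs each loaded
object's base address could look roughly like this - a sketch against the
glibc API from link.h, not code from the patch:

#define _GNU_SOURCE
#include <link.h>
#include <stdio.h>

/* Called once per loaded object; logs its base address and path so that
 * raw frame addresses can later be symbolized with addr2line or similar. */
static int
log_mapping(struct dl_phdr_info *info, size_t size, void *data)
{
    fprintf(stderr, "mapping: base=0x%lx name=%s\n",
            (unsigned long) info->dlpi_addr,
            info->dlpi_name[0] ? info->dlpi_name : "[main executable]");
    return 0;               /* 0 = keep iterating */
}

/* ... and then, wherever appropriate: dl_iterate_phdr(log_mapping, NULL); */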



-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PATCH: backtraces for error messages

2018-06-23 Thread Craig Ringer
On 22 June 2018 at 23:26, Korry Douglas 
wrote:

> A few misc comments:
>
> In general, +1.
>
> It might be nice to move the stack trace code into a separate function
> (not static to elog.c) - I have often found it useful to attach backtraces
> to objects so I can debug complex allocation/deallocation problems.
>

Good suggestion.


>
> The major expense in capturing a stack trace is in the symbolization of
> the stack addresses, not the stack walk itself.  You typically want to
> symbolize the addresses at the time you generate the trace, but that's not
> always the case.  For example, if you want to record the stack trace of
> every call to AllocSetAlloc() (and attach the trace to the AllocChunk)
> performance gets unbearable if you symbolize every trace.  So a flag that
> specifies whether to symbolize would be nice too.
>

libunwind has some nifty caching that makes that a _lot_ cheaper; that's
part of why I went with it despite the extra lib requirement.



> If you don't symbolize, you need a way to capture the address range of
> each dynamically loaded shared object (so that you can later symbolize
> using something like addr2line).
>

Yeah. libunwind doesn't expose any interface to get that information, so
you'd have to use platform APIs, like glibc's dl_iterate_phdr or dladdr, or
capture /proc/self/maps. You have to make sure that info makes it to the
log, and is re-output often enough not to be lost to log rotation, and is
invalidated and re-output if mappings change due to new lib loads etc. But
you don't want to print it all the time either.

Then you have to walk the stack and print the instruction pointers and
stack pointers and spit out raw traces for the user to reassemble.

I'd argue that if you're doing the sort of thing where you want a stack of
every AllocSetAlloc, you shouldn't be trying to do that in-process. That's
where tools like perf, systemtap, etc really come into their own. I'm
focused on making additional diagnostic info for errors and key log
messages collectable for systems that aren't easily accessed, like users
who have 12-hour read/response latencies and security setups as strict as
they are insane and nonsensical.

I'd have no objection to the option to disable symbolic back traces and
print the raw call stack. It's trivial to do; in fact, I only removed the
ip/sp info to keep the output more concise.

I'm not interested in writing anything to handle the library load address
mapping collection etc though. I don't see a really sensible way to do that
in a log-based system.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PATCH: backtraces for error messages

2018-06-22 Thread Korry Douglas
A few misc comments:

In general, +1.

It might be nice to move the stack trace code into a separate function (not
static to elog.c) - I have often found it useful to attach backtraces to
objects so I can debug complex allocation/deallocation problems.

The major expense in capturing a stack trace is in the symbolization of the
stack addresses, not the stack walk itself.  You typically want to
symbolize the addresses at the time you generate the trace, but that's not
always the case.  For example, if you want to record the stack trace of
every call to AllocSetAlloc() (and attach the trace to the AllocChunk)
performance gets unbearable if you symbolize every trace.  So a flag that
specifies whether to symbolize would be nice too.

If you don't symbolize, you need a way to capture the address range of each
dynamically loaded shared object (so that you can later symbolize using
something like addr2line).
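
To illustrate what I mean by raw frames, something like the following (a
sketch using glibc's dladdr(), not part of the proposed patch; link with
-ldl on older glibc) would be enough to feed addr2line later:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

/* Print a raw return address as "object +offset".  For a shared object the
 * offset can be fed straight to addr2line -e <object>; for a non-PIE main
 * binary, addr2line wants the absolute address instead. */
static void
print_raw_frame(void *addr)
{
    Dl_info     info;

    if (dladdr(addr, &info) != 0 && info.dli_fname != NULL)
        fprintf(stderr, "%s +0x%lx\n", info.dli_fname,
                (unsigned long) ((char *) addr - (char *) info.dli_fbase));
    else
        fprintf(stderr, "?? %p\n", addr);
}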

(The above suggestions relate to server debugging, not application
debugging).

End of wish list...

 -- Korry

On Fri, Jun 22, 2018 at 3:07 AM, Craig Ringer  wrote:

> On 22 June 2018 at 08:48, Andres Freund  wrote:
>
>> On 2018-06-22 08:24:45 +0800, Craig Ringer wrote:
>> > On Thu., 21 Jun. 2018, 19:26 Pavan Deolasee, 
>> > wrote:
>> >
>> > >
>> > >
>> > > On Thu, Jun 21, 2018 at 11:02 AM, Michael Paquier <
>> mich...@paquier.xyz>
>> > > wrote:
>> > >
>> > >> On Thu, Jun 21, 2018 at 12:35:10PM +0800, Craig Ringer wrote:
>> > >> > I wrote it because I got sick of Assert(false) debugging, and I was
>> > >> chasing
>> > >> > down some "ERROR:  08P01: insufficient data left in message"
>> errors.
>> > >> Then I
>> > >> > got kind of caught up in it... you know how it is.
>> > >>
>> > >> Yes, I know that feeling!  I have been using as well the
>> Assert(false)
>> > >> and the upgrade-to-PANIC tricks a couple of times, so being able to
>> get
>> > >> more easily backtraces would be really nice.
>> > >>
>> > >>
>> > > Sometime back I'd suggested an idea to be able to dynamically manage
>> log
>> > > levels for elog messages [1].
>> > >
>> >
>> >
>> > Huge +1 from me on being able to selectively manage logging on a
>> > module/subsystem, file, or line level.
>> >
>> > I don't think I saw the post.
>> >
>> > Such a thing would obviously make built in backtrace support much more
>> > useful.
>>
>> I strongly suggest keeping these as separate as possible. Either is
>> useful without the other, and both are nontrivial. Tackling them
>> together imo makes it much more likely to get nowhere.
>>
>>
> Totally agree, and it's why I raised this as its own thing.
>
> Thanks.
>
> --
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


Re: PATCH: backtraces for error messages

2018-06-22 Thread Craig Ringer
On 22 June 2018 at 08:48, Andres Freund  wrote:

> On 2018-06-22 08:24:45 +0800, Craig Ringer wrote:
> > On Thu., 21 Jun. 2018, 19:26 Pavan Deolasee, 
> > wrote:
> >
> > >
> > >
> > > On Thu, Jun 21, 2018 at 11:02 AM, Michael Paquier  >
> > > wrote:
> > >
> > >> On Thu, Jun 21, 2018 at 12:35:10PM +0800, Craig Ringer wrote:
> > >> > I wrote it because I got sick of Assert(false) debugging, and I was
> > >> chasing
> > >> > down some "ERROR:  08P01: insufficient data left in message" errors.
> > >> Then I
> > >> > got kind of caught up in it... you know how it is.
> > >>
> > >> Yes, I know that feeling!  I have been using as well the Assert(false)
> > >> and the upgrade-to-PANIC tricks a couple of times, so being able to
> get
> > >> more easily backtraces would be really nice.
> > >>
> > >>
> > > Sometime back I'd suggested an idea to be able to dynamically manage
> log
> > > levels for elog messages [1].
> > >
> >
> >
> > Huge +1 from me on being able to selectively manage logging on a
> > module/subsystem, file, or line level.
> >
> > I don't think I saw the post.
> >
> > Such a thing would obviously make built in backtrace support much more
> > useful.
>
> I strongly suggest keeping these as separate as possible. Either is
> useful without the other, and both are nontrivial. Tackling them
> together imo makes it much more likely to get nowhere.
>
>
Totally agree, and it's why I raised this as its own thing.

Thanks.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PATCH: backtraces for error messages

2018-06-21 Thread Pavan Deolasee
On Fri, Jun 22, 2018 at 6:18 AM, Andres Freund  wrote:

> On 2018-06-22 08:24:45 +0800, Craig Ringer wrote:
> >
> >
> > Huge +1 from me on being able to selectively manage logging on a
> > module/subsystem, file, or line level.
> >
> > I don't think I saw the post.
> >
> > Such a thing would obviously make built in backtrace support much more
> > useful.
>
> I strongly suggest keeping these as separate as possible. Either is
> useful without the other, and both are nontrivial. Tackling them
> together imo makes it much more likely to get nowhere.
>
>
Sorry, I did not mean to mix up two patches. I brought it up just in case
it provides another idea about when and how to log the backtrace. So yeah,
let's discuss that patch separately.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PATCH: backtraces for error messages

2018-06-21 Thread Andres Freund
On 2018-06-22 08:24:45 +0800, Craig Ringer wrote:
> On Thu., 21 Jun. 2018, 19:26 Pavan Deolasee, 
> wrote:
> 
> >
> >
> > On Thu, Jun 21, 2018 at 11:02 AM, Michael Paquier 
> > wrote:
> >
> >> On Thu, Jun 21, 2018 at 12:35:10PM +0800, Craig Ringer wrote:
> >> > I wrote it because I got sick of Assert(false) debugging, and I was
> >> chasing
> >> > down some "ERROR:  08P01: insufficient data left in message" errors.
> >> Then I
> >> > got kind of caught up in it... you know how it is.
> >>
> >> Yes, I know that feeling!  I have been using as well the Assert(false)
> >> and the upgrade-to-PANIC tricks a couple of times, so being able to get
> >> more easily backtraces would be really nice.
> >>
> >>
> > Sometime back I'd suggested an idea to be able to dynamically manage log
> > levels for elog messages [1].
> >
> 
> 
> Huge +1 from me on being able to selectively manage logging on a
> module/subsystem, file, or line level.
> 
> I don't think I saw the post.
> 
> Such a thing would obviously make built in backtrace support much more
> useful.

I strongly suggest keeping these as separate as possible. Either is
useful without the other, and both are nontrivial. Tackling them
together imo makes it much more likely to get nowhere.

- Andres



Re: PATCH: backtraces for error messages

2018-06-21 Thread Craig Ringer
On Thu., 21 Jun. 2018, 19:26 Pavan Deolasee, 
wrote:

>
>
> On Thu, Jun 21, 2018 at 11:02 AM, Michael Paquier 
> wrote:
>
>> On Thu, Jun 21, 2018 at 12:35:10PM +0800, Craig Ringer wrote:
>> > I wrote it because I got sick of Assert(false) debugging, and I was
>> chasing
>> > down some "ERROR:  08P01: insufficient data left in message" errors.
>> Then I
>> > got kind of caught up in it... you know how it is.
>>
>> Yes, I know that feeling!  I have been using as well the Assert(false)
>> and the upgrade-to-PANIC tricks a couple of times, so being able to get
>> more easily backtraces would be really nice.
>>
>>
> Sometime back I'd suggested an idea to be able to dynamically manage log
> levels for elog messages [1].
>


Huge +1 from me on being able to selectively manage logging on a
module/subsystem, file, or line level.

I don't think I saw the post.

Such a thing would obviously make built in backtrace support much more
useful.



Re: PATCH: backtraces for error messages

2018-06-21 Thread Pavan Deolasee
On Thu, Jun 21, 2018 at 11:02 AM, Michael Paquier 
wrote:

> On Thu, Jun 21, 2018 at 12:35:10PM +0800, Craig Ringer wrote:
> > I wrote it because I got sick of Assert(false) debugging, and I was
> chasing
> > down some "ERROR:  08P01: insufficient data left in message" errors.
> Then I
> > got kind of caught up in it... you know how it is.
>
> Yes, I know that feeling!  I have been using as well the Assert(false)
> and the upgrade-to-PANIC tricks a couple of times, so being able to get
> more easily backtraces would be really nice.
>
>
Sometime back I'd suggested an idea to be able to dynamically manage log
levels for elog messages [1]. That did not evoke much response, but I
still believe some such facility will be very useful for debugging
production systems. Now I concede that the POC patch that I sent is
horribly written and has a bad UI, but those can be improved if there is
interest. Given the lack of response, though, I suspect there isn't enough
interest in the feature.

Thanks,
Pavan

[1]
https://www.postgresql.org/message-id/CABOikdMvx_Kr_b4ELhJEoeGcLTZKrDma%2BfPZpoZVdpL7Zc0bVw%40mail.gmail.com

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: PATCH: backtraces for error messages

2018-06-21 Thread Kyotaro HORIGUCHI
Hello, I basically like this.

At Thu, 21 Jun 2018 12:35:10 +0800, Craig Ringer wrote:
> This is what the stacks look like btw
> 
> 
> [2018-06-21 12:26:45.309 AWST] [7293] [] [] [:0] DEBUG:  0:
> find_in_dynamic_libpath: trying
> "/home/craig/pg/10/lib/postgresql/pglogical.so"
> [2018-06-21 12:26:45.309 AWST] [7293] [] [] [:0] LOCATION:
> find_in_dynamic_libpath, dfmgr.c:639
> [2018-06-21 12:26:45.309 AWST] [7293] [] [] [:0] STACK:
> FRAME    0: find_in_dynamic_libpath +628
> FRAME    1: expand_dynamic_library_name +205
> FRAME    2: load_external_function +38
> FRAME    3: LookupBackgroundWorkerFunction +197
> FRAME    4: StartBackgroundWorker +549
> FRAME    5: do_start_bgworker +466
> FRAME    6: maybe_start_bgworkers +382
> FRAME    7: reaper +895
> FRAME    8: funlockfile +80
> FRAME    9: __select +23
> FRAME   10: ServerLoop +394
> FRAME   11: PostmasterMain +4499
> 
> I wrote it because I got sick of Assert(false) debugging, and I was chasing
> down some "ERROR:  08P01: insufficient data left in message" errors. Then I
> got kind of caught up in it... you know how it is.
> 
> It also goes to show there are plenty of places you might want to get a
> stack where you don't have an internal errcode or panic. I don't think that
> idea will fly.

I think this is fine for assertion failures, but I'm not sure
about other cases. For known cases, we could set a proper context
description or other additional information in the error messages
before just dumping a trace.

Since PG 9.5, the RPMs are compiled with --enable-dtrace, so we
could use SystemTap to insert the stack-trace stuff. An additional
built-in probe in the error reporting system or in assertions
would make this pluggable.

However, SystemTap doesn't work for me, and I'm not sure what is
broken. (The stap version is 3.2/0.170 and uname -r shows
3.10.0-862.)

$ sudo stap -e  'probe process(".../bin/postgres").mark("transaction__start"){}'
...
LOG:  autovacuum launcher process (PID 10592) was terminated by signal 4: 
Illegal instruction


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: PATCH: backtraces for error messages

2018-06-20 Thread Michael Paquier
On Thu, Jun 21, 2018 at 12:35:10PM +0800, Craig Ringer wrote:
> I wrote it because I got sick of Assert(false) debugging, and I was chasing
> down some "ERROR:  08P01: insufficient data left in message" errors. Then I
> got kind of caught up in it... you know how it is.

Yes, I know that feeling!  I have been using as well the Assert(false)
and the upgrade-to-PANIC tricks a couple of times, so being able to get
more easily backtraces would be really nice.

> It also goes to show there are plenty of places you might want to get a
> stack where you don't have an internal errcode or panic. I don't think that
> idea will fly.

Yep.  Error message uniqueness can help, but having the call stack
leads to a better understanding of the problem.
--
Michael




Re: PATCH: backtraces for error messages

2018-06-20 Thread Craig Ringer
This is what the stacks look like btw


[2018-06-21 12:26:45.309 AWST] [7293] [] [] [:0] DEBUG:  0:
find_in_dynamic_libpath: trying
"/home/craig/pg/10/lib/postgresql/pglogical.so"
[2018-06-21 12:26:45.309 AWST] [7293] [] [] [:0] LOCATION:
find_in_dynamic_libpath, dfmgr.c:639
[2018-06-21 12:26:45.309 AWST] [7293] [] [] [:0] STACK:
FRAME    0: find_in_dynamic_libpath +628
FRAME    1: expand_dynamic_library_name +205
FRAME    2: load_external_function +38
FRAME    3: LookupBackgroundWorkerFunction +197
FRAME    4: StartBackgroundWorker +549
FRAME    5: do_start_bgworker +466
FRAME    6: maybe_start_bgworkers +382
FRAME    7: reaper +895
FRAME    8: funlockfile +80
FRAME    9: __select +23
FRAME   10: ServerLoop +394
FRAME   11: PostmasterMain +4499

I wrote it because I got sick of Assert(false) debugging, and I was chasing
down some "ERROR:  08P01: insufficient data left in message" errors. Then I
got kind of caught up in it... you know how it is.

It also goes to show there are plenty of places you might want to get a
stack where you don't have an internal errcode or panic. I don't think that
idea will fly.


Re: PATCH: backtraces for error messages

2018-06-20 Thread Craig Ringer
On 21 June 2018 at 01:15, Andres Freund  wrote:

> On 2018-06-20 13:10:57 -0400, Robert Haas wrote:
> > On Wed, Jun 20, 2018 at 12:10 PM, Andres Freund 
> wrote:
> > > If we instead had a backtrace enabled for all PANICs and some FATALs by
> > > default (and perhaps a SIGSEGV handler too), we'd be in a better
> > > place. That'd obviously only work when compiled with support for
> > > libraries, on platforms where we added support for that. But I think
> > > that'd be quite the improvement already.
> >
> > I think doing it on PANIC would be phenomenal.  SIGSEGV would be great
> > if we can make it safe enough, which I'm not sure about, but then I
> > suppose we're crashing anyway.
>
> Yea, I think that's pretty much why It'd be ok.


Yep. At worst we crash again while trying to generate a bt. We're not doing
anything particularly exciting, and it should be sensible enough.

> Instead of making the ERROR behavior conditional on
> > log_error_verbosity as Craig has it now, how about doing it whenever
> > the error code is ERRCODE_INTERNAL_ERROR?  That's pretty much the
> > cases that aren't supposed to happen, so if we see those happening a
> > lot, it's either a bug we need to fix or we should supply a better
> > error code.  Also, a lot of those messages are duplicated in many
> > places and/or occur inside fairly generic functions inside
> > lsyscache.c, so the actual error message text tends not to be enough
> > to know what happened.
>
> I don't think that's OK. It's perfectly possible to hit
> ERRCODE_INTERNAL_ERROR at a high frequency in some situations, and
> there are plenty of cases that aren't ERRCODE_INTERNAL_ERROR where
> we'd want this.


Perhaps we should fix those, but it might be a game of whack-a-mole as the
code changes, and inevitably you'll want to generate stacks for some other
errcode while getting frustrated at all the ERRCODE_INTERNAL_ERROR. Not
sure it's worth it.

However, maybe a  GUC like

log_error_stacks_errcodes = 'XX000, 55000'

would work. It'd be somewhat expensive to evaluate, but we'd only be doing
it where we'd already decided to emit an error. And it'd fit in even if we
later added smarter selective logging.
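
To make that concrete, the per-error check itself would be cheap once the
GUC value is parsed; a rough sketch (all names here are hypothetical, and
none of this is in the current patch):

#include <stdbool.h>

#define MAX_STACK_ERRCODES 16

/* Filled in by a (hypothetical) GUC assign hook that parses
 * log_error_stacks_errcodes into packed sqlerrcode values. */
static int  stack_errcodes[MAX_STACK_ERRCODES];
static int  num_stack_errcodes = 0;

static bool
errcode_wants_stack(int sqlerrcode)
{
    for (int i = 0; i < num_stack_errcodes; i++)
    {
        if (stack_errcodes[i] == sqlerrcode)
            return true;
    }
    return false;
}

/* errfinish() would then do roughly:
 *     if (errcode_wants_stack(edata->sqlerrcode))
 *         set_backtrace(edata);        -- hypothetical helper
 */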

BTW, it's worth noting that these backtraces are very limited. They don't
report arguments or locals. So it's still no substitute for suitable
errcontext callbacks, and sometimes it's still necessary to fall back to
gdb or messing around with perf userspace tracepoints.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


RE: PATCH: backtraces for error messages

2018-06-20 Thread Tsunakawa, Takayuki
From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com]
> I have no idea how expensive backtrace() and libunwind are, but I doubt
> we want to pay the cost for every message before we know that error
> requires the backtrace to be collected.  Something like PGC_USERSET
>   server_min_backtraces=PANIC
> might be a possible interface.

+1

In addition, it would be nice if we could have some filtering based on the
error condition.  Sometimes I have wanted to know where an "out of memory" or
"invalid memory alloc request size ..." error was thrown.

How about adding a GUC like the -e option of strace?  strace lets you specify
which system calls, or groups of them, to capture.  We could use the SQLSTATE
and class code/name to list the conditions to capture.  In the out-of-memory
case, I would want to specify 53200 but not get stack traces for other errors
in class 53.

https://www.postgresql.org/docs/devel/static/errcodes-appendix.html

Regards
Takayuki Tsunakawa







Re: PATCH: backtraces for error messages

2018-06-20 Thread Robert Haas
On Wed, Jun 20, 2018 at 1:15 PM, Andres Freund  wrote:
> I don't think that's ok. It's perfectly possible to hit
> ERRCODE_INTERNAL_ERROR at a high frequency in some situations,

Really?  How?

> and
> there are plenty of cases that aren't ERRCODE_INTERNAL_ERROR where
> we'd want this. E.g. a lot of generic filesystem errors have
> errcode_for_file_access(), but their messages are too generic to
> debug.  So I think we'll just need a separate GUC for things that
> aren't PANIC and haven't explicitly opted in.

Mmph.  I don't like that much.  I mean I can hold my nose, but I hope
we can find a way to do better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: backtraces for error messages

2018-06-20 Thread Daniel Gustafsson
> On 20 Jun 2018, at 17:10, Craig Ringer  wrote:

> BTW, Álvaro posted a simpler patch at 
> https://www.postgresql.org/message-id/20180410213203.nl645o5vj5ap66sl@alvherre.pgsql.
>  It uses backtrace() from glibc, and requires each site you want to bt to be 
> annotated explicitly. I forgot about backtrace() completely when I wrote 
> mine, and I prefer the control libunwind gives me anyway, but the reduced 
> dependency would be nice. Especially since backtrace() is in FreeBSD too.

Just as a datapoint regarding this: backtrace() is not available in OpenBSD,
but there is a library in ports which implements it, libexecinfo.
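
For comparison with the libunwind approach, the execinfo interface is tiny;
roughly like this (a sketch; these are the same calls libexecinfo provides):

#include <execinfo.h>
#include <stdio.h>
#include <stdlib.h>

/* Capture and print up to 32 frames with backtrace()/backtrace_symbols(). */
static void
print_backtrace_execinfo(void)
{
    void       *frames[32];
    int         nframes = backtrace(frames, 32);
    char      **symbols = backtrace_symbols(frames, nframes);

    for (int i = 0; i < nframes; i++)
        fprintf(stderr, "FRAME %2d: %s\n", i,
                symbols ? symbols[i] : "??");
    free(symbols);
}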

cheers ./daniel


Re: PATCH: backtraces for error messages

2018-06-20 Thread Andres Freund
On 2018-06-20 13:10:57 -0400, Robert Haas wrote:
> On Wed, Jun 20, 2018 at 12:10 PM, Andres Freund  wrote:
> > If we instead had a backtrace enabled for all PANICs and some FATALs by
> > default (and perhaps a SIGSEGV handler too), we'd be in a better
> > place. That'd obviously only work when compiled with support for
> > libraries, on platforms where we added support for that. But I think
> > that'd be quite the improvement already.
> 
> I think doing it on PANIC would be phenomenal.  SIGSEGV would be great
> if we can make it safe enough, which I'm not sure about, but then I
> suppose we're crashing anyway.

Yea, I think that's pretty much why It'd be ok.


> Instead of making the ERROR behavior conditional on
> log_error_verbosity as Craig has it now, how about doing it whenever
> the error code is ERRCODE_INTERNAL_ERROR?  That's pretty much the
> cases that aren't supposed to happen, so if we see those happening a
> lot, it's either a bug we need to fix or we should supply a better
> error code.  Also, a lot of those messages are duplicated in many
> places and/or occur inside fairly generic functions inside
> lsyscache.c, so the actual error message text tends not to be enough
> to know what happened.

I don't think that's OK. It's perfectly possible to hit
ERRCODE_INTERNAL_ERROR at a high frequency in some situations, and
there are plenty of cases that aren't ERRCODE_INTERNAL_ERROR where
we'd want this. E.g. a lot of generic filesystem errors have
errcode_for_file_access(), but their messages are too generic to
debug.  So I think we'll just need a separate GUC for things that
aren't PANIC and haven't explicitly opted in.

Greetings,

Andres Freund



Re: PATCH: backtraces for error messages

2018-06-20 Thread Robert Haas
On Wed, Jun 20, 2018 at 12:10 PM, Andres Freund  wrote:
> I think the problem is that this most frequently comes up when
> investigating an issue for customers. Often enough it'll legally not be
> possible to gain direct access to the system, and remotely instructing
> people to install an extension and configure it isn't great.  So if we
> could, by default, include something better for PANICs etc., it'd be a
> great boon - I think that's clear even from conversations on the pg
> lists (which often involve the more knowledgeable people): how often did
> we try hard to get a backtrace from a crash?

+1 to all of that.  This is a real nuisance, and making it less of a
nuisance would be great.

> If we instead had a backtrace enabled for all PANICs and some FATALs by
> default (and perhaps a SIGSEGV handler too), we'd be in a better
> place. That'd obviously only work when compiled with support for
> libraries, on platforms where we added support for that. But I think
> that'd be quite the improvement already.

I think doing it on PANIC would be phenomenal.  SIGSEGV would be great
if we can make it safe enough, which I'm not sure about, but then I
suppose we're crashing anyway.  Instead of making the ERROR behavior
conditional on log_error_verbosity as Craig has it now, how about
doing it whenever the error code is ERRCODE_INTERNAL_ERROR?  That's
pretty much the cases that aren't supposed to happen, so if we see
those happening a lot, it's either a bug we need to fix or we should
supply a better error code.  Also, a lot of those messages are
duplicated in many places and/or occur inside fairly generic functions
inside lsyscache.c, so the actual error message text tends not to be
enough to know what happened.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PATCH: backtraces for error messages

2018-06-20 Thread Andres Freund
On 2018-06-20 12:04:51 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2018-06-20 11:20:49 -0400, Alvaro Herrera wrote:
> >> I have no idea how expensive backtrace() and libunwind are, but I doubt
> >> we want to pay the cost for every message before we know that error
> >> requires the backtrace to be collected.  Something like PGC_USERSET
> >> server_min_backtraces=PANIC 
> >> might be a possible interface.
> 
> > Yes, most definitely. We can't do this everywhere. It's quite expensive
> > to collect / build them.  So I think we'd probably need another guc that
> > controls when the information is collected, perhaps defaulting to PANIC.
> 
> The cost is a problem, and the dependencies on various additional
> libraries are too.  I wonder whether anything could be gained by
> putting this stuff in an extension?  Then we could have different
> extensions for different backtrace libraries, and loading the extension
> or not would be one avenue to control whether you were paying the cost.
> (Though some control GUCs might be needed anyway.)

I think the problem is that this most frequently comes up when
investigating an issue for customers. Often enough it'll legally not be
possible to gain direct access to the system, and remotely instructing
people to install an extension and configure it isn't great.  So if we
could, by default, include something better for PANICs etc., it'd be a
great boon - I think that's clear even from conversations on the pg
lists (which often involve the more knowledgeable people): how often did
we try hard to get a backtrace from a crash?

If we instead had a backtrace enabled for all PANICs and some FATALs by
default (and perhaps a SIGSEGV handler too), we'd be in a better
place. That'd obviously only work when compiled with support for
libraries, on platforms where we added support for that. But I think
that'd be quite the improvement already.

Greetings,

Andres Freund



Re: PATCH: backtraces for error messages

2018-06-20 Thread Tom Lane
Andres Freund  writes:
> On 2018-06-20 11:20:49 -0400, Alvaro Herrera wrote:
>> I have no idea how expensive backtrace() and libunwind are, but I doubt
>> we want to pay the cost for every message before we know that error
>> requires the backtrace to be collected.  Something like PGC_USERSET
>> server_min_backtraces=PANIC 
>> might be a possible interface.

> Yes, most definitely. We can't do this everywhere. It's quite expensive
> to collect / build them.  So I think we'd probably need another guc that
> controls when the information is collected, perhaps defaulting to PANIC.

The cost is a problem, and the dependencies on various additional
libraries are too.  I wonder whether anything could be gained by
putting this stuff in an extension?  Then we could have different
extensions for different backtrace libraries, and loading the extension
or not would be one avenue to control whether you were paying the cost.
(Though some control GUCs might be needed anyway.)

regards, tom lane



Re: PATCH: backtraces for error messages

2018-06-20 Thread Andres Freund
Hi,

On 2018-06-20 11:20:49 -0400, Alvaro Herrera wrote:
> > I recently needed a way to get backtraces from errors in a convenient,
> > non-interactive and indiscriminate way. The attached patch is the result.
> > It teaches Pg to use libunwind to self-backtrace in elog/ereport.
> > 
> > Anyone think this is useful/interesting/worth pursuing?

Generally interesting, yes.


> I think we sorely need some mechanism to optionally get backtraces in
> error messages.  I think having backtraces in all messages is definitely
> not a good thing, but requiring an explicit marker (such as in my patch)
> means the code has to be recompiled, which is not a solution in
> production systems.  I kind of lean towards your approach, but it has to be
> something that's easily enabled/disabled at runtime.

> I have no idea how expensive backtrace() and libunwind are, but I doubt
> we want to pay the cost for every message before we know that error
> requires the backtrace to be collected.  Something like PGC_USERSET
>   server_min_backtraces=PANIC 
> might be a possible interface.

Yes, most definitely. We can't do this everywhere. It's quite expensive
to collect / build them.  So I think we'd probably need another guc that
controls when the information is collected, perhaps defaulting to PANIC.

Greetings,

Andres Freund



Re: PATCH: backtraces for error messages

2018-06-20 Thread Alexander Kuzmenkov

On 06/20/2018 06:10 PM, Craig Ringer wrote:


I recently needed a way to get backtraces from errors in a convenient, 
non-interactive and indiscriminate way. The attached patch is the 
result. It teaches Pg to use libunwind to self-backtrace in elog/ereport.


Anyone think this is useful/interesting/worth pursuing?



This would be a great thing to have. I often need to add stack traces to 
the logs, and for now I just link with libunwind and add some ad-hoc 
function to grab the traces. Having libunwind support in core would make 
this simpler.



--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: PATCH: backtraces for error messages

2018-06-20 Thread Alvaro Herrera
On 2018-Jun-20, Craig Ringer wrote:

> Hi folks
> 
> I recently needed a way to get backtraces from errors in a convenient,
> non-interactive and indiscriminate way. The attached patch is the result.
> It teaches Pg to use libunwind to self-backtrace in elog/ereport.
> 
> Anyone think this is useful/interesting/worth pursuing?

I think we sorely need some mechanism to optionally get backtraces in
error messages.  I think having backtraces in all messages is definitely
not a good thing, but requiring an explicit marker (such as in my patch)
means the code has to be recompiled, which is not a solution in
production systems.  I kind of lean towards your approach, but it has to be
something that's easily enabled/disabled at runtime.

I have no idea how expensive backtrace() and libunwind are, but I doubt
we want to pay the cost for every message before we know that error
requires the backtrace to be collected.  Something like PGC_USERSET
  server_min_backtraces=PANIC 
might be a possible interface.
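
As a sketch of how such a threshold might gate collection inside elog.c
(the GUC name and the helper are purely hypothetical, mirroring how
log_min_messages gates output):

#include "postgres.h"       /* brings in elog.h with PANIC etc. */

/* Hypothetical PGC_USERSET GUC, defaulting to PANIC. */
int         server_min_backtraces = PANIC;

/* Would be consulted in errfinish() before emitting the report. */
static bool
backtrace_wanted(int elevel)
{
    return elevel >= server_min_backtraces;
}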

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services