Re: The plot thickens (Re: Segmentation fault cause)

2020-08-28 Thread Stephen R. van den Berg
The fixes in master work, in that there are no segmentation faults anymore,
however, now I regularly get this:

(gdb) where
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x7f010403c537 in __GI_abort () at abort.c:79
#2  0x56469288b3c0 in debug_va_fatal (
fmt=0x564692a7c4b8 "Recursive Pike_error() calls, original error: %s, new 
error: %s", args=args@entry=0x7f0101258768) at /home/srb/pike/src/error.c:575
#3  0x564692889bf6 in debug_fatal (
fmt=fmt@entry=0x564692a7c4b8 "Recursive Pike_error() calls, original error: 
%s, new error: %s") at /home/srb/pike/src/error.c:583
#4  0x56469288a621 in va_error (
fmt=0x564692a8ba50 "Cannot access variables in destructed object.\n", 
args=args@entry=0x7f0101258868) at /home/srb/pike/src/error.c:289
#5  0x56469288a859 in Pike_error (
fmt=fmt@entry=0x564692a8ba50 "Cannot access variables in destructed 
object.\n") at /home/srb/pike/src/error.c:332
#6  0x5646928f83af in low_object_index_no_free (to=, 
o=, f=) at /home/srb/pike/src/object.c:1373
#7  0x7f01010e9924 in ?? ()
#8  0x0301 in ?? ()
#9  0x7f01010e93b8 in ?? ()
#10 0x0019 in ?? ()
#11 0x in ?? ()
(gdb) 

Can we somehow get rid the recursion in Pike_error() in this case?
-- 
Stephen.


Re: The plot thickens (Re: Segmentation fault cause)

2020-07-23 Thread Tobias S. Josefowitz @ Pike developers forum
>>#ifdef PIKE_USE_MACHINE_CODE
>>  call_check_threads_etc();
>>#endif
>
>>I assume you have machine code enabled. call_check_threads_etc()
>
>Normally I do, however, to get to the bottom of this, I temporarily compiled
>with:
>gcc-9 -g -O1 -pipe -DPIKE_DEBUG=1
>  --with-cdebug --with-rtldebug --with-valgrind
>  --with-double-precision --with-long-int \
>  --disable-noopty-retry \
>  --without-machine-code \
>  --with-poll \
>  --with-portable-bytecode \

Well, that actually makes it more interesting (in you could say a bad
sense), because without that call to call_check_threads_etc() we once
more have no explanation for how the object could have been destructed
between interpret.c:2403 and interpret.c:2506.

Actually no,

2503 #ifndef PIKE_USE_MACHINE_CODE
2504   FAST_CHECK_THREADS_ON_CALL();
2505 #endif

would then be to blame. So ok, not so mysterious.

>
>>Now, what to do about it... indeed check at every function entry that
>>we're not destructed? I don't know, but that just doesn't feel
>>cool.
>
>Actually, it seems like Pike actually does this check before *every* access
>of a local variable.  I.e. in those numerous cases I had to deal with in pgsql
>I invariably got an exception like "lookup in destructed object".
>With that in mind, a check upon function entry does not sound so bad.

Well, I mean, so either we can manually do it in every "C function",
or we can just have "Pike" prevent it from happening (I'm assuming
here there aren't any cases in which we actualy want or even need to
call functions in destructed objects). In which case I'd prefer to not
having to manually check in each case if

* my function might actually be safe to be called in a destructed
  object, or
* my object is destructed.

>
>P.S. Speaking about asynchronous destructs.  One of the "fun" things
> I discovered about three months ago was that since the actual destruct()
> method can also be called while being inside a totally random stackframe
> (very far beyond the stackframe where the object scope actually already
> ended), it is quite hazardous to try to acquire any kind of mutex
> from within a destruct() method, since it can result in very random
> and very rare deadlock situations.
>-- 
>Stephen.
>


Re: The plot thickens (Re: Segmentation fault cause)

2020-07-23 Thread Stephen R. van den Berg
Another core stacktrace (core5):

[New LWP 31798]
[New LWP 28691]
[New LWP 31695]
[New LWP 31738]
[New LWP 31703]
[New LWP 31730]
[New LWP 31707]
[New LWP 31717]
[New LWP 31711]
[New LWP 31725]
[New LWP 31699]
[New LWP 31673]
[New LWP 31367]
[New LWP 31799]
[New LWP 31694]
[New LWP 31715]
[New LWP 31742]
[New LWP 31800]
[New LWP 31801]
[New LWP 31721]
[New LWP 31731]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/local/bin/pike /home/spike.git/spike -n background'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7ff85afff700 (LWP 31798))]
(gdb) where
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x7ff85d2ce537 in __GI_abort () at abort.c:79
#2  0x564bfdab76b9 in debug_va_fatal (
fmt=0x564bfdd43bc8 "Stdio.Buffer already destructed.\n", 
args=args@entry=0x7ff85affe128) at /home/srb/pike/src/error.c:575
#3  0x564bfdab5339 in debug_fatal (
fmt=fmt@entry=0x564bfdd43bc8 "Stdio.Buffer already destructed.\n")
at /home/srb/pike/src/error.c:583
#4  0x564bfdcc5062 in already_destructed ()
at /home/srb/pike/src/modules/_Stdio/buffer.cmod:891
#5  0x564bfdccc2f7 in f_Buffer_output_to (args=1)
at /home/srb/pike/src/modules/_Stdio/buffer.cmod:1112
#6  0x564bfda1f9a6 in lower_mega_apply (args=1, o=0x564bff5f6700, fun=7)
at /home/srb/pike/src/interpret.c:2506
#7  0x564bfda2b2fd in eval_instruction_without_debug (
pc=0x564bff890c44 "8\002\\A", pc@entry=0x564bff33568c "O")
at /home/srb/pike/src/interpret_functions.h:2345
#8  0x564bfda5d8d3 in eval_instruction (pc=0x564bff33568c "O")
at /home/srb/pike/src/interpret.c:2012
#9  catching_eval_instruction (pc=pc@entry=0x564bff33568c "O")
at /home/srb/pike/src/interpret.c:3082
#10 0x564bfda27e17 in eval_instruction_without_debug (pc=)
at /home/srb/pike/src/interpret_functions.h:1502
#11 0x564bfda2f039 in eval_instruction (pc=)
at /home/srb/pike/src/interpret.c:2012
--Type  for more, q to quit, c to continue without paging--
#12 mega_apply (type=type@entry=APPLY_STACK, args=args@entry=1, 
arg1=arg1@entry=0x0, arg2=arg2@entry=0x0)
at /home/srb/pike/src/interpret.c:3023
#13 0x564bfda5e70d in f_call_function (args=args@entry=1)
at /home/srb/pike/src/interpret.c:3103
#14 0x564bfdbba12d in new_thread_func (data=)
at /home/srb/pike/src/threads.c:1832
#15 0x7ff85d478ea7 in start_thread (arg=)
at pthread_create.c:477
#16 0x7ff85d3a6dcf in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

-- 
Stephen.


Re: The plot thickens (Re: Segmentation fault cause)

2020-07-23 Thread Stephen R. van den Berg
Tobias S. Josefowitz @ Pike developers forum wrote:
>>Am I reading those correctly that both are upon thread creation?

>First, nice backtrace. I think it's more that any thread started ever
>will have been started by thread creation, and that's what we're
>seeing here, not that this necessarily immediately follows thread
>creation.

That was the other explanation, but since I don't know enough about the
innards of Pike in this respect, I wasn't quite sure.

>#ifdef PIKE_USE_MACHINE_CODE
>  call_check_threads_etc();
>#endif

>I assume you have machine code enabled. call_check_threads_etc()

Normally I do, however, to get to the bottom of this, I temporarily compiled
with:
gcc-9 -g -O1 -pipe -DPIKE_DEBUG=1
  --with-cdebug --with-rtldebug --with-valgrind
  --with-double-precision --with-long-int \
  --disable-noopty-retry \
  --without-machine-code \
  --with-poll \
  --with-portable-bytecode \

>indeed may schedule other threads and stuff and we may return from it
>with the object we just looked up the identifier in destructed. When
>we then call the identifier, we will indeed call into a destructed
>object.

I already had the distinct impression this was happening.  In the pgsql driver
I've had numerous issues over the years where I had to cover for methods
running in destructed objects (because the driver is asynchronous to the bone).
But in most cases this did not result in segfaults, so I'm not quite sure
if the rest of the Pike system is more robust against this naturally.
I have one unexplained segfault there of about six months ago, but that was
so hard to reproduce, that it might as well have been the same problem I'm
chasing now.

>Now, what to do about it... indeed check at every function entry that
>we're not destructed? I don't know, but that just doesn't feel
>cool.

Actually, it seems like Pike actually does this check before *every* access
of a local variable.  I.e. in those numerous cases I had to deal with in pgsql
I invariably got an exception like "lookup in destructed object".
With that in mind, a check upon function entry does not sound so bad.

P.S. Speaking about asynchronous destructs.  One of the "fun" things
 I discovered about three months ago was that since the actual destruct()
 method can also be called while being inside a totally random stackframe
 (very far beyond the stackframe where the object scope actually already
 ended), it is quite hazardous to try to acquire any kind of mutex
 from within a destruct() method, since it can result in very random
 and very rare deadlock situations.
-- 
Stephen.


The plot thickens (Re: Segmentation fault cause)

2020-07-23 Thread Tobias S. Josefowitz @ Pike developers forum
>[New LWP 15651]
>[New LWP 15621]
>[New LWP 13738]
>[New LWP 15642]
>[Thread debugging using libthread_db enabled]
>Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>Core was generated by `/usr/local/bin/pike /home/spike.git/spike -n 
>background'.
>Program terminated with signal SIGABRT, Aborted.
>#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
>[Current thread is 1 (Thread 0x7f85113ec700 (LWP 15651))]
>(gdb) where
>#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>#1  0x7f85140e055b in __GI_abort () at abort.c:79
>#2  0x561705dfb6b9 in debug_va_fatal (
>fmt=0x561706087bc8 "Stdio.Buffer already destructed.\n", 
>args=args@entry=0x7f85113eb1b8) at /home/srb/pike/src/error.c:575
>#3  0x561705df9339 in debug_fatal (
>fmt=fmt@entry=0x561706087bc8 "Stdio.Buffer already destructed.\n")
>at /home/srb/pike/src/error.c:583
>#4  0x561706009062 in already_destructed ()
>at /home/srb/pike/src/modules/_Stdio/buffer.cmod:891
>#5  0x56170600f7ed in f_Buffer_add (args=)
>at /home/srb/pike/src/modules/_Stdio/buffer.cmod:1227
>#6  0x561705d639a6 in lower_mega_apply (args=args@entry=1, 
>o=o@entry=0x561706e266e0, fun=11) at /home/srb/pike/src/interpret.c:2506
>#7  0x561705d6fd91 in eval_instruction_without_debug (
>pc=0x561706d776c8 
> "\035\232\337\030-\n\022\336\031\070\n]\177W\001\001\070", 
>pc@entry=0x561706d776ac "O") at 
> /home/srb/pike/src/interpret_functions.h:2424
>#8  0x561705da18d3 in eval_instruction (pc=0x561706d776ac "O")
>at /home/srb/pike/src/interpret.c:2012
>#9  catching_eval_instruction (pc=pc@entry=0x561706d776ac "O")
>at /home/srb/pike/src/interpret.c:3082
>#10 0x561705d6be17 in eval_instruction_without_debug (pc=)
>at /home/srb/pike/src/interpret_functions.h:1502
>#11 0x561705d73039 in eval_instruction (pc=)
>at /home/srb/pike/src/interpret.c:2012
>--Type  for more, q to quit, c to continue without paging-- 
>#12 mega_apply (type=type@entry=APPLY_STACK, args=args@entry=1, 
>arg1=arg1@entry=0x0, arg2=arg2@entry=0x0)
>at /home/srb/pike/src/interpret.c:3023
>#13 0x561705da270d in f_call_function (args=args@entry=1)
>at /home/srb/pike/src/interpret.c:3103
>#14 0x561705efe12d in new_thread_func (data=)
>at /home/srb/pike/src/threads.c:1832
>#15 0x7f8514288f27 in start_thread (arg=)
>at pthread_create.c:479
>#16 0x7f85141b831f in clone ()
>at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>
>It calls add() on a destructed Buffer object.
>
>Am I reading those correctly that both are upon thread creation?

First, nice backtrace. I think it's more that any thread started ever
will have been started by thread creation, and that's what we're
seeing here, not that this necessarily immediately follows thread
creation.

Second, if we have a look at lower_mega_apply (#6), we find that it
looks up the identifier in the object first, then does

#ifdef PIKE_USE_MACHINE_CODE
  call_check_threads_etc();
#endif

I assume you have machine code enabled. call_check_threads_etc()
indeed may schedule other threads and stuff and we may return from it
with the object we just looked up the identifier in destructed. When
we then call the identifier, we will indeed call into a destructed
object.

I am unsure about how exactly not that great that is, maybe grubba has
some thoughts on this...

So, before calling call_check_threads_etc() we set up our new
stackframe, meaning we keep a reference on both the object and program
in question. This means our storage, i.e. CVAR Buffer b; aka Buffer
*io; will not be freed even though the object is destructed, and
*keeps being valid*, except in so far as our actions in EXIT
contradict this notion (especially io_unlink_external_storage( io );
and free( io->buffer ); probaly indeed make it really unsafe to
execute our PIKEFUNs after destruction).

Now, what to do about it... indeed check at every function entry that
we're not destructed? I don't know, but that just doesn't feel
cool. Prevent functions being called in destructed objects
more/better?


The plot thickens (Re: Segmentation fault cause)

2020-07-23 Thread Stephen R. van den Berg
Stephen R. van den Berg wrote:
>Well, this experiment failed.  I took out my reverts locally, and now
>changed the diagnostic Pike_error() in case of the already-destructed call
>into a Pike_fatal(), in hopes of finding out when it is being triggered
>from the next coredump.

This is what I find:

Coredump 1:
[New LWP 13725]
[New LWP 13648]
[New LWP 13665]
[New LWP 32741]
[New LWP 13572]
[New LWP 13650]
[New LWP 13634]
[New LWP 13726]
[New LWP 13668]
[New LWP 13671]
[New LWP 13653]
[New LWP 13656]
[New LWP 13660]
[New LWP 13727]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/local/bin/pike /home/spike.git/spike -n background'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7ff24b488700 (LWP 13725))]
(gdb) where
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x7ff24fbda55b in __GI_abort () at abort.c:79
#2  0x55a1069066b9 in debug_va_fatal (
fmt=0x55a106b92bc8 "Stdio.Buffer already destructed.\n", 
args=args@entry=0x7ff24b4871b8) at /home/srb/pike/src/error.c:575
#3  0x55a106904339 in debug_fatal (
fmt=fmt@entry=0x55a106b92bc8 "Stdio.Buffer already destructed.\n")
at /home/srb/pike/src/error.c:583
#4  0x55a106b14062 in already_destructed ()
at /home/srb/pike/src/modules/_Stdio/buffer.cmod:891
#5  0x55a106b161ae in f_Buffer_cq__sizeof (args=)
at /home/srb/pike/src/modules/_Stdio/buffer.cmod:1596
#6  0x55a10686e9a6 in lower_mega_apply (args=args@entry=0, 
o=o@entry=0x55a10843d640, fun=fun@entry=21)
at /home/srb/pike/src/interpret.c:2506
#7  0x55a1068ac990 in mega_apply_low (args=args@entry=0, 
arg1=0x55a10843d640, arg2=21) at /home/srb/pike/src/interpret.c:3048
#8  0x55a106a5c9b7 in pike_sizeof (s=0x7ff24dcdd160)
at /home/srb/pike/src/svalue.c:2638
#9  0x55a106879fed in eval_instruction_without_debug (
pc=0x55a107a65f0b "k\b", pc@entry=0x55a107c4cb0c "O")
at /home/srb/pike/src/interpret_functions.h:2190
#10 0x55a1068ac8d3 in eval_instruction (pc=0x55a107c4cb0c "O")
at /home/srb/pike/src/interpret.c:2012
#11 catching_eval_instruction (pc=pc@entry=0x55a107c4cb0c "O")
--Type  for more, q to quit, c to continue without paging--
at /home/srb/pike/src/interpret.c:3082
#12 0x55a106876e17 in eval_instruction_without_debug (pc=)
at /home/srb/pike/src/interpret_functions.h:1502
#13 0x55a10687e039 in eval_instruction (pc=)
at /home/srb/pike/src/interpret.c:2012
#14 mega_apply (type=type@entry=APPLY_STACK, args=args@entry=1, 
arg1=arg1@entry=0x0, arg2=arg2@entry=0x0)
at /home/srb/pike/src/interpret.c:3023
#15 0x55a1068ad70d in f_call_function (args=args@entry=1)
at /home/srb/pike/src/interpret.c:3103
#16 0x55a106a0912d in new_thread_func (data=)
at /home/srb/pike/src/threads.c:1832
#17 0x7ff24fd82f27 in start_thread (arg=)
at pthread_create.c:479
#18 0x7ff24fcb231f in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95


I lost the coredump on this one (overwritten by the next).
It hits sizeof() a buffer object that has been destroyed.

Then we have the following next one (I saved the core, so I can get
more information from this one):

[New LWP 15651]
[New LWP 15621]
[New LWP 13738]
[New LWP 15642]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/local/bin/pike /home/spike.git/spike -n background'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f85113ec700 (LWP 15651))]
(gdb) where
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x7f85140e055b in __GI_abort () at abort.c:79
#2  0x561705dfb6b9 in debug_va_fatal (
fmt=0x561706087bc8 "Stdio.Buffer already destructed.\n", 
args=args@entry=0x7f85113eb1b8) at /home/srb/pike/src/error.c:575
#3  0x561705df9339 in debug_fatal (
fmt=fmt@entry=0x561706087bc8 "Stdio.Buffer already destructed.\n")
at /home/srb/pike/src/error.c:583
#4  0x561706009062 in already_destructed ()
at /home/srb/pike/src/modules/_Stdio/buffer.cmod:891
#5  0x56170600f7ed in f_Buffer_add (args=)
at /home/srb/pike/src/modules/_Stdio/buffer.cmod:1227
#6  0x561705d639a6 in lower_mega_apply (args=args@entry=1, 
o=o@entry=0x561706e266e0, fun=11) at /home/srb/pike/src/interpret.c:2506
#7  0x561705d6fd91 in eval_instruction_without_debug (
pc=0x561706d776c8 
"\035\232\337\030-\n\022\336\031\070\n]\177W\001\001\070", 
pc@entry=0x561706d776ac "O") at