Re: [racket-users] Another futures-related bug hunt

2020-05-09 Thread Dominik Pantůček

Hello,



I'll add a lock at lines 1092-1096 of "newgc.c", and we'll see if that
helps.


should I open the issue or will you do it? (Speaking of race conditions...).

I'll re-run the tests with the lock once it is in the repo - sometimes 
it takes hours for this bug to show up, and with 8 HTs the process in 
question consumes slightly more than 500% of CPU time - which means the 
computer sounds like it's going to take off and fly. I'll keep it up and 
running overnight again.



And thank you for the explanation - digging into Racket internals has a 
widely varying degree of difficulty :)



Cheers,
Dominik



Re: [racket-users] Another futures-related bug hunt

2020-05-09 Thread Matthew Flatt
At Sat, 9 May 2020 07:18:01 +0200, Dominik Pantůček wrote:
> would this be enough to open an issue for that?
> 
> (gdb) info threads
>   Id   Target Id                                   Frame
> * 1    Thread 0x77c1b300 (LWP 19075) "tut22.rkt"   mark_backpointers (gc=gc@entry=0x559d10c0)
>            at ../../../racket/gc2/newgc.c:4078

Yes, this might identify the problem. Being stuck in a linked-list
iteration often means that there was a race updating the list.

The GC's write barrier is implemented by write-protecting pages and
handling SIGSEGV to record the modification (and remove write
protection until the next GC). If that handler is called in different
future threads, though, then there's currently a race on the list of
modified pages.
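
(As a purely hypothetical Racket-level illustration - this is a sketch,
not the program under discussion, and it may or may not reproduce the
hang: any future that stores pointers into a long-lived object whose
page has been write-protected takes that fault, and therefore runs the
handler, on its own OS thread.)

#lang racket
(require racket/future racket/unsafe/ops)

;; Sketch only: several futures keep storing freshly allocated pairs into
;; a long-lived shared vector.  Once the GC write-protects the vector's
;; pages, each such store faults and the SIGSEGV write-barrier handler
;; runs on whichever OS thread that future happens to occupy.
(define shared (make-vector 4000000 #f))

(define (mutate! offset)
  (for ([i (in-range offset (vector-length shared) 8)])
    (unsafe-vector-set! shared i (cons i i))))

(define fs
  (for/list ([k (in-range 7)])
    (future (lambda () (mutate! k)))))
(mutate! 7)                ; one share of the work in the runtime thread
(for-each touch fs)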

This race doesn't happen with places, because different places have
different GC instances. And it won't happen on Mac OS, because the
fault is handled at the Mach layer, which routes exceptions for all
threads to a single handler thread.

I'll add a lock at lines 1092-1096 of "newgc.c", and we'll see if that
helps.

Thanks very much for your help!
Matthew



Re: [racket-users] Another futures-related bug hunt

2020-05-08 Thread Dominik Pantůček

Hello,



The most useful information here is likely to be a stack trace from
each OS-level thread at the point where the application is stuck.



would this be enough to open an issue for that?

(gdb) info threads
  Id   Target Id                                       Frame
* 1    Thread 0x77c1b300 (LWP 19075) "tut22.rkt"       mark_backpointers (gc=gc@entry=0x559d10c0)
           at ../../../racket/gc2/newgc.c:4078
  2    Thread 0x77fcb700 (LWP 19076) "tut22.rkt"       futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x559d7d78)
           at ../sysdeps/unix/sysv/linux/futex-internal.h:80
  3    Thread 0x7fffe65a6700 (LWP 19077) "gmain"        0x77d34c2f in __GI___poll (fds=0x55b82520, nfds=2, timeout=-1)
           at ../sysdeps/unix/sysv/linux/poll.c:29
  4    Thread 0x7fffe5da5700 (LWP 19078) "gdbus"        0x77d34c2f in __GI___poll (fds=0x55b94ce0, nfds=3, timeout=-1)
           at ../sysdeps/unix/sysv/linux/poll.c:29
  7    Thread 0x7fffd77fe700 (LWP 19082) "dconf worker"  0x77d34c2f in __GI___poll (fds=0x55e9e5e0, nfds=1, timeout=-1)
           at ../sysdeps/unix/sysv/linux/poll.c:29
  8    Thread 0x7fffe40d4800 (LWP 19083) "tut22.rkt"    futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x559c499c)
           at ../sysdeps/unix/sysv/linux/futex-internal.h:80
  9    Thread 0x7fffd4602800 (LWP 19084) "tut22.rkt"    futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x559c499c)
           at ../sysdeps/unix/sysv/linux/futex-internal.h:80
  10   Thread 0x7fffd4586800 (LWP 19085) "tut22.rkt"    futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x559c499c)
           at ../sysdeps/unix/sysv/linux/futex-internal.h:80
  11   Thread 0x7fffd450a800 (LWP 19086) "tut22.rkt"    futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x559c499c)
           at ../sysdeps/unix/sysv/linux/futex-internal.h:80
  12   Thread 0x7fffd448e800 (LWP 19087) "tut22.rkt"    futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x559c499c)
           at ../sysdeps/unix/sysv/linux/futex-internal.h:80
  13   Thread 0x7fffd4412800 (LWP 19088) "tut22.rkt"    futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x559c499c)
           at ../sysdeps/unix/sysv/linux/futex-internal.h:80
  14   Thread 0x7fffd4396800 (LWP 19089) "tut22.rkt"    futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x559c499c)
           at ../sysdeps/unix/sysv/linux/futex-internal.h:80
  15   Thread 0x7fffd431a800 (LWP 19090) "tut22.rkt"    futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x559c499c)
           at ../sysdeps/unix/sysv/linux/futex-internal.h:80
  16   Thread 0x765b2800 (LWP 21691) "tut22.rkt"        futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x559c4998)
           at ../sysdeps/unix/sysv/linux/futex-internal.h:80
(gdb) bt
#0  0x557f5064 in mark_backpointers (gc=gc@entry=0x559d10c0) at ../../../racket/gc2/newgc.c:4078
#1  0x557edb2b in garbage_collect (gc=gc@entry=0x559d10c0, force_full=force_full@entry=0,
    no_full=no_full@entry=0, switching_master=switching_master@entry=0, lmi=lmi@entry=0x0)
    at ../../../racket/gc2/newgc.c:5646
#2  0x557f0ff2 in collect_now (nomajor=<optimized out>, major=<optimized out>, gc=<optimized out>)
    at ../../../racket/gc2/newgc.c:875
#3  0x557f0ff2 in collect_now (gc=0x559d10c0, major=0, nomajor=0) at ../../../racket/gc2/newgc.c:855
#4  0x557f9124 in allocate_slowpath (newptr=<optimized out>, allocate_size=<optimized out>, gc=<optimized out>)
    at ../../../racket/gc2/newgc.c:1607
#5  0x557f9124 in allocate (type=1, request_size=<optimized out>) at ../../../racket/gc2/newgc.c:1671
#6  0x557f9124 in allocate (type=<optimized out>, request_size=<optimized out>) at ../../../racket/gc2/newgc.c:1636
#7  0x557f9124 in GC_malloc_atomic (s=<optimized out>) at ../../../racket/gc2/newgc.c:1792
#8  0x557f9124 in GC_malloc_atomic (s=<optimized out>) at ../../../racket/gc2/newgc.c:1792
#9  0x55605406 in prepare_retry_alloc (p2=<optimized out>, p=<optimized out>) at ../../../racket/gc2/../src/jitalloc.c:47
#10 0x55605406 in ts_prepare_retry_alloc (p=<optimized out>, p2=<optimized out>) at ../../../racket/gc2/../src/jitalloc.c:73
#11 0x77fbe62b in  ()
#12 0x in  ()
(gdb) thread 2
[Switching to thread 2 (Thread 0x77fcb700 (LWP 19076))]
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x559d7d78)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:80
80  ../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) bt
#0  0x77e202c6 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x559d7d78)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:80
#1  0x77e202c6 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x559d7d28, cond=0x559d7d50)
    at pthread_cond_wait.c:508
#2  0x77e202c6 in __pthread_cond_wait (cond=cond@entry=0x559d7d50, mutex=mutex@entry=0x559d7d28)
    at pthread_cond_wait.c:638
#3  0x557209fe in green_thread_timer (data=data@entry=0x559d7d10) at ../../../racket/gc2/../src/port.c:6659
#4  0x556bb8be in mzrt_thread_stub (data=0x559d7dc0) at

Re: [racket-users] Another futures-related bug hunt

2020-05-08 Thread Sam Tobin-Hochstadt
You will want to do `handle SIGSEGV nostop noprint` when you start
gdb.  Racket BC uses the SEGV handler to implement the GC write
barrier, so you'll want to skip those.

Sam

On Fri, May 8, 2020 at 9:36 AM Dominik Pantůček
 wrote:
>
> Hello,
>
> On 08. 05. 20 14:27, Matthew Flatt wrote:
> > At Fri, 8 May 2020 09:34:32 +0200, Dominik Pantůček wrote:
> >> Apart from obvious strace (after freeze) and gdb (before/after freeze)
> >> debugging to find possible sources of this bug, is there even a remote
> >> possibility of getting any clue as to how this can happen based on the
> >> information gathered so far? My thoughts go along these lines:
> >>
> >> * flonums are boxed - but for some operations they may be immediate
> >> * apparently it is a busy-wait loop in the RTT, otherwise 100% CPU usage is
> >> impossible with this workload
> >> * unsafe ops are always suspicious, but again, the problem shows up even
> >> when I switch to the safe versions - it just takes longer
> >> * which means the most probable cause is a race condition
> >
> > The most useful information here is likely to be a stack trace from
> > each OS-level thread at the point where the application is stuck.
> >
> > That could potentially tell us, for example, that it's a problem with
> > synchronization for a GC (where one of the OS threads that run futures
> > doesn't cooperate for some reason) or a problem with the main thread
> > performing some specific work on a future thread's behalf.
> >
>
> I am using the build from master branch with the patch for #3145 and
> cannot make it run under gdb:
>
> $ gdb ../racket-lang/racket/racket/bin/racket
> GNU gdb (Ubuntu 8.3-0ubuntu1) 8.3
> Copyright (C) 2019 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.
> Type "show copying" and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>.
> Find the GDB manual and other documentation resources online at:
> <http://www.gnu.org/software/gdb/documentation/>.
>
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from ../racket-lang/racket/racket/bin/racket...
> (No debugging symbols found in ../racket-lang/racket/racket/bin/racket)
> (gdb) run
> Starting program:
> /home/joe/Projects/Programming/racket-lang/racket/racket/bin/racket
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> Welcome to Racket v7.7.0.4.
> [New Thread 0x77fcb700 (LWP 6410)]
>
> Thread 1 "racket" received signal SIGSEGV, Segmentation fault.
> 0x555e14fe in scheme_gmp_tls_unload ()
> (gdb)
>
> The same happens for the binary with debug symbols:
>
>   gdb ../racket-lang/racket/racket/src/build/racket/racket3m
> GNU gdb (Ubuntu 8.3-0ubuntu1) 8.3
> Copyright (C) 2019 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.
> Type "show copying" and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>.
> Find the GDB manual and other documentation resources online at:
> <http://www.gnu.org/software/gdb/documentation/>.
>
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from
> ../racket-lang/racket/racket/src/build/racket/racket3m...
> (gdb) run
> Starting program:
> /home/joe/Projects/Programming/racket-lang/racket/racket/src/build/racket/racket3m
>
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> Welcome to Racket v7.7.0.4.
> [New Thread 0x77fcb700 (LWP 6422)]
>
> Thread 1 "racket3m" received signal SIGSEGV, Segmentation fault.
> scheme_gmp_tls_unload (s=0x76114480, data=0x0) at
> ../../../racket/gc2/../src/gmp/gmp.c:5822
> 5822  s[0] = 0;
> (gdb)
>
> I am running Ubuntu 19.10's default gdb:
>
> $ gdb --version
> GNU gdb (Ubuntu 8.3-0ubuntu1) 8.3
>
> I assume GMP is used for the bignum implementation (I didn't check yet), so it
> might be relevant as well:
>
> ii  libgmp-dev:amd64     2:6.1.2+dfsg-4   amd64   Multiprecision arithmetic library developers tools
> ii  libgmp10:amd64       2:6.1.2+dfsg-4   amd64   Multiprecision arithmetic library
> ii  libgmp10:i386        2:6.1.2+dfsg-4   i386    Multiprecision arithmetic library

Re: [racket-users] Another futures-related bug hunt

2020-05-08 Thread Dominik Pantůček

Hello,

On 08. 05. 20 14:27, Matthew Flatt wrote:

At Fri, 8 May 2020 09:34:32 +0200, Dominik Pantůček wrote:

Apart from obvious strace (after freeze) and gdb (before/after freeze)
debugging to find possible sources of this bug, is there even a remote
possibility of getting any clue as to how this can happen based on the
information gathered so far? My thoughts go along these lines:

* flonums are boxed - but for some operations they may be immediate
* apparently it is a busy-wait loop in the RTT, otherwise 100% CPU usage is
impossible with this workload
* unsafe ops are always suspicious, but again, the problem shows up even
when I switch to the safe versions - it just takes longer
* which means the most probable cause is a race condition


The most useful information here is likely to be a stack trace from
each OS-level thread at the point where the application is stuck.

That could potentially tell us, for example, that it's a problem with
synchronization for a GC (where one of the OS threads that run futures
doesn't cooperate for some reason) or a problem with the main thread
performing some specific work on a future thread's behalf.



I am using the build from master branch with the patch for #3145 and 
cannot make it run under gdb:


$ gdb ../racket-lang/racket/racket/bin/racket
GNU gdb (Ubuntu 8.3-0ubuntu1) 8.3
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ../racket-lang/racket/racket/bin/racket...
(No debugging symbols found in ../racket-lang/racket/racket/bin/racket)
(gdb) run
Starting program: 
/home/joe/Projects/Programming/racket-lang/racket/racket/bin/racket

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Welcome to Racket v7.7.0.4.
[New Thread 0x77fcb700 (LWP 6410)]

Thread 1 "racket" received signal SIGSEGV, Segmentation fault.
0x555e14fe in scheme_gmp_tls_unload ()
(gdb)

The same happens for the binary with debug symbols:

 gdb ../racket-lang/racket/racket/src/build/racket/racket3m
GNU gdb (Ubuntu 8.3-0ubuntu1) 8.3
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from 
../racket-lang/racket/racket/src/build/racket/racket3m...

(gdb) run
Starting program: 
/home/joe/Projects/Programming/racket-lang/racket/racket/src/build/racket/racket3m 


[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Welcome to Racket v7.7.0.4.
[New Thread 0x77fcb700 (LWP 6422)]

Thread 1 "racket3m" received signal SIGSEGV, Segmentation fault.
scheme_gmp_tls_unload (s=0x76114480, data=0x0) at 
../../../racket/gc2/../src/gmp/gmp.c:5822

5822  s[0] = 0;
(gdb)

I am running Ubuntu 19.10's default gdb:

$ gdb --version
GNU gdb (Ubuntu 8.3-0ubuntu1) 8.3

I assume GMP is used for the bignum implementation (I didn't check yet), so it 
might be relevant as well:


ii  libgmp-dev:amd64     2:6.1.2+dfsg-4   amd64   Multiprecision arithmetic library developers tools
ii  libgmp10:amd64       2:6.1.2+dfsg-4   amd64   Multiprecision arithmetic library
ii  libgmp10:i386        2:6.1.2+dfsg-4   i386    Multiprecision arithmetic library
ii  libgmpxx4ldbl:amd64  2:6.1.2+dfsg-4   amd64   Multiprecision arithmetic library (C++ bindings)
ii  python-gmpy:amd64    1.17-4           amd64   interfaces GMP to Python for fast, unbound-precision computations



I will pull the latest master and retry, but that is really just a blind guess.


Cheers,
Dominik


Re: [racket-users] Another futures-related bug hunt

2020-05-08 Thread Matthew Flatt
At Fri, 8 May 2020 09:34:32 +0200, Dominik Pantůček wrote:
> Apart from obvious strace (after freeze) and gdb (before/after freeze) 
> debugging to find possible sources of this bug, is there even a remote 
> possibility of getting any clue as to how this can happen based on the 
> information gathered so far? My thoughts go along these lines:
> 
> * flonums are boxed - but for some operations they may be immediate
> * apparently it is a busy-wait loop in the RTT, otherwise 100% CPU usage is 
> impossible with this workload
> * unsafe ops are always suspicious, but again, the problem shows up even 
> when I switch to the safe versions - it just takes longer
> * which means the most probable cause is a race condition

The most useful information here is likely to be a stack trace from
each OS-level thread at the point where the application is stuck.

That could potentially tell us, for example, that it's a problem with
synchronization for a GC (where one of the OS threads that run futures
doesn't cooperate for some reason) or a problem with the main thread
performing some specific work on a future thread's behalf.



[racket-users] Another futures-related bug hunt

2020-05-08 Thread Dominik Pantůček

Hello fellow Racketeers,

my spare-time, out-of-curiosity venture into using HPR (High-Performance 
Racket) to create a software 3D rendering pipeline seems to be pushing 
futures against their rough edges.


The scenario is sort of "usual":

* 7 futures + 1 in RTT that form a binary tree
* GUI thread running

But this time the futures perform not only data-heavy fixnum 
operations, but flonum operations as well.


Something along the lines of 2560x1440 fixnums and the same number of 
flonums is being handled effectively across 8 threads (give or take some 
optimizations that usually lower the 1440 height slightly).


The code in question is relatively short - say 60 lines - however, it 
does not make much sense without the remaining 2k lines :)
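
Just to give an idea of the shape of the computation, here is a heavily
simplified, purely illustrative sketch (all names are made up and it is
nothing like the real code): 8 workers - 7 futures plus the runtime
thread - each take a band of rows of a 2560x1440 buffer and do mixed
fixnum/flonum work on it. The real program arranges the futures as a
binary tree; the sketch just splits the rows flat.

#lang racket
(require racket/future racket/flonum racket/fixnum)

(define W 2560)
(define H 1440)
(define depth  (make-flvector (* W H) 0.0))   ; flonum buffer
(define pixels (make-fxvector (* W H) 0))     ; fixnum buffer

;; Made-up per-pixel fixnum/flonum work on rows y0..y1-1.
(define (render-rows! y0 y1)
  (for* ([y (in-range y0 y1)]
         [x (in-range W)])
    (define i (fx+ (fx* y W) x))
    (define z (fl/ (fl+ (->fl x) (->fl y)) 4000.0))
    (flvector-set! depth i z)
    (fxvector-set! pixels i (fxand (fl->fx (fl* z 255.0)) 255))))

;; Split the height into 8 bands: 7 bands go to futures, the last band
;; runs in the runtime thread, then the futures are touched.
(define band (quotient H 8))
(define fs
  (for/list ([k (in-range 7)])
    (future (lambda () (render-rows! (fx* k band) (fx* (fx+ k 1) band))))))
(render-rows! (fx* 7 band) H)
(for-each touch fs)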


If the operation runs without futures, only in the RTT, nothing happens. 
But under heavy load, and after a VERY varying amount of time (seconds to 
hours), it completely freezes with:


* 1 CPU used at 100% (as top/htop shows)
* Does not handle socket operations (the X11 WM message for closing the window)
* Does not respond to SIGINT (from the keyboard or via kill)
* Can only be forcibly stopped by SIGKILL (or similar) or by forcefully 
closing the window from the WM, which probably gets handled somewhere in 
the lower-level parts of GDK, completely without Racket runtime 
intervention (it just prints "Killed" and the exit code is 137)


Based on these observations I can only conclude that it is the RTT that 
gets stuck - but that is only the native-thread perspective. From the 
Racket-thread perspective, it can be either the "main" application 
thread, which is in (thread-wait) for the thread that performs the 
futures work, or the GUI thread, which is created by parameterizing the 
eventspace (that is just some trickery to allow me to send breaks when I 
receive the window-close event).
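
For reference, a minimal sketch of that thread layout (hypothetical
names, not the actual code): the worker Racket thread runs the
futures-based renderer, the main thread blocks in (thread-wait) on it,
and the GUI gets its own eventspace so that the window-close event can
send a break to the worker.

#lang racket/gui

(define (render-loop)                ; stand-in for the futures-based renderer
  (let loop ([i 0]) (loop (add1 i))))

(define worker
  (thread (lambda ()
            (with-handlers ([exn:break? (lambda (e) (void))])  ; stop on break
              (render-loop)))))

(parameterize ([current-eventspace (make-eventspace)])
  (define frame
    (new (class frame%
           (super-new [label "renderer"] [width 400] [height 300])
           (define/augment (on-close)         ; window closed from the WM
             (break-thread worker)))))
  (send frame show #t))

(thread-wait worker)                 ; the "main" application thread waits here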


Apart from obvious strace (after freeze) and gdb (before/after freeze) 
debugging to find possible sources of this bug, is there even a remote 
possibility of getting any clue as to how this can happen based on the 
information gathered so far? My thoughts go along these lines:


* flonums are boxed - but for some operations they may be immediate
* apparently it is a busy-wait loop in the RTT, otherwise 100% CPU usage is 
impossible with this workload
* unsafe ops are always suspicious, but again, the problem shows up even 
when I switch to the safe versions - it just takes longer
* which means the most probable cause is a race condition

And that is basically all I can tell right now.

Of course, any suggestions would be really welcome.

Cheers,
Dominik

P.S.: I am really curious what I will find when I finally put 
fsemaphores into the mix...




