Re: [Dmtcp-forum] indefinite stall within a python pipeline

2017-08-11 Thread Jiajun Cao
Hi Rob,

The risk with the patch is that if a checkpoint is taken in the middle of
a fork(), there may be some inconsistency. But I'm not sure that is the
case you were describing. As I understand it, the original hang was
observed while the program was launching, and once it passed the
initialization phase (thanks to the patch), there wouldn't be many more
calls to fork(). Please correct me if I'm wrong.

Also, I don't completely understand the issue. What does the
acknowledgment mean? Should that be handled by the application itself?
Could you post the exact error message?

Best,
Jiajun

On Wed, Aug 09, 2017 at 04:58:21PM -0700, Rob Egan wrote:
> Hi Jiajun,
> This is just an FYI.  Unsurprisingly, this patch seems to be interfering
> with a proper checkpoint / restart, and I think it is because there is now
> a race between the states of the threads at the time of the checkpoint, so
> the state of the whole program across the threads is inconsistent.
> Basically, the way I understand the script is that it creates a pool of
> workers to process a list of files and eventually move them to a new
> directory.  After restart at least one thread can't find the file anymore,
> because another thread had already moved it to its new place, but
> presumably it never got the acknowledgment from the worker thread.
> -Rob
> 
> On Mon, Aug 7, 2017 at 4:59 PM, Rob Egan  wrote:
> 
> > I can confirm that there is no hang after applying this patch.
> > Thanks,
> > Rob
> >
> > On Mon, Aug 7, 2017 at 12:04 PM, Jiajun Cao  wrote:
> >
> >> Hi Rob,
> >>
> >> Could you please try the attached patch? Basically it removes
> >> the lock/unlock operations in the fork() wrapper.
> >>
> >> Note this is a temp solution that helps us to better understand
> >> the bug. If you do not see the hanging anymore, we'll think about
> >> a more elegant fix.
> >>
> >> Best,
> >> Jiajun
> >>
> >> On Fri, Aug 04, 2017 at 03:56:05PM -0700, Rob Egan wrote:
> >> > Hi Rohan,
> >> > Great!  Thank you.  I got your email at 11:23 AM and this one at 3:15PM.
> >> > I'll watch that bug and test the patch when it is available too.
> >> > Best,
> >> > Rob
> >> >
> >> > On Fri, Aug 4, 2017 at 3:15 PM, Rohan Garg  wrote:
> >> >
> >> > > Hi Rob,
> >> > >
> >> > > For some reason, the DMTCP forum on SourceForge is acting weird;
> >> > > it rejected my last message. So I'm not sure if you received my
> >> > > last e-mail.
> >> > >
> >> > > We are looking into this actively. Jiajun has also opened an issue
> >> > > on Github (see: https://github.com/dmtcp/dmtcp/issues/612), which
> >> > > we believe is the reason for the bug that you are seeing. Hopefully,
> >> we
> >> > > should have a fix soon.
> >> > >
> >> > > -Rohan
> >> > >
> >> > > On Fri, Aug 04, 2017 at 02:23:43PM -0400, Rohan Garg wrote:
> >> > > > Hi Rob,
> >> > > >
> >> > > >
> >> > > > I was discussing this with Jiajun, and it seems like we might be
> >> able to
> >> > > > write a simple test program to simulate this bug. This will
> >> definitely
> >> > > > require a fix in DMTCP. Hopefully, we can provide a patch to you
> >> soon.
> >> > > >
> >> > > > I'll try to work on it this weekend. We'll see if a screen sharing
> >> > > > session is necessary.
> >> > > >
> >> > > > Thanks,
> >> > > > Rohan
> >> > > >
> >> > > > On Tue, Aug 01, 2017 at 11:00:50PM -0700, Rob Egan wrote:
> >> > > > > Thanks Jiajun & Rohan!
> >> > > > >
> >> > > > > So in that trace before threads 2 through 41 are all identical, so
> >> > > almost
> >> > > > > all the threads are that nanosleep.  Since this is third party
> >> > > software, my
> >> > > > > ability to change the environment and code is quite limited and
> >> python
> >> > > 3.4
> >> > > > > is the only version that I was able to get working in this
> >> environment.
> >> > > > >
> >> > > > > Jiajun's explanation sounds like the issue to me.  What would be
> >> the
> >> > > next
> >> > > > > steps to address it?  Would we need to patch dmtcp? openblas?
> >> > > > >
> >> > > > > I'd be happy to arrange a screen sharing / zoom / hangouts session
> >> > > with one
> >> > > > > or more of you.
> >> > > > >
> >> > > > > -Rob
> >> > > > >
> >> > > > > On Tue, Aug 1, 2017 at 9:37 AM, Jiajun Cao 
> >> wrote:
> >> > > > >
> >> > > > > > Hi Rob,
> >> > > > > >
> >> > > > > > I know what's going on here after browsing some blas
> >> > > > > > source code. As Rohan mentioned, thread 42 is behaving normally.
> >> > > > > > What looks more interesting is the calling stack of thread 1,
> >> the
> >> > > > > > main thread, especially the bottom 5 frames. It seems that the
> >> blas
> >> > > > > > library registers some handler at fork, using pthread_atfork():
> >> > > > > >
> >> > > > > > https://github.com/xianyi/OpenBLAS/blob/
> >> > > e70a6b92bf9f69f113e198e3ded558
> >> > > > > > ed0675b038/driver/others/memory.c
> >> > > > > >
> >> > > > > > The handler tries to join a bunch of other blas threads, in this
> >> > > case,
> >> > > > > > including thread 41.
> >> > > > 

Re: [Dmtcp-forum] indefinite stall within a python pipeline

2017-08-10 Thread Rob Egan
Hi Jiajun,

This is just an FYI.  Unsurprisingly, this patch seems to be interfering
with a proper checkpoint / restart, and I think it is because there is now
a race between the threads' states at the time of the checkpoint, so the
state of the whole program is inconsistent across threads.
Basically, the way I understand the script, it creates a pool of
workers to process a list of files and eventually move them to a new
directory.  After restart, at least one thread can't find the file anymore,
because another thread had already moved it to its new place, but
presumably it never got the acknowledgment from the worker thread.
-Rob


Re: [Dmtcp-forum] indefinite stall within a python pipeline

2017-08-07 Thread Jiajun Cao
Hi Rob,

Could you please try the attached patch? Basically it removes
the lock/unlock operations in the fork() wrapper.
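
For reference, a rough sketch of what such a wrapper can look like (the
names wrapper_exec_lock and forkwrap.c, and the LD_PRELOAD-style setup, are
illustrative assumptions, not DMTCP's actual code); the temporary patch
would correspond to dropping the lock/unlock pair around the real fork(),
so that the pthread_atfork() handlers no longer run while the
wrapper-execution lock is held:

/* forkwrap.c -- simplified fork() interposition wrapper (illustration only).
 * Build and use, e.g.:
 *   gcc -shared -fPIC forkwrap.c -o forkwrap.so -ldl -pthread
 *   LD_PRELOAD=./forkwrap.so your_program
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t wrapper_exec_lock = PTHREAD_MUTEX_INITIALIZER;

pid_t fork(void) {
    /* Look up the real fork() underneath this wrapper. */
    pid_t (*real_fork)(void) = (pid_t (*)(void))dlsym(RTLD_NEXT, "fork");

    pthread_mutex_lock(&wrapper_exec_lock);    /* <-- removed by the patch */
    pid_t pid = real_fork();                   /* atfork handlers run here */
    pthread_mutex_unlock(&wrapper_exec_lock);  /* <-- removed by the patch */
    return pid;
}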

Note that this is a temporary solution that helps us better understand
the bug. If you no longer see the hang, we'll think about a more
elegant fix.

Best,
Jiajun

On Fri, Aug 04, 2017 at 03:56:05PM -0700, Rob Egan wrote:
> Hi Rohan,
> Great!  Thank you.  I got your email at 11:23 AM and this one at 3:15PM.
> I'll watch that bug and test the patch when it is available too.
> Best,
> Rob
> 
> On Fri, Aug 4, 2017 at 3:15 PM, Rohan Garg  wrote:
> 
> > Hi Rob,
> >
> > For some reason, the DMTCP forum on SourceForge is acting weird;
> > it rejected my last message. So I'm not sure if you received my
> > last e-mail.
> >
> > We are looking into this actively. Jiajun has also opened an issue
> > on Github (see: https://github.com/dmtcp/dmtcp/issues/612), which
> > we believe is the reason for the bug that you are seeing. Hopefully, we
> > should have a fix soon.
> >
> > -Rohan
> >
> > On Fri, Aug 04, 2017 at 02:23:43PM -0400, Rohan Garg wrote:
> > > Hi Rob,
> > >
> > >
> > > I was discussing this with Jiajun, and it seems like we might be able to
> > > write a simple test program to simulate this bug. This will definitely
> > > require a fix in DMTCP. Hopefully, we can provide a patch to you soon.
> > >
> > > I'll try to work on it this weekend. We'll see if a screen sharing
> > > session is necessary.
> > >
> > > Thanks,
> > > Rohan
> > >
> > > On Tue, Aug 01, 2017 at 11:00:50PM -0700, Rob Egan wrote:
> > > > Thanks Jiajun & Rohan!
> > > >
> > > > So in that trace before threads 2 through 41 are all identical, so
> > almost
> > > > all the threads are that nanosleep.  Since this is third party
> > software, my
> > > > ability to change the environment and code is quite limited and python
> > 3.4
> > > > is the only version that I was able to get working in this environment.
> > > >
> > > > Jiajun's explanation sounds like the issue to me.  What would be the
> > next
> > > > steps to address it?  Would we need to patch dmtcp? openblas?
> > > >
> > > > I'd be happy to arrange a screen sharing / zoom / hangouts session
> > with one
> > > > or more of you.
> > > >
> > > > -Rob
> > > >
> > > > On Tue, Aug 1, 2017 at 9:37 AM, Jiajun Cao  wrote:
> > > >
> > > > > Hi Rob,
> > > > >
> > > > > I know what's going on here after browsing some blas
> > > > > source code. As Rohan mentioned, thread 42 is behaving normally.
> > > > > What looks more interesting is the calling stack of thread 1, the
> > > > > main thread, especially the bottom 5 frames. It seems that the blas
> > > > > library registers some handler at fork, using pthread_atfork():
> > > > >
> > > > > https://github.com/xianyi/OpenBLAS/blob/
> > e70a6b92bf9f69f113e198e3ded558
> > > > > ed0675b038/driver/others/memory.c
> > > > >
> > > > > The handler tries to join a bunch of other blas threads, in this
> > case,
> > > > > including thread 41.
> > > > >
> > > > > The problem is, DMTCP has its internal synchronization semantics.
> > > > > When a thread exits, it tries to grab a lock to prevent a
> > checkpoint
> > > > > from happening: it is an internal critical section. On the other
> > hand,
> > > > > when the application calls fork(), it also grabs the lock before
> > fork()
> > > > > returns: another internal critical section. That's why we see a
> > deadlock
> > > > > here:
> > > > >
> > > > > thread 1: lock acquired, and waiting for thread 41 to finish
> > > > > thread 41: trying to acquire the lock before it exits
> > > > >
> > > > > Best,
> > > > > Jiajun
> > > > >
> > > > > On Mon, Jul 31, 2017 at 10:15:15PM -0700, Rohan Garg wrote:
> > > > > > Hi Rob,
> > > > > >
> > > > > > Thread 42 is DMTCP's checkpoint thread. DMTCP injects that in every
> > > > > > user process that's launched using DMTCP. It seems to be simply
> > > > > > waiting for a message from the coordinator, which is the normal
> > > > > > state of things. It's not stuck or the cause of the deadlock here.
> > > > > >
> > > > > > The deadlock is definitely because of one or more user threads. It
> > > > > > seems like the program also forks and execs some child processes.
> > > > > > We have seen some corner cases around that but it's difficult to
> > > > > > say what it is with the information below.
> > > > > >
> > > > > > The fact that even `--disable-all-plugins` didn't work is quite
> > > > > > interesting.
> > > > > >
> > > > > > I have a few other suggestions:
> > > > > >
> > > > > >  - Is it possible to share perhaps the top 20-25 frames from each
> > of
> > > > > >the threads? (You can also send them to us/me directly at:
> > > > > >dm...@ccs.neu.edu/rohg...@ccs.neu.edu)
> > > > > >
> > > > > >Also, after you save the backtraces from all the threads, detach
> > > > > >from gdb, wait for a few seconds, re-attach, and save a second
> > > > > >set of backtraces. It'll help us figure out which threads are
> > > > > >really stuck.
> > > > >

Re: [Dmtcp-forum] indefinite stall within a python pipeline

2017-08-04 Thread Rob Egan
Hi Rohan,
Great!  Thank you.  I got your email at 11:23 AM and this one at 3:15PM.
I'll watch that bug and test the patch when it is available too.
Best,
Rob

On Fri, Aug 4, 2017 at 3:15 PM, Rohan Garg  wrote:

> Hi Rob,
>
> For some reason, the DMTCP forum on SourceForge is acting weird;
> it rejected my last message. So I'm not sure if you received my
> last e-mail.
>
> We are looking into this actively. Jiajun has also opened an issue
> on Github (see: https://github.com/dmtcp/dmtcp/issues/612), which
> we believe is the reason for the bug that you are seeing. Hopefully, we
> should have a fix soon.
>
> -Rohan
>
> On Fri, Aug 04, 2017 at 02:23:43PM -0400, Rohan Garg wrote:
> > Hi Rob,
> >
> >
> > I was discussing this with Jiajun, and it seems like we might be able to
> > write a simple test program to simulate this bug. This will definitely
> > require a fix in DMTCP. Hopefully, we can provide a patch to you soon.
> >
> > I'll try to work on it this weekend. We'll see if a screen sharing
> > session is necessary.
> >
> > Thanks,
> > Rohan
> >
> > On Tue, Aug 01, 2017 at 11:00:50PM -0700, Rob Egan wrote:
> > > Thanks Jiajun & Rohan!
> > >
> > > So in that trace before threads 2 through 41 are all identical, so
> almost
> > > all the threads are that nanosleep.  Since this is third party
> software, my
> > > ability to change the environment and code is quite limited and python
> 3.4
> > > is the only version that I was able to get working in this environment.
> > >
> > > Jiajun's explanation sounds like the issue to me.  What would be the
> next
> > > steps to address it?  Would we need to patch dmtcp? openblas?
> > >
> > > I'd be happy to arrange a screen sharing / zoom / hangouts session
> with one
> > > or more of you.
> > >
> > > -Rob
> > >
> > > On Tue, Aug 1, 2017 at 9:37 AM, Jiajun Cao  wrote:
> > >
> > > > Hi Rob,
> > > >
> > > > I know what's going on here after browsing some blas
> > > > source code. As Rohan mentioned, thread 42 is behaving normally.
> > > > What looks more interesting is the calling stack of thread 1, the
> > > > main thread, especially the bottom 5 frames. It seems that the blas
> > > > library registers some handler at fork, using pthread_atfork():
> > > >
> > > > https://github.com/xianyi/OpenBLAS/blob/
> e70a6b92bf9f69f113e198e3ded558
> > > > ed0675b038/driver/others/memory.c
> > > >
> > > > The handler tries to join a bunch of other blas threads, in this
> case,
> > > > including thread 41.
> > > >
> > > > The problem is, DMTCP has its internal synchronization semantics.
> > > > When a thread exits, it tries to grab a lock to prevent a
> checkpoint
> > > > from happening: it is an internal critical section. On the other
> hand,
> > > > when the application calls fork(), it also grabs the lock before
> fork()
> > > > returns: another internal critical section. That's why we see a
> deadlock
> > > > here:
> > > >
> > > > thread 1: lock acquired, and waiting for thread 41 to finish
> > > > thread 41: trying to acquire the lock before it exits
> > > >
> > > > Best,
> > > > Jiajun
> > > >
> > > > On Mon, Jul 31, 2017 at 10:15:15PM -0700, Rohan Garg wrote:
> > > > > Hi Rob,
> > > > >
> > > > > Thread 42 is DMTCP's checkpoint thread. DMTCP injects that in every
> > > > > user process that's launched using DMTCP. It seems to be simply
> > > > > waiting for a message from the coordinator, which is the normal
> > > > > state of things. It's not stuck or the cause of the deadlock here.
> > > > >
> > > > > The deadlock is definitely because of one or more user threads. It
> > > > > seems like the program also forks and execs some child processes.
> > > > > We have seen some corner cases around that but it's difficult to
> > > > > say what it is with the information below.
> > > > >
> > > > > The fact that even `--disable-all-plugins` didn't work is quite
> > > > > interesting.
> > > > >
> > > > > I have a few other suggestions:
> > > > >
> > > > >  - Is it possible to share perhaps the top 20-25 frames from each
> of
> > > > >the threads? (You can also send them to us/me directly at:
> > > > >dm...@ccs.neu.edu/rohg...@ccs.neu.edu)
> > > > >
> > > > >Also, after you save the backtraces from all the threads, detach
> > > > >from gdb, wait for a few seconds, re-attach, and save a second
> > > > >set of backtraces. It'll help us figure out which threads are
> > > > >really stuck.
> > > > >
> > > > >  - Is it possible to run with other Python versions (not 3.4) as a
> > > > >quick test?
> > > > >
> > > > >  - What Numpy/OpenBLAS versions are you using? I'll try to run
> > > > >a few experiments locally.
> > > > >
> > > >  - Is it possible to try to cut down your Python script to the bare
> > > > >minimum, to isolate the issue? Perhaps just with a few imports
> > > > >and then add code incrementally. But if it turns out to be more
> > > >    work than it's worth, then we should skip this suggestion.
> > > > >
> > > > >  - Finally, if it's possib

Re: [Dmtcp-forum] indefinite stall within a python pipeline

2017-08-04 Thread Rohan Garg
Hi Rob,

For some reason, the DMTCP forum on SourceForge is acting weird;
it rejected my last message. So I'm not sure if you received my
last e-mail.

We are looking into this actively. Jiajun has also opened an issue
on Github (see: https://github.com/dmtcp/dmtcp/issues/612), which
we believe is the reason for the bug that you are seeing. Hopefully, we
should have a fix soon.

-Rohan

On Fri, Aug 04, 2017 at 02:23:43PM -0400, Rohan Garg wrote:
> Hi Rob,
> 
> 
> I was discussing this with Jiajun, and it seems like we might be able to
> write a simple test program to simulate this bug. This will definitely
> require a fix in DMTCP. Hopefully, we can provide a patch to you soon.
> 
> I'll try to work on it this weekend. We'll see if a screen sharing
> session is necessary.
> 
> Thanks,
> Rohan
> 
> On Tue, Aug 01, 2017 at 11:00:50PM -0700, Rob Egan wrote:
> > Thanks Jiajun & Rohan!
> > 
> > So in that trace before threads 2 through 41 are all identical, so almost
> > all the threads are that nanosleep.  Since this is third party software, my
> > ability to change the environment and code is quite limited and python 3.4
> > is the only version that I was able to get working in this environment.
> > 
> > Jiajun's explanation sounds like the issue to me.  What would be the next
> > steps to address it?  Would we need to patch dmtcp? openblas?
> > 
> > I'd be happy to arrange a screen sharing / zoom / hangouts session with one
> > or more of you.
> > 
> > -Rob
> > 
> > On Tue, Aug 1, 2017 at 9:37 AM, Jiajun Cao  wrote:
> > 
> > > Hi Rob,
> > >
> > > I know what's going on here after browsing some blas
> > > source code. As Rohan mentioned, thread 42 is behaving normally.
> > > What looks more interesting is the calling stack of thread 1, the
> > > main thread, especially the bottom 5 frames. It seems that the blas
> > > library registers some handler at fork, using pthread_atfork():
> > >
> > > https://github.com/xianyi/OpenBLAS/blob/e70a6b92bf9f69f113e198e3ded558
> > > ed0675b038/driver/others/memory.c
> > >
> > > The handler tries to join a bunch of other blas threads, in this case,
> > > including thread 41.
> > >
> > > The problem is, DMTCP has its internal synchronization semantics.
> > > When a thread exits, it tries to grab a lock to prevent a checkpoint
> > > from happening: it is an internal critical section. On the other hand,
> > > when the application calls fork(), it also grabs the lock before fork()
> > > returns: another internal critical section. That's why we see a deadlock
> > > here:
> > >
> > > thread 1: lock acquired, and waiting for thread 41 to finish
> > > thread 41: trying to acquire the lock before it exits
> > >
> > > Best,
> > > Jiajun
> > >
> > > On Mon, Jul 31, 2017 at 10:15:15PM -0700, Rohan Garg wrote:
> > > > Hi Rob,
> > > >
> > > > Thread 42 is DMTCP's checkpoint thread. DMTCP injects that in every
> > > > user process that's launched using DMTCP. It seems to be simply
> > > > waiting for a message from the coordinator, which is the normal
> > > > state of things. It's not stuck or the cause of the deadlock here.
> > > >
> > > > The deadlock is definitely because of one or more user threads. It
> > > > seems like the program also forks and execs some child processes.
> > > > We have seen some corner cases around that but it's difficult to
> > > > say what it is with the information below.
> > > >
> > > > The fact that even `--disable-all-plugins` didn't work is quite
> > > > interesting.
> > > >
> > > > I have a few other suggestions:
> > > >
> > > >  - Is it possible to share perhaps the top 20-25 frames from each of
> > > >the threads? (You can also send them to us/me directly at:
> > > >dm...@ccs.neu.edu/rohg...@ccs.neu.edu)
> > > >
> > > >Also, after you save the backtraces from all the threads, detach
> > > >from gdb, wait for a few seconds, re-attach, and save a second
> > > >set of backtraces. It'll help us figure out which threads are
> > > >really stuck.
> > > >
> > > >  - Is it possible to run with other Python versions (not 3.4) as a
> > > >quick test?
> > > >
> > > >  - What Numpy/OpenBLAS versions are you using? I'll try to run
> > > >a few experiments locally.
> > > >
> > > >  - Is it possible to try to cut down your Python script to the bare
> > > >minimum, to isolate the issue? Perhaps just with a few imports
> > > >and then add code incrementally. But if it turns out to be more
> > > >    work than it's worth, then we should skip this suggestion.
> > > >
> > > >  - Finally, if it's possible, we could arrange for a screen sharing
> > > >session. I think it'll be more efficient for interactively
> > > >debugging this.
> > > >
> > > > -Rohan
> > > >
> > > > On Mon, Jul 31, 2017 at 03:30:55PM -0700, Rob Egan wrote:
> > > > > Hi Rohan,
> > > > >
> > > > > Thanks for the quick reply.  So using version 2.4.8 and the
> > > > > --disable-alloc-plugin and --disable-dl-plugin the behavior is the
> > > same.  I
> > > > > also 

Re: [Dmtcp-forum] indefinite stall within a python pipeline

2017-08-01 Thread Rob Egan
Thanks Jiajun & Rohan!

So in that trace from before, threads 2 through 41 are all identical, so almost
all the threads are in that nanosleep.  Since this is third-party software, my
ability to change the environment and code is quite limited and python 3.4
is the only version that I was able to get working in this environment.

Jiajun's explanation sounds like the issue to me.  What would be the next
steps to address it?  Would we need to patch dmtcp? openblas?

I'd be happy to arrange a screen sharing / zoom / hangouts session with one
or more of you.

-Rob

On Tue, Aug 1, 2017 at 9:37 AM, Jiajun Cao  wrote:

> Hi Rob,
>
> I know what's going on here after browsing some blas
> source code. As Rohan mentioned, thread 42 is behaving normally.
> What looks more interesting is the calling stack of thread 1, the
> main thread, especially the bottom 5 frames. It seems that the blas
> library registers some handler at fork, using pthread_atfork():
>
> https://github.com/xianyi/OpenBLAS/blob/e70a6b92bf9f69f113e198e3ded558
> ed0675b038/driver/others/memory.c
>
> The handler tries to join a bunch of other blas threads, in this case,
> including thread 41.
>
> The problem is, DMTCP has its internal synchronization semantics.
> When a thread exits, it tries to grab a lock to prevent a checkpoint
> from happening: it is an internal critical section. On the other hand,
> when the application calls fork(), it also grabs the lock before fork()
> returns: another internal critical section. That's why we see a deadlock
> here:
>
> thread 1: lock acquired, and waiting for thread 41 to finish
> thread 41: trying to acquire the lock before it exits
>
> Best,
> Jiajun
>
> On Mon, Jul 31, 2017 at 10:15:15PM -0700, Rohan Garg wrote:
> > Hi Rob,
> >
> > Thread 42 is DMTCP's checkpoint thread. DMTCP injects that in every
> > user process that's launched using DMTCP. It seems to be simply
> > waiting for a message from the coordinator, which is the normal
> > state of things. It's not stuck or the cause of the deadlock here.
> >
> > The deadlock is definitely because of one or more user threads. It
> > seems like the program also forks and execs some child processes.
> > We have seen some corner cases around that but it's difficult to
> > say what it is with the information below.
> >
> > The fact that even `--disable-all-plugins` didn't work is quite
> > interesting.
> >
> > I have a few other suggestions:
> >
> >  - Is it possible to share perhaps the top 20-25 frames from each of
> >the threads? (You can also send them to us/me directly at:
> >dm...@ccs.neu.edu/rohg...@ccs.neu.edu)
> >
> >Also, after you save the backtraces from all the threads, detach
> >from gdb, wait for a few seconds, re-attach, and save a second
> >set of backtraces. It'll help us figure out which threads are
> >really stuck.
> >
> >  - Is it possible to run with other Python versions (not 3.4) as a
> >quick test?
> >
> >  - What Numpy/OpenBLAS versions are you using? I'll try to run
> >a few experiments locally.
> >
> >  - Is it possible to try to cut down your Python script to the bare
> >minimum, to isolate the issue? Perhaps just with a few imports
> >and then add code incrementally. But if it turns out to be more
> >    work than it's worth, then we should skip this suggestion.
> >
> >  - Finally, if it's possible, we could arrange for a screen sharing
> >session. I think it'll be more efficient for interactively
> >debugging this.
> >
> > -Rohan
> >
> > On Mon, Jul 31, 2017 at 03:30:55PM -0700, Rob Egan wrote:
> > > Hi Rohan,
> > >
> > > Thanks for the quick reply.  So using version 2.4.8 and the
> > > --disable-alloc-plugin and --disable-dl-plugin the behavior is the
> same.  I
> > > also tried --disable-all-plugins and it still did not work, so I think
> this
> > > may be different from the numpy bug you identified, though as you will
> > > see the stack trace does show one of the threads in the numpy /
> openblas
> > > library.
> > >
> > > Though we are running at NERSC, I am not running this on the cray
> systems
> > > (yet) and not through slurm (yet).
> > >
> > > When I attach to the running process, unfortunately there are a ton of
> > > threads and a deep python stack.  Apparently even though I only
> requested 1
> > > thread, the python library allocates a ton of them.  I tried to remove
> > > duplicates manually for you.  It looks like threads 42 and 1 are the
> > > interesting ones.  2-41 are all the same.  It looks to me like
> > > dmtcp_coordinator is interfering with the blas library's pthread_join
> calls.
> > >
> > >
> > >
> > >
> > > Thread 42 (Thread 0x7f266738e700 (LWP 14136)):
> > > #0  0x7f26690ed14d in read () from /lib/libpthread.so.0
> > > #1  0x7f2669804224 in _real_read (fd=,
> buf= > > out>, count=) at syscallsreal.c:504
> > > #2  0x7f26697eae80 in dmtcp::Util::readAll (fd=821,
> buf=0x7f266738bc00,
> > > count=136) at util_misc.cpp:279
> > > #3  0x7f26697c303b in 

Re: [Dmtcp-forum] indefinite stall within a python pipeline

2017-08-01 Thread Jiajun Cao
Hi Rob,

I know what's going on here after browsing some blas
source code. As Rohan mentioned, thread 42 is behaving normally.
What looks more interesting is the call stack of thread 1, the
main thread, especially the bottom 5 frames. It seems that the blas
library registers a handler at fork time, using pthread_atfork():

https://github.com/xianyi/OpenBLAS/blob/e70a6b92bf9f69f113e198e3ded558ed0675b038/driver/others/memory.c

The handler tries to join a bunch of other blas threads, in this case,
including thread 41.

The problem is that DMTCP has its own internal synchronization semantics.
When a thread exits, it tries to grab a lock to prevent a checkpoint
from happening: this is an internal critical section. On the other hand,
when the application calls fork(), it also grabs that lock and holds it
until fork() returns: another internal critical section. That's why we see
a deadlock here:

thread 1: lock acquired, and waiting for thread 41 to finish
thread 41: trying to acquire the lock before it exits
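
To make those two critical sections concrete, here is a minimal,
self-contained C sketch of the pattern (simplified stand-ins, not DMTCP's
or OpenBLAS's real code: wrapper_lock plays the role of DMTCP's
wrapper-execution lock, and atfork_prepare plays the role of the OpenBLAS
atfork handler). Compiled with gcc -pthread, it is expected to hang in the
same way:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t wrapper_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_t worker;

/* Worker thread: does some work, then needs the wrapper lock on its
 * exit path (as a DMTCP thread wrapper would). */
static void *worker_fn(void *arg) {
    (void)arg;
    sleep(1);
    pthread_mutex_lock(&wrapper_lock);   /* blocks: main still holds it */
    pthread_mutex_unlock(&wrapper_lock);
    return NULL;
}

/* atfork prepare handler: join the worker thread (OpenBLAS-style). */
static void atfork_prepare(void) {
    pthread_join(worker, NULL);          /* never returns: worker is stuck */
}

int main(void) {
    pthread_atfork(atfork_prepare, NULL, NULL);
    pthread_create(&worker, NULL, worker_fn, NULL);

    /* "fork() wrapper": take the lock, then call fork(), which runs the
     * prepare handler while the lock is still held -> deadlock. */
    pthread_mutex_lock(&wrapper_lock);
    pid_t pid = fork();
    pthread_mutex_unlock(&wrapper_lock);

    if (pid == 0)
        _exit(0);
    printf("fork() returned; no deadlock this run\n");
    return 0;
}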

Best,
Jiajun

On Mon, Jul 31, 2017 at 10:15:15PM -0700, Rohan Garg wrote:
> Hi Rob,
> 
> Thread 42 is DMTCP's checkpoint thread. DMTCP injects that in every
> user process that's launched using DMTCP. It seems to be simply
> waiting for a message from the coordinator, which is the normal
> state of things. It's not stuck or the cause of the deadlock here.
> 
> The deadlock is definitely because of one or more user threads. It
> seems like the program also forks and execs some child processes.
> We have seen some corner cases around that but it's difficult to
> say what it is with the information below.
> 
> The fact that even `--disable-all-plugins` didn't work is quite
> interesting.
> 
> I have a few other suggestions:
> 
>  - Is it possible to share perhaps the top 20-25 frames from each of
>the threads? (You can also send them to us/me directly at:
>dm...@ccs.neu.edu/rohg...@ccs.neu.edu)
> 
>Also, after you save the backtraces from all the threads, detach
>from gdb, wait for a few seconds, re-attach, and save a second
>set of backtraces. It'll help us figure out which threads are
>really stuck.
> 
>  - Is it possible to run with other Python versions (not 3.4) as a
>quick test?
> 
>  - What Numpy/OpenBLAS versions are you using? I'll try to run
>a few experiments locally.
> 
>  - Is it possible to try to cut down your Python script to the bare
>minimum, to isolate the issue? Perhaps just with a few imports
>and then add code incrementally. But if it turns out to be more
>    work than it's worth, then we should skip this suggestion.
> 
>  - Finally, if it's possible, we could arrange for a screen sharing
>session. I think it'll be more efficient for interactively
>debugging this.
> 
> -Rohan
> 
> On Mon, Jul 31, 2017 at 03:30:55PM -0700, Rob Egan wrote:
> > Hi Rohan,
> > 
> > Thanks for the quick reply.  So using version 2.4.8 and the
> > --disable-alloc-plugin and --disable-dl-plugin the behavior is the same.  I
> > also tried --disable-all-plugins and it still did not work, so I think this
> > may be different from the numpy bug you identified, though as you will
> > see the stack trace does show one of the threads in the numpy / openblas
> > library.
> > 
> > Though we are running at NERSC, I am not running this on the cray systems
> > (yet) and not through slurm (yet).
> > 
> > When I attach to the running process, unfortunately there are a ton of
> > threads and a deep python stack.  Apparently even though I only requested 1
> > thread, the python library allocates a ton of them.  I tried to remove
> > duplicates manually for you.  It looks like threads 42 and 1 are the
> > interesting ones.  2-41 are all the same.  It looks to me like
> > dmtcp_coordinator is interfering with the blas library's pthread_join calls.
> > 
> > 
> > 
> > 
> > Thread 42 (Thread 0x7f266738e700 (LWP 14136)):
> > #0  0x7f26690ed14d in read () from /lib/libpthread.so.0
> > #1  0x7f2669804224 in _real_read (fd=, buf= > out>, count=) at syscallsreal.c:504
> > #2  0x7f26697eae80 in dmtcp::Util::readAll (fd=821, buf=0x7f266738bc00,
> > count=136) at util_misc.cpp:279
> > #3  0x7f26697c303b in operator>> (t=...,
> > this=0x7f2669c33408) at ../jalib/jsocket.h:109
> > #4  dmtcp::CoordinatorAPI::recvMsgFromCoordinator (this=0x7f2669c33408,
> > msg=0x7f266738bc00, extraData=0x0) at coordinatorapi.cpp:422
> > #5  0x7f26697bb9fc in dmtcp::DmtcpWorker::waitForCoordinatorMsg
> > (msgStr=..., type=dmtcp::DMT_DO_SUSPEND) at dmtcpworker.cpp:469
> > #6  0x7f26697bc58b in dmtcp::DmtcpWorker::waitForStage1Suspend () at
> > dmtcpworker.cpp:513
> > #7  0x7f26697ccd1a in dmtcp::callbackSleepBetweenCheckpoint
> > (sec=) at mtcpinterface.cpp:61
> > #8  0x7f26697d85e9 in checkpointhread (dummy=) at
> > threadlist.cpp:359
> > #9  0x7f26697cfb9e in pthread_start (arg=) at
> > threadwrappers.cpp:159
> > #10 0x7f26690e58ca in start_thread () from /lib/libpthread.so.0
> > #11 0x7f

Re: [Dmtcp-forum] indefinite stall within a python pipeline

2017-07-31 Thread Rohan Garg
Hi Rob,

Thread 42 is DMTCP's checkpoint thread. DMTCP injects that in every
user process that's launched using DMTCP. It seems to be simply
waiting for a message from the coordinator, which is the normal
state of things. It's not stuck or the cause of the deadlock here.

The deadlock is definitely because of one or more user threads. It
seems like the program also forks and execs some child processes.
We have seen some corner cases around that but it's difficult to
say what it is with the information below.

The fact that even `--disable-all-plugins` didn't work is quite
interesting.

I have a few other suggestions:

 - Is it possible to share perhaps the top 20-25 frames from each of
   the threads? (You can also send them to us/me directly at:
   dm...@ccs.neu.edu/rohg...@ccs.neu.edu)

   Also, after you save the backtraces from all the threads, detach
   from gdb, wait for a few seconds, re-attach, and save a second
   set of backtraces. It'll help us figure out which threads are
   really stuck.

 - Is it possible to run with other Python versions (not 3.4) as a
   quick test?

 - What Numpy/OpenBLAS versions are you using? I'll try to run
   a few experiments locally.

 - Is it possible to try to cut down your Python script to the bare
   minimum, to isolate the issue? Perhaps start with just a few imports
   and then add code incrementally. But if it turns out to be more
   work than it's worth, then we should skip this suggestion.

 - Finally, if it's possible, we could arrange for a screen sharing
   session. I think it'll be more efficient for interactively
   debugging this.

-Rohan

On Mon, Jul 31, 2017 at 03:30:55PM -0700, Rob Egan wrote:
> Hi Rohan,
> 
> Thanks for the quick reply.  So using version 2.4.8 and the
> --disable-alloc-plugin and --disable-dl-plugin the behavior is the same.  I
> also tried --disable-all-plugins and it still did not work, so I think this
> may be different from the numpy bug you identified, though as you will
> see the stack trace does show one of the threads in the numpy / openblas
> library.
> 
> Though we are running at NERSC, I am not running this on the cray systems
> (yet) and not through slurm (yet).
> 
> When I attach to the running process, unfortunately there are a ton of
> threads and a deep python stack.  Apparently even though I only requested 1
> thread, the python library allocates a ton of them.  I tried to remove
> duplicates manually for you.  It looks like threads 42 and 1 are the
> interesting ones.  2-41 are all the same.  It looks to me like
> dmtcp_coordinator is interfering with the blas library's pthread_join calls.
> 
> 
> 
> 
> Thread 42 (Thread 0x7f266738e700 (LWP 14136)):
> #0  0x7f26690ed14d in read () from /lib/libpthread.so.0
> #1  0x7f2669804224 in _real_read (fd=, buf= out>, count=) at syscallsreal.c:504
> #2  0x7f26697eae80 in dmtcp::Util::readAll (fd=821, buf=0x7f266738bc00,
> count=136) at util_misc.cpp:279
> #3  0x7f26697c303b in operator>> (t=...,
> this=0x7f2669c33408) at ../jalib/jsocket.h:109
> #4  dmtcp::CoordinatorAPI::recvMsgFromCoordinator (this=0x7f2669c33408,
> msg=0x7f266738bc00, extraData=0x0) at coordinatorapi.cpp:422
> #5  0x7f26697bb9fc in dmtcp::DmtcpWorker::waitForCoordinatorMsg
> (msgStr=..., type=dmtcp::DMT_DO_SUSPEND) at dmtcpworker.cpp:469
> #6  0x7f26697bc58b in dmtcp::DmtcpWorker::waitForStage1Suspend () at
> dmtcpworker.cpp:513
> #7  0x7f26697ccd1a in dmtcp::callbackSleepBetweenCheckpoint
> (sec=) at mtcpinterface.cpp:61
> #8  0x7f26697d85e9 in checkpointhread (dummy=) at
> threadlist.cpp:359
> #9  0x7f26697cfb9e in pthread_start (arg=) at
> threadwrappers.cpp:159
> #10 0x7f26690e58ca in start_thread () from /lib/libpthread.so.0
> #11 0x7f26697cf9c9 in clone_start (arg=0x7f2669c15008) at
> threadwrappers.cpp:68
> #12 0x7f26687c3abd in clone () from /lib/libc.so.6
> #13 0x in ?? ()
> 
> Thread 41 (Thread 0x7f2660f99700 (LWP 14137)):
> #0  0x7f26690ed86d in nanosleep () from /lib/libpthread.so.0
> #1  0x7f26697bf599 in dmtcp::ThreadSync::wrapperExecutionLockLock () at
> threadsync.cpp:392
> #2  0x7f26697cfba6 in pthread_start (arg=) at
> threadwrappers.cpp:161
> #3  0x7f26690e58ca in start_thread () from /lib/libpthread.so.0
> #4  0x7f26697cf9c9 in clone_start (arg=0x7f2669c24008) at
> threadwrappers.cpp:68
> #5  0x7f26687c3abd in clone () from /lib/libc.so.6
> #6  0x in ?? ()
> 
> Thread 40 (Thread 0x7f2660798700 (LWP 14138)):
> ...
> 
> Thread 1 (Thread 0x7f2669c26740 (LWP 14131)):
> #0  0x7f26690ece69 in __lll_timedwait_tid () from /lib/libpthread.so.0
> #1  0x7f26690e6ebf in pthread_timedjoin_np () from /lib/libpthread.so.0
> #2  0x7f26697d0048 in pthread_join (thread=139802812454656, retval=0x0)
> at threadwrappers.cpp:274
> #3  0x7f26615cd2d7 in blas_thread_shutdown_ () from
> /global/homes/r/regan/.local/lib/python3.4/site-packages/numpy/core/../.libs/
> libopenblasp-r0-39a31c03.2.18.so

Re: [Dmtcp-forum] indefinite stall within a python pipeline

2017-07-31 Thread Rob Egan
Hi Rohan,

Thanks for the quick reply.  So using version 2.4.8 and the
--disable-alloc-plugin and --disable-dl-plugin the behavior is the same.  I
also tried --disable-all-plugins and it still did not work, so I think this
may be different from the numpy bug you identified, though as you will
see the stack trace does show one of the threads in the numpy / openblas
library.

Though we are running at NERSC, I am not running this on the cray systems
(yet) and not through slurm (yet).

When I attach to the running process, unfortunately there are a ton of
threads and a deep python stack.  Apparently even though I only requested 1
thread, the python library allocates a ton of them.  I tried to remove
duplicates manually for you.  It looks like threads 42 and 1 are the
interesting ones.  Threads 2-41 are all the same.  It looks to me like
dmtcp_coordinator is interfering with the blas library's pthread_join calls.




Thread 42 (Thread 0x7f266738e700 (LWP 14136)):
#0  0x7f26690ed14d in read () from /lib/libpthread.so.0
#1  0x7f2669804224 in _real_read (fd=, buf=, count=) at syscallsreal.c:504
#2  0x7f26697eae80 in dmtcp::Util::readAll (fd=821, buf=0x7f266738bc00,
count=136) at util_misc.cpp:279
#3  0x7f26697c303b in operator>> (t=...,
this=0x7f2669c33408) at ../jalib/jsocket.h:109
#4  dmtcp::CoordinatorAPI::recvMsgFromCoordinator (this=0x7f2669c33408,
msg=0x7f266738bc00, extraData=0x0) at coordinatorapi.cpp:422
#5  0x7f26697bb9fc in dmtcp::DmtcpWorker::waitForCoordinatorMsg
(msgStr=..., type=dmtcp::DMT_DO_SUSPEND) at dmtcpworker.cpp:469
#6  0x7f26697bc58b in dmtcp::DmtcpWorker::waitForStage1Suspend () at
dmtcpworker.cpp:513
#7  0x7f26697ccd1a in dmtcp::callbackSleepBetweenCheckpoint
(sec=) at mtcpinterface.cpp:61
#8  0x7f26697d85e9 in checkpointhread (dummy=) at
threadlist.cpp:359
#9  0x7f26697cfb9e in pthread_start (arg=) at
threadwrappers.cpp:159
#10 0x7f26690e58ca in start_thread () from /lib/libpthread.so.0
#11 0x7f26697cf9c9 in clone_start (arg=0x7f2669c15008) at
threadwrappers.cpp:68
#12 0x7f26687c3abd in clone () from /lib/libc.so.6
#13 0x in ?? ()

Thread 41 (Thread 0x7f2660f99700 (LWP 14137)):
#0  0x7f26690ed86d in nanosleep () from /lib/libpthread.so.0
#1  0x7f26697bf599 in dmtcp::ThreadSync::wrapperExecutionLockLock () at
threadsync.cpp:392
#2  0x7f26697cfba6 in pthread_start (arg=) at
threadwrappers.cpp:161
#3  0x7f26690e58ca in start_thread () from /lib/libpthread.so.0
#4  0x7f26697cf9c9 in clone_start (arg=0x7f2669c24008) at
threadwrappers.cpp:68
#5  0x7f26687c3abd in clone () from /lib/libc.so.6
#6  0x in ?? ()

Thread 40 (Thread 0x7f2660798700 (LWP 14138)):
...

Thread 1 (Thread 0x7f2669c26740 (LWP 14131)):
#0  0x7f26690ece69 in __lll_timedwait_tid () from /lib/libpthread.so.0
#1  0x7f26690e6ebf in pthread_timedjoin_np () from /lib/libpthread.so.0
#2  0x7f26697d0048 in pthread_join (thread=139802812454656, retval=0x0)
at threadwrappers.cpp:274
#3  0x7f26615cd2d7 in blas_thread_shutdown_ () from
/global/homes/r/regan/.local/lib/python3.4/site-packages/numpy/core/../.libs/
libopenblasp-r0-39a31c03.2.18.so
#4  0x7f26687936d5 in fork () from /lib/libc.so.6
#5  0x7f26697c9724 in fork () at execwrappers.cpp:186
#6  0x7f2665b9ed7f in subprocess_fork_exec (self=,
args=) at
/global/common/genepool/usg/languages/python/build/Python-3.4.3/Modules/_posixsubprocess.c:649
#7  0x7f2669427225 in call_function (oparg=,
pp_stack=0x7fff86d8d690) at Python/c
eval.c:4237
#8  PyEval_EvalFrameEx (f=, throwflag=) at
Python/ceval.c:2838
#9  0x7f2669428cec in PyEval_EvalCodeEx (_co=,
globals=, locals=, args=,
argcount=19, kws=0x7f25f4014ad8, kwcount=0, defs=0x0,
defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:3588
#10 0x7f2669427ca4 in fast_function (nk=, na=, n=, pp_stack=0x7fff86d8d880, func=0x7f2669a60ea0) at
Python/ceval.c:4344
#11 call_function (oparg=, pp_stack=0x7fff86d8d880) at
Python/ceval.c:4262
#12 PyEval_EvalFrameEx (f=, throwflag=) at
Python/ceval.c:2838
#13 0x7f2669428cec in PyEval_EvalCodeEx (_co=,
globals=, locals=, args=,
argcount=2, kws=0x7f25fa0dda70, kwcount=4,
defs=0x7f2669a55110, defcount=16, kwdefs=0x0, closure=0x0) at
Python/ceval.c:3588
#14 0x7f266938a2bc in function_call (func=0x7f2669a60840,
arg=0x7f25f9f2c148, kw=0x7f25f9f46208) at Objects/funcobject.c:632
#15 0x7f266935e8be in PyObject_Call (func=0x7f2669a60840,
arg=, kw=) at Objects/abstract.c:2040
#16 0x7f2669375354 in method_call (func=0x7f2669a60840,
arg=0x7f25f9f2c148, kw=0x7f25f9f46208) at Objects/classobject.c:347
#17 0x7f266935e8be in PyObject_Call (func=0x7f2663ff4148,
arg=, kw=) at Objects/abstract.c:2040
#18 0x7f26693c15f7 in slot_tp_init (self=0x7f25f9f45198,
args=0x7f25f9f45128, kwds=0x7f25f9f46208) at Objects/typeobject.c:6177
#19 0x7f26693bd17f in type_call (type=,
args=0x7f25f9f45128, kwds=0x7f25f9f46208) at Objects/typeobject.c:898
#20 0x7f266

Re: [Dmtcp-forum] indefinite stall within a python pipeline

2017-07-29 Thread Rohan Garg
I just realized that there's an open issue related to Python and
Numpy.  See here: https://github.com/dmtcp/dmtcp/issues/599.
Unfortunately, I haven't been able to spend enough time to get a
fix out.

Could you please verify whether this is what's going on? Even if it's
not the exact same libraries, I think the bug, as reported on the
Github issue, is more general and could manifest through different
libraries, although with a similar backtrace signature.

On Fri, Jul 28, 2017 at 09:53:40PM -0400, Rohan Garg wrote:
> Hi Rob,
> 
> Thanks for trying out DMTCP and reporting this issue. We'd be happy to
> help resolve this issue.
> 
> I'll take a look at the strace, but probably the most useful thing
> here would be to try to figure out the process that's hanging and
> to look at its backtrace. The backtrace from the parent could also
> be useful. You should be able to attach gdb to the interesting
> processes and look at the backtrace.
> 
> $ gdb attach 
> ...
> (gdb) thread apply all bt
> 
> This can hopefully tell us the reason for the deadlock. Also, as a quick
> test could you please try running your program with the following DMTCP
> options?
> 
> $ dmtcp_launch --disable-alloc-plugin --disable-dl-plugin 
> 
> 
> Also, I'd recommend working with the 2.4.x/2.5.x versions; we are
> not maintaining the 1.x branch any longer.
> 
> (I do have a NERSC account and have recently diagnosed some issues
> with the CRAY toolchain/srun and DMTCP there. But it seems like
> that's not an issue here.)
> 
> Thanks,
> Rohan
> 
> On Fri, Jul 28, 2017 at 06:28:03PM -0700, Rob Egan wrote:
> > Hello,
> > 
> > I am trying to adopt dmtcp at our institution and ran into a problem with
> > the first script that I tried.  The script seems to start fine and perform
> > a few operations, but then stalls indefinitely consuming 0 cpu in what
> > looks like a futex wait loop after the process was cloned.  It is a
> > multi-threaded program and the behavior is the same whether 1 or more
> > threads are requested.
> > 
> > This is a 3rd party python script, so unfortunately I can't share the code
> > but I uploaded the strace from a 1-threaded execution:
> > 
> > dmtcp-stall-read_fast5_basecaller.py-strace.log.gz
> > 
> > http://portal.nersc.gov/dna/RD/Adv-Seq/ONT/dmtcp-stall-read_fast5_basecaller.py-strace.log.gz
> > 
> > This is how I executed the dmtcp & strace & script:
> > 
> > dmtcp_launch  strace ~/.local/bin/read_fast5_basecaller.py -i uploaded/0/
> > -o fastq,fast5 -s basecalls6 -t 1 --flowcell FLO-MIN107 --kit SQK-LSK308
> > 
> > 
> > And here is the output from the dmtcp_coordinator (which I commanded to
> > kill after a few minutes of inactivity):
> > 
> > 
> > regan@gpint200:/global/projectb/scratch/regan/nanopore/runs/X0124-onebatch2$
> > dmtcp_coordinator
> > dmtcp_coordinator starting...
> > Host: gpint200 (127.0.0.2)
> > Port: 7779
> > Checkpoint Interval: 18000
> > Exit on last client: 0
> > Type '?' for help.
> > 
> > [10273] NOTE at dmtcp_coordinator.cpp:1675 in updateCheckpointInterval;
> > REASON='CheckpointInterval updated (for this computation only)'
> >  oldInterval = 18000
> >  theCheckpointInterval = 18000
> > [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker
> > connected'
> >  hello_remote.from = 704891f6494a6304-10275-597be341
> > [10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating
> > process Information after exec()'
> >  progname = strace
> >  msg.from = 704891f6494a6304-4-597be341
> >  client->identity() = 704891f6494a6304-10275-597be341
> > [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker
> > connected'
> >  hello_remote.from = 704891f6494a6304-4-597be341
> > [10273] NOTE at dmtcp_coordinator.cpp:860 in onData; REASON='Updating
> > process Information after fork()'
> >  client->hostname() = gpint200
> >  client->progname() = strace_(forked)
> >  msg.from = 704891f6494a6304-41000-597be341
> >  client->identity() = 704891f6494a6304-4-597be341
> > [10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating
> > process Information after exec()'
> > [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker
> > connected'
> >  hello_remote.from = 704891f6494a6304-4-597be341
> > [10273] NOTE at dmtcp_coordinator.cpp:860 in onData; REASON='Updating
> > process Information after fork()'
> >  client->hostname() = gpint200
> >  client->progname() = strace_(forked)
> >  msg.from = 704891f6494a6304-41000-597be341
> >  client->identity() = 704891f6494a6304-4-597be341
> > [10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating
> > process Information after exec()'
> >  progname = python3.4
> >  msg.from = 704891f6494a6304-41000-597be341
> >  client->identity() = 704891f6494a6304-41000-597be341
> > 

Re: [Dmtcp-forum] indefinite stall within a python pipeline

2017-07-28 Thread Rohan Garg
Hi Rob,

Thanks for trying out DMTCP and reporting this issue. We'd be happy to
help resolve this issue.

I'll take a look at the strace, but probably the most useful thing
here would be to try to figure out the process that's hanging and
to look at its backtrace. The backtrace from the parent could also
be useful. You should be able to attach gdb to the interesting
processes and look at the backtrace.

$ gdb attach 
...
(gdb) thread apply all bt

This can hopefully tell us the reason for the deadlock. Also, as a quick
test could you please try running your program with the following DMTCP
options?

$ dmtcp_launch --disable-alloc-plugin --disable-dl-plugin 


Also, I'd recommend working with the 2.4.x/2.5.x versions; we are
not maintaining the 1.x branch any longer.

(I do have a NERSC account and have recently diagnosed some issues
with the CRAY toolchain/srun and DMTCP there. But it seems like
that's not an issue here.)

Thanks,
Rohan

On Fri, Jul 28, 2017 at 06:28:03PM -0700, Rob Egan wrote:
> Hello,
> 
> I am trying to adopt dmtcp at our institution and ran into a problem with
> the first script that I tried.  The script seems to start fine and perform
> a few operations, but then stalls indefinitely consuming 0 cpu in what
> looks like a futex wait loop after the process was cloned.  It is a
> multi-threaded program and the behavior is the same whether 1 or more
> threads are requested.
> 
> This is a 3rd party python script, so unfortunately I can't share the code
> but I uploaded the strace from a 1-threaded execution:
> 
> dmtcp-stall-read_fast5_basecaller.py-strace.log.gz
> 
> http://portal.nersc.gov/dna/RD/Adv-Seq/ONT/dmtcp-stall-read_fast5_basecaller.py-strace.log.gz
> 
> This is how I executed the dmtcp & strace & script:
> 
> dmtcp_launch  strace ~/.local/bin/read_fast5_basecaller.py -i uploaded/0/
> -o fastq,fast5 -s basecalls6 -t 1 --flowcell FLO-MIN107 --kit SQK-LSK308
> 
> 
> And here is the output from the dmtcp_coordinator (which I commanded to
> kill after a few minutes of inactivity):
> 
> 
> regan@gpint200:/global/projectb/scratch/regan/nanopore/runs/X0124-onebatch2$
> dmtcp_coordinator
> dmtcp_coordinator starting...
> Host: gpint200 (127.0.0.2)
> Port: 7779
> Checkpoint Interval: 18000
> Exit on last client: 0
> Type '?' for help.
> 
> [10273] NOTE at dmtcp_coordinator.cpp:1675 in updateCheckpointInterval;
> REASON='CheckpointInterval updated (for this computation only)'
>  oldInterval = 18000
>  theCheckpointInterval = 18000
> [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker
> connected'
>  hello_remote.from = 704891f6494a6304-10275-597be341
> [10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating
> process Information after exec()'
>  progname = strace
>  msg.from = 704891f6494a6304-4-597be341
>  client->identity() = 704891f6494a6304-10275-597be341
> [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker
> connected'
>  hello_remote.from = 704891f6494a6304-4-597be341
> [10273] NOTE at dmtcp_coordinator.cpp:860 in onData; REASON='Updating
> process Information after fork()'
>  client->hostname() = gpint200
>  client->progname() = strace_(forked)
>  msg.from = 704891f6494a6304-41000-597be341
>  client->identity() = 704891f6494a6304-4-597be341
> [10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating
> process Information after exec()'
> [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker
> connected'
>  hello_remote.from = 704891f6494a6304-4-597be341
> [10273] NOTE at dmtcp_coordinator.cpp:860 in onData; REASON='Updating
> process Information after fork()'
>  client->hostname() = gpint200
>  client->progname() = strace_(forked)
>  msg.from = 704891f6494a6304-41000-597be341
>  client->identity() = 704891f6494a6304-4-597be341
> [10273] NOTE at dmtcp_coordinator.cpp:869 in onData; REASON='Updating
> process Information after exec()'
>  progname = python3.4
>  msg.from = 704891f6494a6304-41000-597be341
>  client->identity() = 704891f6494a6304-41000-597be341
> [10273] NOTE at dmtcp_coordinator.cpp:1081 in onConnect; REASON='worker
> connected'
>  hello_remote.from = 704891f6494a6304-41000-597be341
> k
> [10273] NOTE at dmtcp_coordinator.cpp:603 in handleUserCommand;
> REASON='Killing all connected Peers...'
> [10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect; REASON='client
> disconnected'
>  client->identity() = 704891f6494a6304-4-597be341
>  client->progname() = strace
> [10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect; REASON='client
> disconnected'
>  client->identity() = 704891f6494a6304-41000-597be341
>  client->progname() = python3.4_(forked)
> [10273] NOTE at dmtcp_coordinator.cpp:919 in onDisconnect; REASON='client
> disconnecte