Re: [gmx-users] REMD stall out

2020-02-21 Thread Daniel Burns
This was not actually the solution.  I wanted to follow up in case someone
else is experiencing this problem.  We are reinstalling the OpenMP version.

On Thu, Feb 20, 2020 at 3:10 PM Daniel Burns  wrote:

> Hi again,
>
> It seems that loading our OpenMP module was responsible for the issue the
> whole time.  When I submit the job loading only the pmix and GROMACS
> modules, replica exchange proceeds.
>
> Thank you,
>
> Dan


Re: [gmx-users] REMD stall out

2020-02-20 Thread Daniel Burns
Hi again,

It seems that loading our OpenMP module was responsible for the issue the
whole time.  When I submit the job loading only the pmix and GROMACS
modules, replica exchange proceeds.
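
In other words, the working submission script now loads only something like
this before calling mdrun (module names are placeholders for our cluster):

  module purge
  module load pmix gromacs    # the separate openmp module is no longer loaded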

Thank you,

Dan

On Mon, Feb 17, 2020 at 9:09 AM Mark Abraham 
wrote:

> Hi,
>
> That could be caused by configuration of the parallel file system or MPI on
> your cluster. If only one file descriptor is available per node to an MPI
> job, then your symptoms are explained. Some kinds of compute jobs follow
> such a model, so maybe someone optimized something for that.
>
> Mark


Re: [gmx-users] REMD stall out

2020-02-17 Thread Daniel Burns
Thanks, Mark and Szilárd,

I forwarded Mark's suggestion to IT.  I'll see what they have to say, then
try the simulation again and open an issue on Redmine.

Thank you,

Dan

On Mon, Feb 17, 2020 at 9:09 AM Mark Abraham 
wrote:

> Hi,
>
> That could be caused by configuration of the parallel file system or MPI on
> your cluster. If only one file descriptor is available per node to an MPI
> job, then your symptoms are explained. Some kinds of compute jobs follow
> such a model, so maybe someone optimized something for that.
>
> Mark


Re: [gmx-users] REMD stall out

2020-02-17 Thread Mark Abraham
Hi,

That could be caused by configuration of the parallel file system or MPI on
your cluster. If only one file descriptor is available per node to an MPI
job, then your symptoms are explained. Some kinds of compute jobs follow
such a model, so maybe someone optimized something for that.
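
(I don't know your site, but assuming Slurm, which I gather you are using, a
quick way to see what open-file limit a job actually gets, as opposed to a
login shell, is something like:

  srun -N 1 -n 1 bash -c 'ulimit -n'

and then compare that with the same query run interactively on a login node.)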

Mark

On Mon, 17 Feb 2020 at 15:56, Daniel Burns  wrote:

> Hi Szilárd,
>
> I've deleted all my output, but all writing to the log and console stops
> around the step noting the domain decomposition (or another preliminary
> task).  It is the same with or without PLUMED; the TREMD run with GROMACS
> alone was the first thing to present this issue.
>
> I've discovered that if each replica is assigned its own node, the
> simulations proceed.  If I try to run several replicas on each node
> (divided evenly), the simulations stall out before any trajectories get
> written.
>
> I have tried many different -np and -ntomp options, as well as several
> Slurm job submission scripts with different node/thread configurations, but
> multiple simulations per node will not work.  I need to be able to run
> several replicas on the same node to get enough data, since it's hard to
> get more than 8 nodes (and, as a result, more than 8 replicas).
>
> Thanks for your reply.
>
> -Dan


Re: [gmx-users] REMD stall out

2020-02-17 Thread Szilárd Páll
Hi Dan,

What you describe is not expected behavior, and it is something we should
look into.

What GROMACS version were you using? One thing that may help in diagnosing
the issue: try disabling replica exchange and running with -multidir alone.
Does the simulation proceed?
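
For instance, something along these lines (directory names are placeholders
for your replica directories, and gmx_mpi is whatever your MPI-enabled binary
is called):

  mpirun -np 6 gmx_mpi mdrun -multidir sim0 sim1 sim2 sim3 sim4 sim5 -ntomp 6

i.e. the same mdrun options you have been using, just without -replex.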

Can you please open an issue on redmine.gromacs.org and upload the input
files needed to reproduce the problem, along with the logs of the runs that
reproduced it?

Cheers,
--
Szilárd


On Mon, Feb 17, 2020 at 3:56 PM Daniel Burns  wrote:

> Hi Szilárd,
>
> I've deleted all my output, but all writing to the log and console stops
> around the step noting the domain decomposition (or another preliminary
> task).  It is the same with or without PLUMED; the TREMD run with GROMACS
> alone was the first thing to present this issue.
>
> I've discovered that if each replica is assigned its own node, the
> simulations proceed.  If I try to run several replicas on each node
> (divided evenly), the simulations stall out before any trajectories get
> written.
>
> I have tried many different -np and -ntomp options, as well as several
> Slurm job submission scripts with different node/thread configurations, but
> multiple simulations per node will not work.  I need to be able to run
> several replicas on the same node to get enough data, since it's hard to
> get more than 8 nodes (and, as a result, more than 8 replicas).
>
> Thanks for your reply.
>
> -Dan

Re: [gmx-users] REMD stall out

2020-02-17 Thread Daniel Burns
Hi Szilárd,

I've deleted all my output, but all writing to the log and console stops
around the step noting the domain decomposition (or another preliminary
task).  It is the same with or without PLUMED; the TREMD run with GROMACS
alone was the first thing to present this issue.

I've discovered that if each replica is assigned its own node, the
simulations proceed.  If I try to run several replicas on each node
(divided evenly), the simulations stall out before any trajectories get
written.

I have tried many different -np and -ntomp options, as well as several Slurm
job submission scripts with different node/thread configurations, but
multiple simulations per node will not work.  I need to be able to run
several replicas on the same node to get enough data, since it's hard to get
more than 8 nodes (and, as a result, more than 8 replicas).
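
For concreteness, the sort of layout I've been trying looks roughly like this
(node, task and thread counts are placeholders for our setup, and gmx_mpi is
whatever the MPI-enabled binary is called here):

  #!/bin/bash
  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=4    # 4 replicas per node
  #SBATCH --cpus-per-task=6      # 6 threads per replica

  # 8 replicas across 2 nodes, exchange attempted every 1000 steps
  mpirun -np 8 gmx_mpi mdrun -multidir rep{0..7} -replex 1000 -ntomp 6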

Thanks for your reply.

-Dan

On Tue, Feb 11, 2020 at 12:56 PM Daniel Burns  wrote:

> Hi,
>
> I continue to have trouble getting an REMD job to run.  It never makes it
> to the point that it generates trajectory files but it never gives any
> error either.
>
> I have switched from a large TREMD with 72 replicas to the Plumed
> Hamiltonian method with only 6 replicas.  Everything is now on one node and
> each replica has 6 cores.  I've turned off the dynamic load balancing on
> this attempt per the recommendation from the Plumed site.
>
> Any ideas on how to troubleshoot?
>
> Thank you,
>
> Dan


Re: [gmx-users] REMD stall out

2020-02-17 Thread Szilárd Páll
Hi,

If I understand correctly, your jobs stall. What is in the log output? What
about the console? Does this happen without PLUMED?
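
(Even if the jobs were killed, the tail of each replica's log usually shows
how far mdrun got; assuming the default log name and one directory per
replica, something like:

  tail -n 20 rep*/md.log

would be a good starting point.)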

--
Szilárd


On Tue, Feb 11, 2020 at 7:56 PM Daniel Burns  wrote:

> Hi,
>
> I continue to have trouble getting an REMD job to run.  It never makes it
> to the point that it generates trajectory files but it never gives any
> error either.
>
> I have switched from a large TREMD with 72 replicas to the Plumed
> Hamiltonian method with only 6 replicas.  Everything is now on one node and
> each replica has 6 cores.  I've turned off the dynamic load balancing on
> this attempt per the recommendation from the Plumed site.
>
> Any ideas on how to troubleshoot?
>
> Thank you,
>
> Dan

[gmx-users] REMD stall out

2020-02-11 Thread Daniel Burns
Hi,

I continue to have trouble getting an REMD job to run.  It never makes it to
the point of generating trajectory files, but it never gives any error
either.

I have switched from a large TREMD with 72 replicas to the PLUMED
Hamiltonian replica exchange method with only 6 replicas.  Everything is now
on one node, and each replica has 6 cores.  I've turned off dynamic load
balancing for this attempt, per the recommendation on the PLUMED site.
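
In case it helps, the invocation is essentially of this form (file and
directory names are placeholders, gmx_mpi is our MPI-enabled binary, and the
-hrex flag comes from the PLUMED patch):

  mpirun -np 6 gmx_mpi mdrun -multidir rep{0..5} -plumed plumed.dat \
         -replex 200 -hrex -dlb no -ntomp 6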

Any ideas on how to troubleshoot?

Thank you,

Dan