Re: [gmx-users] REMD stall out
This was not actually the solution; I wanted to follow up in case someone else is experiencing this problem. We are reinstalling the OpenMP version.

On Thu, Feb 20, 2020 at 3:10 PM Daniel Burns wrote:
> Hi again,
>
> It seems including our OpenMP module was responsible for the issue the
> whole time. When I submit the job loading only pmix and Gromacs, replica
> exchange proceeds.
>
> Thank you,
>
> Dan

--
Gromacs Users mailing list

* Please search the archive at
http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
send a mail to gmx-users-requ...@gromacs.org.
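For readers hitting the same packing problem, the kind of Slurm submission Dan describes (several replicas per node, each replica with its own OpenMP threads) can be sketched as follows. This is a sketch only: the node counts, exchange interval, module names, and directory names are assumptions, not taken from the thread.

```shell
#!/bin/bash
# Hypothetical layout: 8 replicas packed onto 2 nodes
# (4 MPI ranks per node, 6 OpenMP threads per rank).
#SBATCH --job-name=remd
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=6

module purge
module load pmix gromacs   # per the thread, avoid loading a separate OpenMP module

# One MPI rank per replica; -multidir expects one directory per replica,
# each containing its own topol.tpr.
mpirun -np 8 gmx_mpi mdrun -multidir eq0 eq1 eq2 eq3 eq4 eq5 eq6 eq7 \
       -replex 500 -ntomp 6
```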
Re: [gmx-users] REMD stall out
Hi again,

It seems including our OpenMP module was responsible for the issue the whole time. When I submit the job loading only pmix and Gromacs, replica exchange proceeds.

Thank you,

Dan

On Mon, Feb 17, 2020 at 9:09 AM Mark Abraham wrote:
> Hi,
>
> That could be caused by the configuration of the parallel file system or
> MPI on your cluster. If only one file descriptor is available per node to
> an MPI job, then your symptoms are explained. Some kinds of compute jobs
> follow such a model, so maybe someone optimized something for that.
>
> Mark
Re: [gmx-users] REMD stall out
Thanks Mark and Szilard,

I forwarded Mark's suggestion to IT. I'll see what they have to say, then try the simulation again and open an issue on Redmine.

Thank you,

Dan

On Mon, Feb 17, 2020 at 9:09 AM Mark Abraham wrote:
> Hi,
>
> That could be caused by the configuration of the parallel file system or
> MPI on your cluster. If only one file descriptor is available per node to
> an MPI job, then your symptoms are explained. Some kinds of compute jobs
> follow such a model, so maybe someone optimized something for that.
>
> Mark
Re: [gmx-users] REMD stall out
Hi,

That could be caused by the configuration of the parallel file system or MPI on your cluster. If only one file descriptor is available per node to an MPI job, then your symptoms are explained. Some kinds of compute jobs follow such a model, so maybe someone optimized something for that.

Mark

On Mon, 17 Feb 2020 at 15:56, Daniel Burns wrote:
> Hi Szilard,
>
> I've deleted all my output, but all writing to the log and console stops
> around the step noting the domain decomposition (or another preliminary
> task). It is the same with or without Plumed - the TREMD with Gromacs
> alone was the first thing to present this issue.
>
> I've discovered that if each replica is assigned its own node, the
> simulations proceed. If I try to run several replicas on each node
> (divided evenly), the simulations stall out before any trajectories get
> written.
>
> I have tried many different -np and -ntomp options, as well as several
> Slurm job submission scripts with node/thread configurations, but
> multiple simulations per node will not work. I need to be able to run
> several replicas on the same node to get enough data, since it's hard to
> get more than 8 nodes (and, as a result, replicas).
>
> Thanks for your reply.
>
> -Dan
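Mark's one-descriptor-per-node hypothesis can be probed directly by checking the per-process open-file limit as seen by ranks on the compute nodes. The following one-liner is a sketch that assumes a Slurm cluster; adjust the node/task counts to your allocation.

```shell
# Print hostname, rank, and the open-file (nofile) soft limit for each
# task, so you can see whether compute-node limits differ from the
# login node's.
srun --nodes=2 --ntasks-per-node=4 bash -c \
    'echo "$(hostname) rank=$SLURM_PROCID nofile=$(ulimit -n)"'
```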
Re: [gmx-users] REMD stall out
Hi Dan,

What you describe is not expected behavior, and it is something we should look into. What GROMACS version were you using?

One thing that may help diagnose the issue: try disabling replica exchange and running with -multidir alone. Does the simulation proceed?

Could you please open an issue on redmine.gromacs.org and upload the input files needed to reproduce the problem, along with logs of the runs that reproduced it?

Cheers,
--
Szilárd

On Mon, Feb 17, 2020 at 3:56 PM Daniel Burns wrote:
> Hi Szilard,
>
> I've deleted all my output, but all writing to the log and console stops
> around the step noting the domain decomposition (or another preliminary
> task). It is the same with or without Plumed - the TREMD with Gromacs
> alone was the first thing to present this issue.
>
> I've discovered that if each replica is assigned its own node, the
> simulations proceed. If I try to run several replicas on each node
> (divided evenly), the simulations stall out before any trajectories get
> written.
>
> I have tried many different -np and -ntomp options, as well as several
> Slurm job submission scripts with node/thread configurations, but
> multiple simulations per node will not work. I need to be able to run
> several replicas on the same node to get enough data, since it's hard to
> get more than 8 nodes (and, as a result, replicas).
>
> Thanks for your reply.
>
> -Dan
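Szilárd's diagnostic above (the same multi-simulation launch, but with replica exchange disabled) might look like the following sketch; the replica directory names and thread counts are assumptions.

```shell
# Same launch as the REMD run but with -replex omitted, so each replica
# runs as an independent simulation. If these proceed while the -replex
# run stalls, the problem lies in the exchange communication.
mpirun -np 6 gmx_mpi mdrun -multidir rep0 rep1 rep2 rep3 rep4 rep5 -ntomp 6
```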
Re: [gmx-users] REMD stall out
Hi Szilard,

I've deleted all my output, but all writing to the log and console stops around the step noting the domain decomposition (or another preliminary task). It is the same with or without Plumed - the TREMD with Gromacs alone was the first thing to present this issue.

I've discovered that if each replica is assigned its own node, the simulations proceed. If I try to run several replicas on each node (divided evenly), the simulations stall out before any trajectories get written.

I have tried many different -np and -ntomp options, as well as several Slurm job submission scripts with node/thread configurations, but multiple simulations per node will not work. I need to be able to run several replicas on the same node to get enough data, since it's hard to get more than 8 nodes (and, as a result, replicas).

Thanks for your reply.

-Dan

On Tue, Feb 11, 2020 at 12:56 PM Daniel Burns wrote:
> Hi,
>
> I continue to have trouble getting an REMD job to run. It never makes it
> to the point of generating trajectory files, but it never gives any error
> either.
>
> I have switched from a large TREMD with 72 replicas to the Plumed
> Hamiltonian method with only 6 replicas. Everything is now on one node,
> and each replica has 6 cores. I've turned off dynamic load balancing on
> this attempt, per the recommendation on the Plumed site.
>
> Any ideas on how to troubleshoot?
>
> Thank you,
>
> Dan
Re: [gmx-users] REMD stall out
Hi,

If I understand correctly, your jobs stall. What is in the log output? What about the console? Does this happen without PLUMED?

--
Szilárd

On Tue, Feb 11, 2020 at 7:56 PM Daniel Burns wrote:
> Hi,
>
> I continue to have trouble getting an REMD job to run. It never makes it
> to the point of generating trajectory files, but it never gives any error
> either.
>
> I have switched from a large TREMD with 72 replicas to the Plumed
> Hamiltonian method with only 6 replicas. Everything is now on one node,
> and each replica has 6 cores. I've turned off dynamic load balancing on
> this attempt, per the recommendation on the Plumed site.
>
> Any ideas on how to troubleshoot?
>
> Thank you,
>
> Dan
[gmx-users] REMD stall out
Hi,

I continue to have trouble getting an REMD job to run. It never makes it to the point of generating trajectory files, but it never gives any error either.

I have switched from a large TREMD with 72 replicas to the Plumed Hamiltonian method with only 6 replicas. Everything is now on one node, and each replica has 6 cores. I've turned off dynamic load balancing on this attempt, per the recommendation on the Plumed site.

Any ideas on how to troubleshoot?

Thank you,

Dan
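For reference, a launch matching the setup described above (6 Hamiltonian replicas on one node, 6 cores each, dynamic load balancing off) might look like the following sketch. It assumes a PLUMED-patched gmx_mpi build (the -hrex flag comes from the PLUMED patch, not stock GROMACS); exchange interval, file, and directory names are placeholders.

```shell
# Hamiltonian replica exchange via PLUMED: one MPI rank per replica,
# each replica directory holding its own scaled topology and tpr.
mpirun -np 6 gmx_mpi mdrun -multidir rep0 rep1 rep2 rep3 rep4 rep5 \
       -plumed plumed.dat -replex 200 -hrex -ntomp 6 -dlb no
```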