[Users] Einstein Toolkit Meeting Reminder

2023-03-22 Thread reminders
Hello,

Please consider joining the weekly Einstein Toolkit phone call at
9:00 am US central time on Thursdays. For details on how to connect
and what agenda items are to be discussed, use the link below.

** DAYLIGHT SAVING TIME WARNING **
Please note that the US / EU has already / not yet transitioned to /
from daylight saving time. The phone call will be at 15:00 Central EU
time.

https://docs.einsteintoolkit.org/et-docs/Main_Page#Weekly_Users_Call

--The Maintainers
___
Users mailing list
Users@einsteintoolkit.org
http://lists.einsteintoolkit.org/mailman/listinfo/users


Re: [Users] BNSM/TOV simulation error

2023-03-22 Thread Roland Haas
Hello Spandan Sarma,

I am afraid I am not aware of any method other than trial and error.
This is somewhat similar to the situation of scaling up a simulation
setup where one also starts with a small number of nodes then increases
the number of nodes until the speed is acceptable for the cost of the
nodes used.

In principle, the code should never fail due to too many MPI ranks; it
should either just become slow (due to increased communication overhead)
or output an error message (e.g. when there are not enough points to have
even a single point on an MPI rank).
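
The trial-and-error procedure can be organized as a small scaling table. The Python sketch below is only an illustration, not part of the Einstein Toolkit: the function names, the 0.7 efficiency cutoff, and the speed numbers are all hypothetical, and the speeds stand in for whatever you measure from short test runs (e.g. simulation time units per wall-clock hour).

```python
def scaling_table(speeds):
    """Return (ranks, speed, efficiency) rows relative to the smallest run.

    speeds maps an MPI rank count to a measured simulation speed.
    Efficiency is the measured speed divided by the speed expected from
    perfect linear scaling of the smallest run.
    """
    base_ranks = min(speeds)
    base_speed = speeds[base_ranks]
    rows = []
    for ranks in sorted(speeds):
        ideal = base_speed * ranks / base_ranks
        rows.append((ranks, speeds[ranks], speeds[ranks] / ideal))
    return rows


def largest_efficient_run(speeds, min_efficiency=0.7):
    """Pick the largest rank count whose efficiency is still acceptable."""
    good = [ranks for ranks, _, eff in scaling_table(speeds)
            if eff >= min_efficiency]
    return max(good)


# Made-up measurements, for illustration only.
measured = {4: 10.0, 8: 19.0, 16: 32.0, 32: 40.0}
for ranks, speed, eff in scaling_table(measured):
    print(f"{ranks:4d} ranks  speed {speed:6.1f}  efficiency {eff:4.2f}")
print("largest efficient run:", largest_efficient_run(measured))
```

With these made-up numbers the efficiency drops to 0.5 at 32 ranks, so 16 ranks would be the largest run still worth its cost under the assumed cutoff.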

Yours,
Roland

> Dear Roland,
> 
> Thank you so much for the help. I included your suggestions and tried
> running the TOV with 16 cores with increased resolution, and it worked
> successfully. I have submitted a BNSM simulation making similar relevant
> changes and am awaiting its result.
> 
> Also, is there any way other than trial and error to calculate how many MPI
> ranks are too much for a simulation?
> 
> Regards,
> Spandan Sarma
> 
> On Mon, Mar 20, 2023 at 9:45 PM Roland Haas wrote:
> 
> > Hello Spandan Sarma,
> >
> > Not having looked very carefully yet, one thing that has turned out to
> > be an issue recently is that the gallery example (see
> > http://einsteintoolkit.org/gallery/bns/index.html ) is "small" and set
> > up to run (see the web page) for 24 hours using 12 cores. Running on many
> > more cores (MPI ranks really) can lead to these issues.
> >
> > So the first step would be to make sure that you run small enough (I
> > would try for no more than 24 or so MPI ranks, and usually more than 8
> > threads per MPI rank is not helping) and verify that the example works.
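
The rule of thumb above (roughly 24 MPI ranks at most, no more than 8 threads per rank) can be turned into a quick check before submitting a job. The Python sketch below is a hypothetical helper, not an Einstein Toolkit or Simfactory function; the limits are taken from the message above, and it ignores hardware constraints such as the number of cores per node, which you would still need to respect.

```python
def pick_layout(total_cores, max_threads=8):
    """Split total_cores into (mpi_ranks, threads_per_rank).

    Uses the largest thread count <= max_threads that divides
    total_cores evenly, which keeps the number of MPI ranks small.
    """
    for threads in range(min(max_threads, total_cores), 0, -1):
        if total_cores % threads == 0:
            return total_cores // threads, threads


def within_guidance(ranks, threads, max_ranks=24, max_threads=8):
    """True if the layout stays inside the suggested limits."""
    return ranks <= max_ranks and threads <= max_threads


# 144 cores used as 144 single-threaded ranks vs. a hybrid layout.
print(within_guidance(144, 1))                 # False
ranks, threads = pick_layout(144)
print(ranks, threads, within_guidance(ranks, threads))  # 18 8 True
```

The helper only suggests a starting point; whether a given layout actually runs well still has to be verified with a short test run.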
> >
> > Then, you can increase the resolution (the dx, dy, dz parameters in the
> > parameter file *.par) to make sure that the NSs are well resolved
> > (resolution on the refinement level that contains them better than, say,
> > 200 m at least) and slowly scale up the number of cores until you
> > have an acceptable run speed.
> >
> > Based on your log files there were 16 MPI ranks for the TOV example
> > (which last ran on 5 MPI ranks) and 144 MPI ranks for BNS (which was
> > last run on 12 MPI ranks). In particular the latter one is "too many"
> > and I suspect the error is due to that.
> >
> > Yours,
> > Roland
> >  
> > > Hello,
> > >
> > > I was trying to run the BNSM simulation from the ET gallery on the
> > > institute cluster KANAD at IISER Bhopal in the short queue (max nodes: 16;
> > > walltime: 24 hrs) of our queuing system, but the following error came up:
> > >
> > > The grid structure is inconsistent.  It is impossible to continue.
> > >
> > > WARNING level 0 from host n16 process 0
> > >
> > >   in thorn CarpetLib, file
> > > /home2/shamims/ET_short1/Cactus/arrangements/Carpet/CarpetLib/src/dh.cc:2105:
> > >
> > >   -> The grid structure is inconsistent.  It is impossible to continue.
> > >
> > > cactus_sim:
> > > /home2/shamims/ET_short1/Cactus/arrangements/Carpet/Carpet/src/helpers.cc:275:
> > > int Carpet::Abort(const cGH*, int): Assertion `0' failed.
> > >
> > > Rank 0 with PID 4473 received signal 6
> > >
> > > Writing backtrace to nsnstohmns1/backtrace.0.txt
> > >
> > > WARNING level 0 from host n63 process 128
> > >
> > >   in thorn CarpetLib, file
> > > /home2/shamims/ET_short1/Cactus/arrangements/Carpet/CarpetLib/src/dh.cc:2105:
> > >
> > >   -> The grid structure is inconsistent.  It is impossible to continue.
> > >
> > > cactus_sim:
> > > /home2/shamims/ET_short1/Cactus/arrangements/Carpet/Carpet/src/helpers.cc:275:
> > > int Carpet::Abort(const cGH*, int): Assertion `0' failed.
> > >
> > > Rank 128 with PID 1350 received signal 6
> > >
> > > Writing backtrace to nsnstohmns1/backtrace.128.txt
> > >
> > > WARNING level 0 from host n63 process 141
> > >
> > >   in thorn CarpetLib, file
> > > /home2/shamims/ET_short1/Cactus/arrangements/Carpet/CarpetLib/src/dh.cc:2105:
> > >
> > >   -> The grid structure is inconsistent.  It is impossible to continue.
> > >
> > > cactus_sim:
> > > /home2/shamims/ET_short1/Cactus/arrangements/Carpet/Carpet/src/helpers.cc:275:
> > > int Carpet::Abort(const cGH*, int): Assertion `0' failed.
> > >
> > >
> > > After this issue, I tried running the simulation with the same
> > > parameter file in the debug queue (max: 1 node), and it worked fine. But
> > > when I tried the TOV simulation example in the debug queue, the same
> > > error appeared:
> > >
> > >
> > > WARNING level 0 from host n85 process 0
> > >
> > >   in thorn CarpetLib, file
> > > /home2/shamims/ET_debug/Cactus/arrangements/Carpet/CarpetLib/src/dh.cc:2105:
> > >
> > >   -> The grid structure is inconsistent.  It is impossible to continue.
> > >
> > > WARNING level 0 from host n85 process