Hi, I ran those commands and have posted the outputs at: https://svn.open-mpi.org/trac/ompi/ticket/3076
-mca shmem posix worked for all -np (even when oversubscribing); however, sysv did not work for any -np.

On Tue, Apr 24, 2012 at 5:36 PM, Gutierrez, Samuel K <sam...@lanl.gov> wrote:

> Hi,
>
> Just out of curiosity, what happens when you add
>
>   -mca shmem posix
>
> to your mpirun command line using 1.5.5?
>
> Can you also please try:
>
>   -mca shmem sysv
>
> I'm shooting in the dark here, but I want to make sure that the failure isn't due to a small backing store.
>
> Thanks,
>
> Sam
>
> On Apr 16, 2012, at 8:57 AM, Gutierrez, Samuel K wrote:
>
> Hi,
>
> Sorry about the lag. I'll take a closer look at this ASAP.
>
> Appreciate your patience,
>
> Sam
>
> ------------------------------
> From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
> Sent: Monday, April 16, 2012 8:52 AM
> To: Seyyed Mohtadin Hashemi
> Cc: us...@open-mpi.org
> Subject: Re: [OMPI users] OpenMPI fails to run with -np larger than 10
>
> No earthly idea. As I said, I'm afraid Sam is pretty much unavailable for the next two weeks, so we probably don't have much hope of fixing it.
>
> I see in your original note that you tried the 1.5.5 beta rc and got the same results, so I assume this must be something in your system config that is causing the issue. I'll file a bug for him (pointing to this thread) so this doesn't get lost, but would suggest you run ^sm for now unless someone else has other suggestions.
>
> On Apr 16, 2012, at 2:57 AM, Seyyed Mohtadin Hashemi wrote:
>
> I recompiled everything from scratch with GCC 4.4.5 and 4.7, using the OMPI 1.4.5 tarball.
>
> I did some tests and it does not seem that I can make it work; I tried these:
>
>   btl_sm_num_fifos      4
>   btl_sm_free_list_num  1000
>   btl_sm_free_list_max  1000000
>   mpool_sm_min_size     1500000000
>   mpool_sm_max_size     7500000000
>
> but nothing helped. I started out varying one parameter at a time from its default up to 1000000 (except btl_sm_num_fifos, which I only varied up to 100, and mpool_sm_min_size/mpool_sm_max_size, which I varied from 67 MB [the default was set to 67xxxxxx] up to 7.5 GB) to see what reactions I could get. When running with -np 10 everything worked, but as soon as I went to -np 11 it crashed with the same old error.
>
> On Fri, Apr 13, 2012 at 6:41 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> On Apr 13, 2012, at 10:36 AM, Seyyed Mohtadin Hashemi wrote:
>>
>> That fixed the issue but has raised a big question mark over why this happened.
>>
>> I'm pretty sure it's not a system memory issue; the node with the least RAM has 8 GB, which I would think is more than enough.
>>
>> Do you think that adjusting btl_sm_eager_limit, mpool_sm_min_size, and mpool_sm_max_size can help fix the problem? (Found this at http://www.open-mpi.org/faq/?category=sm ) Because compared to -np 10, the performance of -np 18 is worse when running with the command you suggested. I'll try playing around with the parameters and see what works.
>>
>> Yes, performance will definitely be worse - I was just trying to isolate the problem. I would play a little with those sizes and see what you can do. Our shared memory person is pretty much unavailable for the next two weeks, but the rest of us will at least try to get you working.
>>
>> We typically do run with more than 10 ppn, so I know the base sm code works at that scale. However, those nodes usually have 32 GB of RAM, and the default sm params are scaled accordingly.
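For reference, every knob discussed above is passed as an MCA parameter on the mpirun command line. A minimal sketch (the hostfile path, -np count, and ./my_app are placeholders, not commands taken from the thread; the shmem framework selections apply to the 1.5 series):

    # Select the POSIX shared-memory backing (the variant that worked here):
    mpirun -mca shmem posix -hostfile ./hostfile -np 18 ./my_app

    # Select the System V backing (the variant that failed here):
    mpirun -mca shmem sysv -hostfile ./hostfile -np 18 ./my_app

    # The sm tuning parameters from the tests above are passed the same way:
    mpirun -mca btl_sm_num_fifos 4 \
           -mca mpool_sm_min_size 1500000000 \
           -mca mpool_sm_max_size 7500000000 \
           -hostfile ./hostfile -np 18 ./my_app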
>> On Fri, Apr 13, 2012 at 5:44 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Afraid I have no idea how those packages were built, what release they correspond to, etc. I would suggest sticking with the tarballs.
>>>
>>> Your output indicates a problem with shared memory when you completely fill the machine. It could be a couple of things, like running out of memory - but for now, try adding -mca btl ^sm to your command line. That should work.
>>>
>>> On Apr 13, 2012, at 5:09 AM, Seyyed Mohtadin Hashemi wrote:
>>>
>>> Hi,
>>>
>>> Sorry that it took so long to answer; I didn't get any return mails and had to check the digest for replies.
>>>
>>> Anyway, when I compiled from scratch I did use the tarballs from open-mpi.org. GROMACS is not the problem (or at least I don't think so); I just used it as a check to see if I could run parallel jobs - I am now using the OSU benchmarks because I can't be sure that the problem is not with GROMACS.
>>>
>>> On the new installation I have not installed (nor compiled) OMPI from the official tarballs, but rather installed the openmpi-bin, openmpi-common, libopenmpi1.3, openmpi-checkpoint, and libopenmpi-dev packages using apt-get.
>>>
>>> As for the simple examples (i.e., ring_c, hello_c, and connectivity_c, extracted from the official 1.4.2 tarball), I get the exact same behavior as with GROMACS/the OSU bench.
>>>
>>>> I suspect you'll have to ask someone familiar with GROMACS about that specific package. As for testing OMPI, can you run the codes in the examples directory - e.g., "hello" and "ring"? I assume you are downloading and installing OMPI from our tarballs?
>>>>
>>>> On Apr 12, 2012, at 7:04 AM, Seyyed Mohtadin Hashemi wrote:
>>>>
>>>> > Hello,
>>>> >
>>>> > I have a very peculiar problem: I have a micro cluster with three nodes (18 cores total); the nodes are clones of each other, connected to a frontend via Ethernet, with Debian Squeeze as the OS on all nodes. When I run parallel jobs I can use up to "-np 10"; if I go further, the job crashes. I have primarily done tests with GROMACS (because that is what I will be running) but have also used OSU Micro-Benchmarks 3.5.2.
>>>> >
>>>> > For a simple parallel job I use: "path/mpirun -hostfile path/hostfile -np XX -d -display-map path/mdrun_mpi -s path/topol.tpr -o path/output.trr"
>>>> >
>>>> > (path is global.) For -np XX smaller than or equal to 10 it works; however, as soon as I use 11 or larger, the whole thing crashes. The terminal dump is attached to this mail: when_working.txt is for "-np 10", when_crash.txt is for "-np 12", and OpenMPI_info.txt is the output from "path/mpirun --bynode --hostfile path/hostfile --tag-output ompi_info -v ompi full --parsable".
>>>> >
>>>> > I have tried OpenMPI v1.4.2 all the way up to beta v1.5.5, and all yield the same result.
>>>> >
>>>> > The output files are from a new install I did today: I formatted all nodes, started from a fresh minimal install of Squeeze, used "apt-get install gromacs gromacs-openmpi", and installed all dependencies. Then I ran two jobs using the parameters described above; I also did one with the OSU bench (data is not included) - it also crashed with "-np" larger than 10.
>>>> >
>>>> > I hope somebody can help figure out what is wrong and how I can fix it.
>>>> >
>>>> > Best regards,
>>>> > Mohtadin
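A sketch of the sanity tests and the workaround suggested in the exchange above, assuming the examples/ directory of the official tarball and its bundled Makefile (paths and -np counts are illustrative):

    # Build the sanity tests shipped in the tarball's examples/ directory:
    cd openmpi-1.4.5/examples && make

    # Run them at a process count that fails for the real application:
    mpirun -hostfile ./hostfile -np 12 ./hello_c
    mpirun -hostfile ./hostfile -np 12 ./ring_c

    # Ralph's workaround: disable the shared-memory BTL, so ranks on the
    # same node fall back to TCP over loopback:
    mpirun -mca btl ^sm -hostfile ./hostfile -np 12 ./mdrun_mpi -s topol.tpr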
--
De venligste hilsner/I am, yours most sincerely
Seyyed Mohtadin Hashemi
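As a footnote to the thread: Sam's "small backing store" hypothesis can be checked directly on each node with standard Linux tools (these diagnostics are a suggestion, not commands from the thread):

    # Size and usage of the tmpfs that backs POSIX shared memory, plus the
    # temp dir where Open MPI places its session directory / sm backing file:
    df -h /dev/shm /tmp

    # System V shared-memory limits (max segment size, total pages):
    ipcs -lm
    cat /proc/sys/kernel/shmmax /proc/sys/kernel/shmall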