Hi,

I ran those commands and have posted the outputs at:
https://svn.open-mpi.org/trac/ompi/ticket/3076

-mca shmem posix worked for all -np values (even when oversubscribing);
however, sysv did not work for any -np.
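
For reference, the runs were of roughly this form (paths shortened; the
command line follows the mdrun_mpi invocation quoted further down in this
thread, and -np 18 is just one of the values tested):

    mpirun --hostfile hostfile -np 18 -mca shmem posix ./mdrun_mpi -s topol.tpr -o output.trr
    mpirun --hostfile hostfile -np 18 -mca shmem sysv  ./mdrun_mpi -s topol.tpr -o output.trr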

On Tue, Apr 24, 2012 at 5:36 PM, Gutierrez, Samuel K <sam...@lanl.gov> wrote:

>  Hi,
>
>  Just out of curiosity, what happens when you add
>
>  -mca shmem posix
>
>  to your mpirun command line using 1.5.5?
>
>  Can you also please try:
>
>  -mca shmem sysv
>
>  I'm shooting in the dark here, but I want to make sure that the failure
> isn't due to a small backing store.
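>
>  For example, something along these lines (the process count and binary
> here are placeholders, not taken from your actual runs):
>
>      mpirun --hostfile hostfile -np 12 -mca shmem posix ./mdrun_mpi ...
>      mpirun --hostfile hostfile -np 12 -mca shmem sysv  ./mdrun_mpi ...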
>
>  Thanks,
>
>  Sam
>
>  On Apr 16, 2012, at 8:57 AM, Gutierrez, Samuel K wrote:
>
>   Hi,
>
>  Sorry about the lag.  I'll take a closer look at this ASAP.
>
>  Appreciate your patience,
>
>  Sam
>  ------------------------------
> From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf
> of Ralph Castain [r...@open-mpi.org]
> Sent: Monday, April 16, 2012 8:52 AM
> To: Seyyed Mohtadin Hashemi
> Cc: us...@open-mpi.org
> Subject: Re: [OMPI users] OpenMPI fails to run with -np larger than 10
>
>  No earthly idea. As I said, I'm afraid Sam is pretty much unavailable
> for the next two weeks, so we probably don't have much hope of fixing it.
>
>  I see in your original note that you tried the 1.5.5 beta rc and got the
> same results, so I assume this must be something in your system config
> that is causing the issue. I'll file a bug for him (pointing to this
> thread) so this doesn't get lost, but would suggest you run ^sm for now
> unless someone else has other suggestions.
>
>
>  On Apr 16, 2012, at 2:57 AM, Seyyed Mohtadin Hashemi wrote:
>
>  I recompiled everything from scratch with GCC 4.4.5 and 4.7 using the OMPI
> 1.4.5 tarball.
>
>  I did some tests and it does not seem that I can make it work; I tried
> these:
>
>  btl_sm_num_fifos 4
> btl_sm_free_list_num 1000
> btl_sm_free_list_max 1000000
> mpool_sm_min_size 1500000000
> mpool_sm_max_size 7500000000
>
>  but nothing helped. I started out with varying one parameter at a time from
> its default up to 1000000 (except the fifo count, which I only varied up to
> 100, and sm_min and sm_max, which I varied from 67 MB [default was set to
> 67xxxxxx] to 7.5 GB) to see what reactions I could get. When running with
> -np 10 everything worked, but as soon as I went to -np 11 it crashed with the
> same old error.
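>
>  (These can be set on the mpirun command line as MCA parameters; a command
> of roughly this shape, with placeholder paths, shows how one such
> combination is set:
>
>      # hostfile and binary paths are placeholders
>      mpirun --hostfile hostfile -np 11 -mca btl_sm_num_fifos 4 \
>          -mca mpool_sm_min_size 1500000000 -mca mpool_sm_max_size 7500000000 \
>          ./mdrun_mpi -s topol.tpr -o output.trr )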
>
> On Fri, Apr 13, 2012 at 6:41 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>>
>>  On Apr 13, 2012, at 10:36 AM, Seyyed Mohtadin Hashemi wrote:
>>
>>  That fixed the issue but raises a big question mark over why it happened.
>>
>>  I'm pretty sure it's not a system memory issue; the node with the least RAM
>> has 8 GB, which I would think is more than enough.
>>
>>  Do you think that adjusting btl_sm_eager_limit, mpool_sm_min_size, and
>> mpool_sm_max_size could help fix the problem? (I found these at
>> http://www.open-mpi.org/faq/?category=sm )  Compared to -np 10, the
>> performance of -np 18 is worse when running with the command you suggested.
>> I'll try playing around with the parameters and see what works.
>>
>>
>>  Yes, performance will definitely be worse - I was just trying to
>> isolate the problem. I would play a little with those sizes and see what
>> you can do. Our shared memory person is pretty much unavailable for the
>> next two weeks, but the rest of us will at least try to get you working.
>>
>>  We typically do run with more than 10 ppn, so I know the base sm code
>> works at that scale. However, those nodes usually have 32 GB of RAM, and
>> the default sm params are scaled accordingly.
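>>
>>  (If it helps, the sm defaults that a given install registers can be listed
>> with ompi_info, for example:
>>
>>      ompi_info --param mpool sm | grep size
>>      ompi_info --param btl sm | grep eager
>>
>>  using the parameter names from the FAQ page you linked.)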
>>
>>
>>
>>  On Fri, Apr 13, 2012 at 5:44 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Afraid I have no idea how those packages were built, what release they
>>> correspond to, etc. I would suggest sticking with the tarballs.
>>>
>>>   Your output indicates a problem with shared memory when you
>>> completely fill the machine. Could be a couple of things, like running out
>>> of memory - but for now, try adding -mca btl ^sm to your command line. Should
>>> work.
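>>>
>>>  (That is, something along these lines, where the hostfile and binary
>>> paths are placeholders:
>>>
>>>      mpirun --hostfile hostfile -np 18 -mca btl ^sm ./mdrun_mpi -s topol.tpr -o output.trr
>>>
>>>  The "^" excludes the sm BTL, so on-node messages fall back to the TCP
>>> loopback path instead of shared memory.)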
>>>
>>>
>>>  On Apr 13, 2012, at 5:09 AM, Seyyed Mohtadin Hashemi wrote:
>>>
>>>  Hi,
>>>
>>>  Sorry it took so long to answer; I didn't get any return mails and had to
>>> check the digest for replies.
>>>
>>>  Anyway, when I compiled from scratch I did use the tarballs from
>>> open-mpi.org. GROMACS is not the problem (or at least I don't think so); I
>>> just used it as a check to see if I could run parallel jobs. I am now using
>>> the OSU benchmarks because I can't be sure that the problem is not with
>>> GROMACS.
>>>
>>>  On the new installation I have not installed (nor compiled) OMPI from
>>> the official tarballs but rather installed the "openmpi-bin,
>>> openmpi-common, libopenmpi1.3, openmpi-checkpoint, and libopenmpi-dev"
>>> packages using apt-get.
>>>
>>>  As for the simple examples (i.e., ring_c, hello_c, and connectivity_c
>>> extracted from the 1.4.2 official tarball), I get the exact same behavior
>>> as with GROMACS/OSU bench.
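>>>
>>>  (For anyone reproducing this: the example programs can be built with the
>>> Makefile shipped in the tarball's examples/ directory, or directly with
>>> mpicc, e.g.
>>>
>>>      mpicc examples/ring_c.c -o ring_c
>>>      mpirun --hostfile hostfile -np 12 ./ring_c
>>>
>>>  where the hostfile path is a placeholder.)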
>>>
>>>>  I suspect you'll have to ask someone familiar with GROMACS about that
>>>> specific package. As for testing OMPI, can you run the codes in the
>>>> examples directory - e.g., "hello" and "ring"? I assume you are downloading
>>>> and installing OMPI from our tarballs?
>>>>
>>>
>>>>  On Apr 12, 2012, at 7:04 AM, Seyyed Mohtadin Hashemi wrote:
>>>>
>>>> > Hello,
>>>> >
>>>> > I have a very peculiar problem: I have a micro cluster with three nodes
>>>> > (18 cores total); the nodes are clones of each other and connected to a
>>>> > frontend via Ethernet, with Debian Squeeze as the OS on all nodes. When I
>>>> > run parallel jobs I can use up to "-np 10"; if I go further the job
>>>> > crashes. I have primarily done tests with GROMACS (because that is what I
>>>> > will be running) but have also used OSU Micro-Benchmarks 3.5.2.
>>>> >
>>>> > For a simple parallel job I use: "path/mpirun --hostfile path/hostfile
>>>> > -np XX -d --display-map path/mdrun_mpi -s path/topol.tpr -o
>>>> > path/output.trr"
>>>> >
>>>> > (path is global) For "-np XX" smaller than or equal to 10 it works;
>>>> > however, as soon as I use 11 or larger the whole thing crashes. The
>>>> > terminal dump is attached to this mail: when_working.txt is for "-np 10",
>>>> > when_crash.txt is for "-np 12", and OpenMPI_info.txt is the output from
>>>> > "path/mpirun --bynode --hostfile path/hostfile --tag-output ompi_info -v
>>>> > ompi full --parsable"
>>>> >
>>>> > I have tried OpenMPI v1.4.2 all the way up to beta v1.5.5, and all yield
>>>> > the same result.
>>>> >
>>>> > The output files are from a new install I did today: I formatted all
>>>> > nodes, started from a fresh minimal install of Squeeze, used "apt-get
>>>> > install gromacs gromacs-openmpi", and installed all dependencies. Then I
>>>> > ran two jobs using the parameters described above; I also did one with
>>>> > the OSU benchmark (data is not included) and it also crashed with "-np"
>>>> > larger than 10.
>>>> >
>>>> > I hope somebody can help figure out what is wrong and how I can fix it.
>>>> >
>>>> > Best regards,
>>>> > Mohtadin
>>>> >
>>>> > <Archive.zip>
>>>
>>>
>>
>>
>> --
>> De venligste hilsner/I am, yours most sincerely
>> Seyyed Mohtadin Hashemi
>>
>>
>>
>
>
> --
> De venligste hilsner/I am, yours most sincerely
> Seyyed Mohtadin Hashemi
>
>
>
>
>


-- 
De venligste hilsner/I am, yours most sincerely
Seyyed Mohtadin Hashemi
