On Wed, Mar 16, 2016 at 4:49 PM, Cabral, Matias A <matias.a.cab...@intel.com> wrote:
> I didn't go into the code to see who is actually calling this error message,
> but I suspect this may be a generic error for "out of memory" kind of thing
> and not specific to the queue pair. To confirm please add -mca
> pml_base_verbose 100 and add -mca mtl_base_verbose 100 to see what is being
> selected.

this didn't spit out anything overly useful, just lots of lines like these:

[node001:00909] mca: base: components_register: registering pml components
[node001:00909] mca: base: components_register: found loaded component v
[node001:00909] mca: base: components_register: component v register function successful
[node001:00909] mca: base: components_register: found loaded component bfo
[node001:00909] mca: base: components_register: component bfo register function successful
[node001:00909] mca: base: components_register: found loaded component cm
[node001:00909] mca: base: components_register: component cm register function successful
[node001:00909] mca: base: components_register: found loaded component ob1
[node001:00909] mca: base: components_register: component ob1 register function successful

> I'm trying to remember some details of IMB and alltoallv to see if it is
> indeed requiring more resources than the other micro benchmarks.

i'm using IMB for my tests, but this issue came up because a researcher
isn't able to run large alltoall codes, so i don't believe it's specific to IMB

> BTW, did you confirm the limits setup? Also do the nodes have all the same
> amount of mem?

yes, all nodes have the limits set to unlimited and each node has 256GB of
memory
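
Since the verbose output quoted earlier in this message is mostly repetition, here is a minimal sketch for condensing the components_register lines into one entry per component. It assumes the log format is exactly what is pasted above. The function name `summarize` and the `sample` text are only illustrative and not part of Open MPI.

```python
import re

# Assumed line formats, copied from the output pasted in this message:
#   [node001:00909] mca: base: components_register: found loaded component ob1
#   [node001:00909] mca: base: components_register: component ob1 register function successful
FOUND = re.compile(r"components_register: found loaded component (\S+)")
REGISTERED = re.compile(
    r"components_register: component (\S+) register function successful"
)


def summarize(lines):
    """Map each component name to True if its register function succeeded.

    Components are kept in the order they first appear in the log.
    """
    status = {}
    for line in lines:
        found = FOUND.search(line)
        if found:
            # Seen but not yet confirmed as registered.
            status.setdefault(found.group(1), False)
            continue
        registered = REGISTERED.search(line)
        if registered:
            status[registered.group(1)] = True
    return status


sample = """\
[node001:00909] mca: base: components_register: registering pml components
[node001:00909] mca: base: components_register: found loaded component v
[node001:00909] mca: base: components_register: component v register function successful
[node001:00909] mca: base: components_register: found loaded component bfo
[node001:00909] mca: base: components_register: component bfo register function successful
[node001:00909] mca: base: components_register: found loaded component cm
[node001:00909] mca: base: components_register: component cm register function successful
[node001:00909] mca: base: components_register: found loaded component ob1
[node001:00909] mca: base: components_register: component ob1 register function successful
"""

if __name__ == "__main__":
    for name, ok in summarize(sample.splitlines()).items():
        print(f"{name}: {'registered' if ok else 'found, not registered'}")
```

This only reports which components registered. It does not show which one ended up being selected, so the rest of the verbose output would still need to be checked for that.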