Re: [OMPI users] 3. Re: Hwloc error with Openmpi 1.8.3 on AMD 64 (Brice Goglin)
It's likely a BIOS bug, but I can't say more until you send the relevant data as explained earlier.

Brice

On 20/12/2014 18:10, Sergio Manzetti wrote:
> Dear Brice, the BIOS is the latest. However, I wonder if this could be a
> hardware error, as the Open MPI sources claim. Is there any way to find
> out whether this is a hardware error?
>
> Thanks
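For context on what data is typically needed to diagnose such topology errors: hwloc's own error message usually asks for the tarball produced by its hwloc-gather-topology script, and the topology hwloc detects can also be exported programmatically for a report. The following is only a minimal sketch, assuming a standalone hwloc 1.x installation (the era shipped with Open MPI 1.8.x; hwloc 2.x adds a flags argument to the XML export call), and the output file name is arbitrary.

    #include <stdio.h>
    #include <hwloc.h>

    /* Export the topology hwloc detects on this machine to an XML file
     * that can be attached to a bug report.  Sketch only; assumes the
     * hwloc 1.x API (two-argument hwloc_topology_export_xml). */
    int main(void)
    {
        hwloc_topology_t topo;

        if (hwloc_topology_init(&topo) < 0) {
            fprintf(stderr, "hwloc_topology_init failed\n");
            return 1;
        }
        if (hwloc_topology_load(topo) < 0) {
            fprintf(stderr, "hwloc_topology_load failed\n");
            hwloc_topology_destroy(topo);
            return 1;
        }
        if (hwloc_topology_export_xml(topo, "topology.xml") < 0) {
            fprintf(stderr, "hwloc_topology_export_xml failed\n");
            hwloc_topology_destroy(topo);
            return 1;
        }
        hwloc_topology_destroy(topo);
        printf("wrote topology.xml\n");
        return 0;
    }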
[OMPI users] 3. Re: Hwloc error with Openmpi 1.8.3 on AMD 64 (Brice Goglin)
Dear Brice, the BIOS is the latest. However, I wonder if this could be a
hardware error, as the Open MPI sources claim. Is there any way to find out
whether this is a hardware error?

Thanks

> From: users-requ...@open-mpi.org
> Subject: users Digest, Vol 3074, Issue 1
> To: us...@open-mpi.org
> Date: Sat, 20 Dec 2014 12:00:02 -0500
>
> Today's Topics:
>
>    1. Re: Deadlock in OpenMPI 1.8.3 and PETSc 3.4.5 (Jeff Squyres (jsquyres))
>    2. Hwloc error with Openmpi 1.8.3 on AMD 64 (Sergio Manzetti)
>    3. Re: Hwloc error with Openmpi 1.8.3 on AMD 64 (Brice Goglin)
>    4. best function to send data (Diego Avesani)
>
> --
>
> Message: 1
> From: "Jeff Squyres (jsquyres)"
> To: "Open MPI User's List"
> Cc: "petsc-ma...@mcs.anl.gov"
> Subject: Re: [OMPI users] Deadlock in OpenMPI 1.8.3 and PETSc 3.4.5
>
> On Dec 19, 2014, at 10:44 AM, George Bosilca wrote:
>
> > Regarding your second point, while I do tend to agree that such an issue
> > is better addressed in the MPI Forum, the last attempt to fix this was
> > certainly not a resounding success.
>
> Yeah, fair enough -- but it wasn't a failure, either. It could definitely
> be moved forward, but it will take time/effort, which I unfortunately don't
> have. I would be willing, however, to spin up someone who *does* have
> time/effort available to move the proposal forward.
>
> > Indeed, there is a slight window of opportunity for inconsistencies in
> > the recursive behavior.
>
> You're right; it's a small window in the threaded case, but a) that's the
> worst kind :-), and b) the non-threaded case is actually worse (because the
> global state can change from underneath the loop).
>
> > But the inconsistencies were already in the code, especially in the
> > single-threaded case. As we never received any complaints related to this
> > topic, I did not deem it interesting to address them with my last commit.
> > Moreover, the specific behavior needed by PETSc is available in Open MPI
> > when compiled without thread support, as the only thing that "protects"
> > the attributes is that global mutex.
>
> Mmmm. Ok, I see your point. But this is a (very) slippery slope.
>
> > > For example, in ompi_attr_delete_all(), it gets the count of all
> > > attributes and then loops that many times to delete each attribute. But
> > > each attribute callback can now insert or delete attributes on that
> > > entity. This can mean that the loop can either fail to delete an
> > > attribute (because some attribute callback already deleted it) or fail
> > > to delete *all* attributes (because some attribute callback added more).
> >
> > To be extremely precise, the deletion part is always correct
>
> ...as long as the hash map is not altered from the application (e.g., by
> adding or deleting another attribute during a callback).
>
> I understand that you mention above that you're not worried about this
> case. I'm just picking here because there is quite definitely a case where
> the loop is *not* correct. PETSc apparently doesn't trigger this badness,
> but... like I said above, it's a (very) slippery slope.
>
> > as it copies the values to be deleted into a temporary array before
> > calling any callbacks (and before releasing the mutex), so we only remove
> > what was in the object attribute hash when the function was called. Don't
> > misunderstand: we have an extremely good reason to do it this way, we
> > need to call the callbacks in the order in which they were created
> > (mandated by the MPI standard).
> >
> > > ompi_attr_copy_all() has similar problems -- in general, the hash that
> > > it is looping over can change underneath it.
> >
> > For the copy it is a little bit more tricky, as the calling order is not
> > imposed. Our peculiar implementation of the hash table (with array) makes
> > the code work, with a single (possibly minor) exception when the hash
> > table itself is grown between two calls. However, as stated before, this
> > issue was already present in the code in single-threaded cases for years.
> > Addressing it is another two-line patch, but I leave this exercise to an
> > interested reader.
>
> Yeah, thanks for that. :-)
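For readers following the ompi_attr_delete_all() part of the exchange, the pattern George describes (snapshot the attributes before running any delete callbacks, so callbacks that add or remove attributes cannot perturb the loop, and so the callbacks run in creation order as the MPI standard requires) can be illustrated with a small standalone sketch. This is not the actual Open MPI code: the structure name, the simplified callback signature, and the fixed-size array standing in for the per-object attribute hash are all hypothetical.

    #include <stdio.h>

    /* Hypothetical attribute entry: a key, a value, and the user's delete
     * callback.  The real Open MPI structures and MPI callback signatures
     * are richer; this only illustrates the iteration pattern. */
    typedef struct {
        int   keyval;
        void *value;
        int (*delete_fn)(int keyval, void *value);
        int   live;                      /* 1 while the attribute is set */
    } attr_entry_t;

    #define MAX_ATTRS 8
    static attr_entry_t attrs[MAX_ATTRS];   /* stand-in for the attribute hash */

    /* Remove one attribute without running its callback (used once a
     * callback has already been invoked for it, or from inside a callback). */
    static void attr_unset(int keyval)
    {
        for (int i = 0; i < MAX_ATTRS; ++i)
            if (attrs[i].live && attrs[i].keyval == keyval)
                attrs[i].live = 0;
    }

    /* Sketch of "delete all attributes": snapshot first, then call the
     * callbacks in creation order.  The loop walks only the private
     * snapshot, so a callback that mutates the attribute table cannot
     * invalidate the iteration. */
    static int attr_delete_all_sketch(void)
    {
        attr_entry_t snap[MAX_ATTRS];
        int n = 0;
        for (int i = 0; i < MAX_ATTRS; ++i)     /* take the snapshot */
            if (attrs[i].live)
                snap[n++] = attrs[i];

        for (int i = 0; i < n; ++i) {           /* run callbacks on the snapshot */
            int rc = snap[i].delete_fn(snap[i].keyval, snap[i].value);
            attr_unset(snap[i].keyval);
            if (rc != 0)
                return rc;                      /* surface the first failure */
        }
        return 0;
    }

    /* Example callback that itself removes another attribute mid-loop; the
     * snapshot keeps the iteration well-defined, and the other attribute's
     * callback is still invoked because it existed when the loop started. */
    static int chatty_delete(int keyval, void *value)
    {
        (void)value;
        printf("deleting key %d\n", keyval);
        if (keyval == 1)
            attr_unset(2);
        return 0;
    }

    int main(void)
    {
        attrs[0] = (attr_entry_t){1, NULL, chatty_delete, 1};
        attrs[1] = (attr_entry_t){2, NULL, chatty_delete, 1};
        return attr_delete_all_sketch();
    }

The same snapshot idea is why, as George notes, only what was in the object's attribute hash when the function was entered is ever visited, regardless of what the callbacks do while the loop runs.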