Re: [OMPI users] 3. Re: Hwloc error with Openmpi 1.8.3 on AMD 64 (Brice Goglin)

2014-12-20 Thread Brice Goglin
It's likely a BIOS bug.
But I can't say more until you send the relevant data as explained earlier.
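
A quick way to narrow it down yourself is to load the topology with hwloc
directly, outside of Open MPI. Below is a minimal sketch (the file name is
illustrative; it assumes the hwloc development headers are installed and
builds with "cc check_topo.c -o check_topo -lhwloc"). If the same warning
appears when this program runs, the inconsistency comes from the platform
information exposed by the BIOS/OS rather than from Open MPI itself.

/* check_topo.c -- standalone hwloc check (file name is illustrative). */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topology;

    if (hwloc_topology_init(&topology) != 0) {
        fprintf(stderr, "hwloc_topology_init failed\n");
        return 1;
    }
    /* Discovery happens here; any inconsistency in the information
     * reported by the BIOS/OS is detected while the topology loads. */
    if (hwloc_topology_load(topology) != 0) {
        fprintf(stderr, "hwloc_topology_load failed\n");
        hwloc_topology_destroy(topology);
        return 1;
    }
    printf("topology loaded, depth = %d\n",
           (int) hwloc_topology_get_depth(topology));
    hwloc_topology_destroy(topology);
    return 0;
}

Running lstopo from the same hwloc installation is the equivalent
command-line check.
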
Brice



On 20/12/2014 18:10, Sergio Manzetti wrote:
> Dear Brice, the BIOS is the latest available. However, I wonder whether this
> could be a hardware error, as the Open MPI sources suggest. Is there any
> way to find out whether this is a hardware error?
>
> Thanks
>

[OMPI users] 3. Re: Hwloc error with Openmpi 1.8.3 on AMD 64 (Brice Goglin)

2014-12-20 Thread Sergio Manzetti
Dear Brice, the BIOS is the latest available. However, I wonder whether this
could be a hardware error, as the Open MPI sources suggest. Is there any way to
find out whether this is a hardware error?

Thanks


> From: users-requ...@open-mpi.org
> Subject: users Digest, Vol 3074, Issue 1
> To: us...@open-mpi.org
> Date: Sat, 20 Dec 2014 12:00:02 -0500
> 
> Send users mailing list submissions to
>   us...@open-mpi.org
> 
> To subscribe or unsubscribe via the World Wide Web, visit
>   http://www.open-mpi.org/mailman/listinfo.cgi/users
> or, via email, send a message with subject or body 'help' to
>   users-requ...@open-mpi.org
> 
> You can reach the person managing the list at
>   users-ow...@open-mpi.org
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of users digest..."
> 
> 
> Today's Topics:
> 
>1. Re: Deadlock in OpenMPI 1.8.3 and PETSc 3.4.5
>   (Jeff Squyres (jsquyres))
>2. Hwloc error with Openmpi 1.8.3 on AMD 64 (Sergio Manzetti)
>3. Re: Hwloc error with Openmpi 1.8.3 on AMD 64 (Brice Goglin)
>4. best function to send data (Diego Avesani)
> 
> 
> --
> 
> Message: 1
> Date: Fri, 19 Dec 2014 19:26:58 +
> From: "Jeff Squyres (jsquyres)" 
> To: "Open MPI User's List" 
> Cc: "petsc-ma...@mcs.anl.gov" 
> Subject: Re: [OMPI users] Deadlock in OpenMPI 1.8.3 and PETSc 3.4.5
> Message-ID: <027ab453-de85-4f08-bdd7-a676ca90e...@cisco.com>
> Content-Type: text/plain; charset="us-ascii"
> 
> On Dec 19, 2014, at 10:44 AM, George Bosilca  wrote:
> 
> > Regarding your second point, while I do tend to agree that such an issue is 
> > better addressed in the MPI Forum, the last attempt to fix this was 
> > certainly not a resounding success.
> 
> Yeah, fair enough -- but it wasn't a failure, either.  It could definitely be 
> moved forward, but it will take time/effort, which I unfortunately don't 
> have. I would be willing, however, to spin up someone who *does* have 
> time/effort available to move the proposal forward.
> 
> > Indeed, there is a slight window of opportunity for inconsistencies in the 
> > recursive behavior.
> 
> You're right; it's a small window in the threading case, but a) that's the 
> worst kind :-), and b) the non-threaded case is actually worse (because the 
> global state can change from underneath the loop).
> 
> > But the inconsistencies were already in the code, especially in the
> > single-threaded case. As we never received any complaints related to this
> > topic, I did not deem it worthwhile to address them in my last commit. Moreover, 
> > the specific behavior needed by PETSc is available in Open MPI when 
> > compiled without thread support, as the only thing that "protects" the 
> > attributes is that global mutex.
> 
> Mmmm.  Ok, I see your point.  But this is a (very) slippery slope.
> 
> > For example, in ompi_attr_delete_all(), it gets the count of all attributes 
> > and then loops that many times to delete each attribute.  But each attribute 
> > callback can now insert or delete attributes on that entity.  This can mean 
> > that the loop can either fail to delete an attribute (because some 
> > attribute callback already deleted it) or fail to delete *all* attributes 
> > (because some attribute callback added more).
> > 
> > To be extremely precise the deletion part is always correct
> 
> ...as long as the hash map is not altered from the application (e.g., by 
> adding or deleting another attribute during a callback).
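
To make that concrete, the problematic pattern is an application-level delete
callback that itself modifies the attribute set of the object being torn down.
A minimal sketch against the standard MPI keyval interface (the keyval
variables and callback name are illustrative; whether this is well defined is
exactly the open question in the discussion above):

#include <mpi.h>

static int keyval_a, keyval_b;   /* illustrative keyvals */

/* Delete callback attached to keyval_a: it removes *another* attribute on
 * the same communicator while the implementation may still be iterating
 * over that communicator's attribute hash (e.g. inside MPI_Comm_free). */
static int delete_a(MPI_Comm comm, int keyval, void *attr_val, void *extra)
{
    (void) keyval; (void) attr_val; (void) extra;
    MPI_Comm_delete_attr(comm, keyval_b);
    return MPI_SUCCESS;
}

int main(int argc, char **argv)
{
    int val_a = 1, val_b = 2;
    MPI_Comm dup;

    MPI_Init(&argc, &argv);
    MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, delete_a, &keyval_a, NULL);
    MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN,
                           &keyval_b, NULL);

    MPI_Comm_dup(MPI_COMM_WORLD, &dup);
    MPI_Comm_set_attr(dup, keyval_a, &val_a);
    MPI_Comm_set_attr(dup, keyval_b, &val_b);

    /* Freeing the communicator runs the delete callbacks; delete_a then
     * changes the attribute set behind the implementation's back. */
    MPI_Comm_free(&dup);

    MPI_Finalize();
    return 0;
}
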
> 
> I understand that you mention above that you're not worried about this case.  
> I'm just picking here because there is quite definitely a case where the loop 
> is *not* correct.  PETSc apparently doesn't trigger this badness, but... like 
> I said above, it's a (very) slippery slope.
> 
> > as it copies the values to be deleted into a temporary array before calling 
> > any callbacks (and before releasing the mutex), so we only remove what was 
> > in the object attribute hash when the function was called. Don't
> > misunderstand: we have an extremely good reason to do it this way; we need
> > to call the callbacks in the order in which they were created (as mandated by 
> > the MPI standard).
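
A self-contained illustration of the snapshot approach described above -- this
is not the actual Open MPI code, and every name in it is invented for the
sketch -- showing why entries added or removed by a callback no longer affect
which entries the loop deletes:

/* Toy "attribute table" kept in creation order; all names are invented. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { int keyval; void (*delete_fn)(int); } entry_t;

static entry_t table[16];
static int     table_len = 0;

static void say_deleted(int keyval) { printf("delete callback for key %d\n", keyval); }

static void delete_all_attrs(void)
{
    /* 1. Copy the entries that exist at call time, in creation order
     *    (in the real code this happens while holding the mutex). */
    int count = table_len;
    entry_t *snapshot = malloc(count * sizeof(*snapshot));
    for (int i = 0; i < count; i++)
        snapshot[i] = table[i];

    /* 2. Invoke the callbacks on the snapshot.  Whatever a callback adds
     *    to or removes from `table`, this loop only ever deletes what was
     *    present when delete_all_attrs() was entered. */
    for (int i = 0; i < count; i++)
        snapshot[i].delete_fn(snapshot[i].keyval);

    table_len = 0;          /* the snapshotted entries are now gone */
    free(snapshot);
}

int main(void)
{
    for (int k = 0; k < 3; k++) {
        table[table_len].keyval    = k;
        table[table_len].delete_fn = say_deleted;
        table_len++;
    }
    delete_all_attrs();     /* callbacks fire for keys 0, 1, 2 in creation order */
    return 0;
}
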
> >  
> > ompi_attr_copy_all() has similar problems -- in general, the hash that it 
> > is looping over can change underneath it.
> > 
> > For the copy it is a little bit more tricky, as the calling order is not 
> > imposed. Our peculiar implementation of the hash table (backed by an array)
> > makes the code work, with a single (possibly minor) exception when the hash
> > table itself is grown between two calls. However, as stated before, this issue was 
> > already present in the code in single-threaded cases for years. Addressing 
> > it is another two-line patch, but I leave this exercise to an interested 
> > reader.
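
For readers who want to see where the copy path is exercised: communicator
duplication invokes the copy callback of every attached keyval, and the
callback decides whether (and how) its attribute propagates to the new
communicator. A minimal sketch with the standard API (the names are
illustrative):

#include <mpi.h>
#include <stdio.h>

/* Copy callback: propagate the attribute pointer to the duplicate. */
static int copy_fn(MPI_Comm oldcomm, int keyval, void *extra,
                   void *attr_in, void *attr_out, int *flag)
{
    (void) oldcomm; (void) keyval; (void) extra;
    *(void **) attr_out = attr_in;   /* shallow copy of the pointer */
    *flag = 1;                       /* 1 = do copy this attribute  */
    return MPI_SUCCESS;
}

int main(int argc, char **argv)
{
    int keyval, value = 42, *got, flag;
    MPI_Comm dup;

    MPI_Init(&argc, &argv);
    MPI_Comm_create_keyval(copy_fn, MPI_COMM_NULL_DELETE_FN, &keyval, NULL);
    MPI_Comm_set_attr(MPI_COMM_WORLD, keyval, &value);

    /* Duplication runs the copy callbacks of all attached keyvals. */
    MPI_Comm_dup(MPI_COMM_WORLD, &dup);
    MPI_Comm_get_attr(dup, keyval, &got, &flag);
    if (flag)
        printf("attribute copied to the duplicate: %d\n", *got);

    MPI_Comm_free(&dup);
    MPI_Comm_free_keyval(&keyval);
    MPI_Finalize();
    return 0;
}
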
> 
> Yeah, thanks for that.  :-)
> 
> To