Ralph and George,
here are attached two patches :
- heterogeneous.v1.patch : a cleanup of the previous patch
- heterogeneous.v2.patch : a new patch based on Ralph suggestion. i made
the minimal changes to move jobid and vpid into the OPAL layer.
Cheers,
Gilles
On 2014/08/07 11:27, Ralph Castain
Are we maybe approaching this from the wrong direction? I ask because we had to
do some gyrations in the pmix framework to work around the difference in naming
schemes between OPAL and the rest of the code base, and now we have more
gyrations here.
Given that the MPI and RTE layers both rely on
Gilles,
This looks right. It is really unfortunately that we have to change the
definition of orte_process_name_t for big endian architectures, but I don't
think there is a way around.
Regarding your patch I have two comments:
1. There is a flagrant lack of comments ... especially on the ORTE sid
Ralph and George,
here is attached a patch that fixes the heterogeneous support without
the abstraction violation.
Cheers,
Gilles
On 2014/08/06 9:40, Gilles Gouaillardet wrote:
> hummm
>
> i intentionally did not swap the two 32 bits (!)
>
> from the top level, what we have is :
>
> typedef str
hummm
i intentionally did not swap the two 32 bits (!)
from the top level, what we have is :
typedef struct {
union {
uint64_t opal;
struct {
uint32_t jobid;
uint32_t vpid;
} orte;
} meta_process_name_t;
OPAL is agnostic about jobid and vpid.
jobid an
Ah yes, so it is - sorry I missed that last test :-/
On Aug 5, 2014, at 10:50 AM, George Bosilca wrote:
> The code committed by Gilles is correctly protected for big endian
> (https://svn.open-mpi.org/trac/ompi/changeset/32425). I was merely pointing
> out that I think he should also swap the
The code committed by Gilles is correctly protected for big endian (
https://svn.open-mpi.org/trac/ompi/changeset/32425). I was merely pointing
out that I think he should also swap the 2 32 bits in his implementation.
George.
On Tue, Aug 5, 2014 at 1:32 PM, Ralph Castain wrote:
>
> On Aug 5
On Aug 5, 2014, at 10:23 AM, George Bosilca wrote:
> On Tue, Aug 5, 2014 at 1:15 PM, Ralph Castain wrote:
> Hmmm...wouldn't that then require that you know (a) the other side is little
> endian, and (b) that you are on a big endian? Otherwise, you wind up with the
> same issue in reverse, yes
On Tue, Aug 5, 2014 at 1:15 PM, Ralph Castain wrote:
> Hmmm...wouldn't that then require that you know (a) the other side is
> little endian, and (b) that you are on a big endian? Otherwise, you wind up
> with the same issue in reverse, yes?
>
This is similar to the 32 bits ntohl that we are usi
Hmmm...wouldn't that then require that you know (a) the other side is little
endian, and (b) that you are on a big endian? Otherwise, you wind up with the
same issue in reverse, yes?
In the ORTE methods, we explicitly set the fields (e.g., jobid =
ntohl(remote-jobid)) to get around this problem
Technically speaking, converting a 64 bits to a big endian representation
requires the swap of the 2 32 bits parts. So the correct approach would
have been:
uint64_t htonll(uint64_t v)
{
return uint64_t)ntohl(n)) << 32 | (uint64_t)ntohl(n >> 32));
}
George.
On Tue, Aug 5, 2014 at 5:52
FWIW: that's exactly how we do it in ORTE
On Aug 4, 2014, at 10:25 PM, Gilles Gouaillardet
wrote:
> George,
>
> i confirm there was a problem when running on an heterogeneous cluster,
> this is now fixed in r32425.
>
> i am not convinced i chose the most elegant way to achieve the desired res
George,
i confirm there was a problem when running on an heterogeneous cluster,
this is now fixed in r32425.
i am not convinced i chose the most elegant way to achieve the desired
result ...
could you please double check this commit ?
Thanks,
Gilles
On 2014/08/02 0:14, George Bosilca wrote:
>
Gilles,
The design of the BTL move was to let the opal_process_name_t be agnostic to
what is stored inside, and all accesses should be done through the provided
accessors. Thus, big endian or little endian doesn’t make a difference, as long
as everything goes through the accessors.
I’m skeptic
Oh, should point out: I didn't deal with the potential btl/tcp issue you noted
- I defer that to George
On Aug 1, 2014, at 7:56 AM, Ralph Castain wrote:
> Hi Gilles
>
> I'm not sure if we have a problem or not - we'll have to wait and see, I
> guess. So far, I'm not seeing any problems on x8
Hi Gilles
I'm not sure if we have a problem or not - we'll have to wait and see, I guess.
So far, I'm not seeing any problems on x86 archs, but that's to be expected and
I don't have access to anything else.
I fixed the issues you noted plus a few others I found. I imagine we'll
discover more
one last point :
in orte_process_name_t, jobid and vpid have type orte_jobid_t and
orte_vpid_t which really is uint32_t.
in orte/util/proc.c, the function pointers opal_process_name_vpid and
opal_process_name_jobid
return an int32_t
should it be an uint32_t instead ?
/* and then _process_name_jo
George and Ralph,
i am very confused whether there is an issue or not.
anyway, today Paul and i ran basic tests on big endian machines and did
not face any issue related to big endianness.
so i made my homework, digged into the code, and basically,
opal_process_name_t is used as an orte_process
The underlying structure changed, so a little bit of fiddling is normal.
Instead of using a field in the ompi_proc_t you are now using a field down
in opal_proc_t, a field that simply cannot have the same type as before
(orte_process_name_t).
George.
On Wed, Jul 30, 2014 at 12:19 PM, Ralph Ca
George - my point was that we regularly tested using the method in that
routine, and now we have to do something a little different. So it is an
"issue" in that we have to make changes across the code base to ensure we do
things the "new" way, that's all
On Jul 30, 2014, at 9:17 AM, George Bosi
No, this is not going to be an issue if the opal_identifier_t is used
correctly (aka only via the exposed accessors).
George.
On Wed, Jul 30, 2014 at 12:09 PM, Ralph Castain wrote:
> Yeah, my fix won't work for big endian machines - this is going to be an
> issue across the code base now, s
Yeah, my fix won't work for big endian machines - this is going to be an issue
across the code base now, so we'll have to troll and fix it. I was doing the
minimal change required to fix the trunk in the meantime.
On Jul 30, 2014, at 9:06 AM, George Bosilca wrote:
> Yes. opal_process_name_t ha
Yes. opal_process_name_t has basically no meaning by itself, it is a 64
bits storage location used by the upper layer to save some local key that
can be later used to extract information. Calling the OPAL level compare
function might be a better fit there.
George.
On Wed, Jul 30, 2014 at 11:5
Ralph,
was it really that simple ?
proc_temp->super.proc_name has type opal_process_name_t :
typedef opal_identifier_t opal_process_name_t;
typedef uint64_t opal_identifier_t;
*but*
item_ptr->peer has type orte_process_name_t :
struct orte_process_name_t {
orte_jobid_t jobid;
orte_vpid_t
devel] OMPI devel] trunk compilation errors in jenkins
I just fixed this one - all that was required was an ampersand as the name was
being passed into the function instead of a pointer to the name
r32357
On Jul 30, 2014, at 7:43 AM, Gilles GOUAILLARDET
mailto:gilles.gouaillar...@gmail.com>>
I just fixed this one - all that was required was an ampersand as the name was
being passed into the function instead of a pointer to the name
r32357
On Jul 30, 2014, at 7:43 AM, Gilles GOUAILLARDET
wrote:
> Rolf,
>
> r32353 can be seen as a suspect...
> Even if it is correct, it might have
Rolf,
r32353 can be seen as a suspect...
Even if it is correct, it might have exposed the bug discussed in #4815 even
more (e.g. we hit the bug 100% after the fix)
does the attached patch to #4815 fixes the problem ?
If yes, and if you see this issue as a showstopper, feel free to commit it and
27 matches
Mail list logo