Re: [ofa-general] synchronize commands issued to MTHCA

2008-01-04 Thread akepner
On Fri, Jan 04, 2008 at 02:43:57PM -0600, Yicheng Jia wrote:
> > The mmiowb() is definitely necessary, because without it then commands
> > were getting messed up on large Altix systems.
> 
> I'm using a dual-core Xeon, and I just grepped for "mmiowb()" in kernel 
> 2.6.23 (include/asm-x86_64/io.h) and found that this function does nothing 
> on the x86_64 platform. Is that true?
> 

Yes. It's a no-op for most architectures.

-- 
Arthur

___
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [ofa-general] synchronize commands issued to MTHCA

2008-01-04 Thread Roland Dreier
 > I'm using a dual-core Xeon, and I just grepped for "mmiowb()" in kernel 
 > 2.6.23 (include/asm-x86_64/io.h) and found that this function does nothing 
 > on the x86_64 platform. Is that true?

Yes -- this is why I kept referring to large SGI Altix systems.

 - R.


Re: [ofa-general] synchronize commands issued to MTHCA

2008-01-04 Thread Yicheng Jia
> The mmiowb() is definitely necessary, because without it then commands
> were getting messed up on large Altix systems.

I'm using a dual-core Xeon, and I just grepped for "mmiowb()" in kernel 
2.6.23 (include/asm-x86_64/io.h) and found that this function does nothing 
on the x86_64 platform. Is that true?

Thanks!
Yicheng




Roland Dreier <[EMAIL PROTECTED]> 
01/02/2008 02:52 PM

To
Yicheng Jia <[EMAIL PROTECTED]>
cc
[email protected], Jack Morgenstein <[EMAIL PROTECTED]>
Subject
Re: [ofa-general] synchronize commands issued to MTHCA






 > Could you tell me what's the difference between "wmb()" and "mmiowb()". I 
 > notice that ofa-1.3 has added "mmiowb()" at the end of mthca_cmd_post, 
 > since "wmb()" is already called at the end of cmd_post, is "mmiowb()" 
 > really necessary?

wmb() orders writes from the same CPU -- it prevents highly
out-of-order architectures from making writes visible in an order
different from program order.  mmiowb() orders MMIO writes between
different CPUs, and prevents problems on systems (such as SGI Altix)
where the CPU fabric may reorder writes before they reach the IO bus.

The mmiowb() is definitely necessary, because without it then commands
were getting messed up on large Altix systems.

 - R.

_
Scanned by IBM Email Security Management Services powered by MessageLabs. 
For more information please visit http://www.ers.ibm.com
_


Re: [ofa-general] synchronize commands issued to MTHCA

2008-01-02 Thread Yicheng Jia
> What is the call chain that calls SW2HW_MPT in this case?
SW2HW_MPT is called by the mthca_mr_alloc function. This function first 
calls "mthca_alloc" to get an MR key, then "mthca_table_get" to get an 
MR ICM entry, then "mthca_alloc_mailbox" to allocate a mailbox for the 
command. Meanwhile, the MAD completion handler 
"ib_mad_recv_done_handler" is also running; it processes the MAD_IFC 
command and sends the response, and all of this completes without any 
error being reported. Also, for your information, I'm using two dual-core 
Xeon CPUs to run the driver.
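
The call chain described above can be sketched in user-space C as follows. This is only an illustration under stated assumptions, not the mthca source: every sk_* name and the mr_sketch structure are hypothetical stubs standing in for mthca_alloc, mthca_table_get, mthca_alloc_mailbox and the SW2HW_MPT command, and the unwinding order on failure is an assumption about how such a sequence is usually cleaned up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state: which resources the sketch currently holds. */
struct mr_sketch {
    bool key_held;      /* stands in for the MR key from mthca_alloc */
    bool icm_held;      /* stands in for the ICM entry from mthca_table_get */
    bool mailbox_held;  /* stands in for the mailbox from mthca_alloc_mailbox */
    bool fail_cmd;      /* set to force the simulated SW2HW_MPT to fail */
};

static int  sk_alloc_key(struct mr_sketch *s)     { s->key_held = true; return 0; }
static void sk_free_key(struct mr_sketch *s)      { s->key_held = false; }
static int  sk_table_get(struct mr_sketch *s)     { s->icm_held = true; return 0; }
static void sk_table_put(struct mr_sketch *s)     { s->icm_held = false; }
static int  sk_alloc_mailbox(struct mr_sketch *s) { s->mailbox_held = true; return 0; }
static void sk_free_mailbox(struct mr_sketch *s)  { s->mailbox_held = false; }

/* Simulated firmware command: returns nonzero when told to fail. */
static int sk_sw2hw_mpt(struct mr_sketch *s) { return s->fail_cmd ? -1 : 0; }

/* Key, then ICM entry, then mailbox, then the command; unwind in reverse. */
int sk_mr_alloc(struct mr_sketch *s)
{
    int err;

    assert(s != NULL);
    err = sk_alloc_key(s);
    if (err)
        return err;
    err = sk_table_get(s);
    if (err)
        goto err_key;
    err = sk_alloc_mailbox(s);
    if (err)
        goto err_table;

    err = sk_sw2hw_mpt(s);
    sk_free_mailbox(s);      /* the mailbox is only needed for the command */
    if (err)
        goto err_table;
    return 0;

err_table:
    sk_table_put(s);
err_key:
    sk_free_key(s);
    return err;
}
```

With fail_cmd set, sk_mr_alloc() returns nonzero and releases everything it acquired, which is the path a "bad parameter" status from SW2HW_MPT would take in this sketch.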

> Also are you going through the mthca_cmd_post_dbell() or 
> mthca_cmd_post_hcr() code to write the command params to the HCA?
Yes. I found there's a small difference between these two functions. 
There are two "wmb()" calls in mthca_cmd_post_dbell() but only one 
"wmb()" in mthca_cmd_post_hcr(). Is there any particular reason for that?

> I think the best way to debug this would be to work directly with 
> Mellanox to get a debug build of the HCA firmware and get definite info on 
> why the SW2HW_MPT command is failing.
Do you know whom I should contact?

Thanks!
Yicheng




Roland Dreier <[EMAIL PROTECTED]> 
01/02/2008 02:55 PM

To
Yicheng Jia <[EMAIL PROTECTED]>
cc
Jack Morgenstein <[EMAIL PROTECTED]>, [email protected]
Subject
Re: [ofa-general] synchronize commands issued to MTHCA






 > The SW2HW_MPT command is issued while the UDAV table is being created. 
 > While the driver is waiting for the command to complete, it does many 
 > other things: it creates a send MAD packet, posts a send MAD request to 
 > the SQ, and posts another receive MAD request to the RQ. None of these 
 > actions reports an error. Afterwards, however, the HCA reports a command 
 > parameter error for SW2HW_MPT.

I doubt the problem is creating the UD address vector -- that is just
shuffling some things around in the CPU's memory.  It seems more
likely that posting a send or receive request is messing things up
somehow.  What is the call chain that calls SW2HW_MPT in this case?
Also are you going through the mthca_cmd_post_dbell() or mthca_cmd_post_hcr()
code to write the command params to the HCA?

I think the best way to debug this would be to work directly with
Mellanox to get a debug build of the HCA firmware and get definite
info on why the SW2HW_MPT command is failing.

 - R.


Re: [ofa-general] synchronize commands issued to MTHCA

2008-01-02 Thread Yicheng Jia
> I wouldn't think so, although I don't have full details of how your
> hardware behaves to know for sure.  I assume your PCI bus/memory
> controller is already smart enough to deal with HCR writes being
> interleaved with writes to a doorbell page from userspace, so it seems
> that writes to locally attached memory should be OK too, as long as
> the HCR writes are word-sized in the right order etc.

For the problem I've seen, the HCR writes are most probably getting mixed 
up with doorbell register rings. Is that possible? The FW version I'm 
using is 1.1.0, without the debug trace function. This problem is really 
hard to debug since it occurs in real time and not very often, and hooking 
up a PCIe bus analyzer is hard too, since by the time the error happens the 
PCIe transaction has already completed. All I get from the HCA is a bad 
parameter error. Is there any way to get more info from the HCA?

Thanks!
Yicheng




Roland Dreier <[EMAIL PROTECTED]> 
01/02/2008 12:13 PM

To
Jack Morgenstein <[EMAIL PROTECTED]>
cc
[email protected], Yicheng Jia <[EMAIL PROTECTED]>
Subject
Re: [ofa-general] synchronize commands issued to MTHCA






 > Roland, do you think that the memcpy_toio() call might mess things up?

I wouldn't think so, although I don't have full details of how your
hardware behaves to know for sure.  I assume your PCI bus/memory
controller is already smart enough to deal with HCR writes being
interleaved with writes to a doorbell page from userspace, so it seems
that writes to locally attached memory should be OK too, as long as
the HCR writes are word-sized in the right order etc.

 > Maybe we need "wmb()" or "mmiowb()" here as well?

I don't see any reason, although I often miss things.  It seems that
the only thing that cares about the writes of the address info being
done would be posting a send WQE that uses it, and that should already
have sufficient ordering.  What would we be ordering things against?

 - R.


Re: [ofa-general] synchronize commands issued to MTHCA

2008-01-02 Thread Roland Dreier
 > The SW2HW_MPT command is issued while the UDAV table is being created. 
 > While the driver is waiting for the command to complete, it does many 
 > other things: it creates a send MAD packet, posts a send MAD request to 
 > the SQ, and posts another receive MAD request to the RQ. None of these 
 > actions reports an error. Afterwards, however, the HCA reports a command 
 > parameter error for SW2HW_MPT.

I doubt the problem is creating the UD address vector -- that is just
shuffling some things around in the CPU's memory.  It seems more
likely that posting a send or receive request is messing things up
somehow.  What is the call chain that calls SW2HW_MPT in this case?
Also are you going through the mthca_cmd_post_dbell() or mthca_cmd_post_hcr()
code to write the command params to the HCA?

I think the best way to debug this would be to work directly with
Mellanox to get a debug build of the HCA firmware and get definite
info on why the SW2HW_MPT command is failing.

 - R.


Re: [ofa-general] synchronize commands issued to MTHCA

2008-01-02 Thread Roland Dreier
 > Could you tell me what's the difference between "wmb()" and "mmiowb()". I 
 > notice that ofa-1.3 has added "mmiowb()" at the end of mthca_cmd_post, 
 > since "wmb()" is already called at the end of cmd_post, is "mmiowb()" 
 > really necessary?

wmb() orders writes from the same CPU -- it prevents highly
out-of-order architectures from making writes visible in an order
different from program order.  mmiowb() orders MMIO writes between
different CPUs, and prevents problems on systems (such as SGI Altix)
where the CPU fabric may reorder writes before they reach the IO bus.

The mmiowb() is definitely necessary, because without it then commands
were getting messed up on large Altix systems.
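
To make the placement of the two barriers concrete, here is a minimal user-space sketch in C. It is not the mthca source: sketch_dev, sketch_post, sketch_wmb and sketch_mmiowb are hypothetical names, the HCR is simulated by an ordinary array, and the two barrier macros are stand-ins (a C11 release fence, and a no-op as reported for x86_64). It only illustrates the ordering: parameter words first, a write barrier, then the word that starts the command, then the MMIO barrier before the lock is dropped.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel barriers (assumptions, not kernel code). */
#define sketch_wmb()    atomic_thread_fence(memory_order_release)
#define sketch_mmiowb() do { } while (0)   /* no-op, as reported for x86_64 */

#define SKETCH_HCR_WORDS 7

struct sketch_dev {
    pthread_mutex_t   hcr_mutex;               /* one CPU writes the HCR at a time */
    volatile uint32_t hcr[SKETCH_HCR_WORDS];   /* simulated command register */
};

/* Zero the simulated register and set up the mutex; returns 0 on success. */
int sketch_dev_init(struct sketch_dev *dev)
{
    assert(dev != NULL);
    for (int i = 0; i < SKETCH_HCR_WORDS; i++)
        dev->hcr[i] = 0;
    return pthread_mutex_init(&dev->hcr_mutex, NULL);
}

/* params holds SKETCH_HCR_WORDS - 1 words; go_word is written last. */
void sketch_post(struct sketch_dev *dev, const uint32_t *params, uint32_t go_word)
{
    assert(dev != NULL && params != NULL);
    pthread_mutex_lock(&dev->hcr_mutex);

    /* 1. Write every parameter word. */
    for (int i = 0; i < SKETCH_HCR_WORDS - 1; i++)
        dev->hcr[i] = params[i];

    /* 2. Same-CPU ordering: parameters must be visible before the go word. */
    sketch_wmb();
    dev->hcr[SKETCH_HCR_WORDS - 1] = go_word;

    /* 3. Cross-CPU MMIO ordering: finish our writes before the next lock
     *    holder (possibly on another CPU) starts writing its own command. */
    sketch_mmiowb();

    pthread_mutex_unlock(&dev->hcr_mutex);
}
```

The point of step 3 is that a mutex alone orders the CPUs but, on a machine like the Altix described above, may not order their MMIO writes on the way to the IO bus.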

 - R.


Re: [ofa-general] synchronize commands issued to MTHCA

2008-01-02 Thread Yicheng Jia
Hi Roland,

Could you tell me what's the difference between "wmb()" and "mmiowb()". I 
notice that ofa-1.3 has added "mmiowb()" at the end of mthca_cmd_post, 
since "wmb()" is already called at the end of cmd_post, is "mmiowb()" 
really necessary?

Thanks!
Yicheng




Roland Dreier <[EMAIL PROTECTED]> 
01/02/2008 12:13 PM

To
Jack Morgenstein <[EMAIL PROTECTED]>
cc
[email protected], Yicheng Jia <[EMAIL PROTECTED]>
Subject
Re: [ofa-general] synchronize commands issued to MTHCA






 > Roland, do you think that the memcpy_toio() call might mess things up?

I wouldn't think so, although I don't have full details of how your
hardware behaves to know for sure.  I assume your PCI bus/memory
controller is already smart enough to deal with HCR writes being
interleaved with writes to a doorbell page from userspace, so it seems
that writes to locally attached memory should be OK too, as long as
the HCR writes are word-sized in the right order etc.

 > Maybe we need "wmb()" or "mmiowb()" here as well?

I don't see any reason, although I often miss things.  It seems that
the only thing that cares about the writes of the address info being
done would be posting a send WQE that uses it, and that should already
have sufficient ordering.  What would we be ordering things against?

 - R.


Re: [ofa-general] synchronize commands issued to MTHCA

2008-01-02 Thread Roland Dreier
 > Roland, do you think that the memcpy_toio() call might mess things up?

I wouldn't think so, although I don't have full details of how your
hardware behaves to know for sure.  I assume your PCI bus/memory
controller is already smart enough to deal with HCR writes being
interleaved with writes to a doorbell page from userspace, so it seems
that writes to locally attached memory should be OK too, as long as
the HCR writes are word-sized in the right order etc.

 > Maybe we need "wmb()" or "mmiowb()" here as well?

I don't see any reason, although I often miss things.  It seems that
the only thing that cares about the writes of the address info being
done would be posting a send WQE that uses it, and that should already
have sufficient ordering.  What would we be ordering things against?

 - R.


Re: [ofa-general] synchronize commands issued to MTHCA

2008-01-02 Thread Roland Dreier
 > Actually, I'm working on porting the IB driver to the QNX platform.

I see.  My opinion is that in the long term, you're better off writing
a "native" QNX driver rather than trying to port a driver from another
OS, although I understand that sometimes short-term issues make doing
the right thing impossible.

 > However, I still get a command execution error which I believe is related 
 > to command synchronization. The problem is that when "Created UDAV" 
 > happens while the SW2HW_MPT command is being executed, the SW2HW_MPT 
 > command returns with a bad parameter error. Here is my debug trace output:

No idea really.  Does the Linux mthca work on the same hardware?  If
so I guess you would have to figure out how the behavior of your
driver is different.  If you don't have Linux running on your platform
then you just need to debug the driver/hardware ... perhaps hardware
bus analysis would be helpful to understand what's happening.

 - R.


Re: [ofa-general] synchronize commands issued to MTHCA

2008-01-02 Thread Yicheng Jia
Hi Jack,

Thanks for your reply. The HCA I'm using is memory-free: the chip is 
MT25204 and the HCA type is arbel, so it doesn't go through the "if 
(ah->type == MTHCA_AH_ON_HCA)" part of the code. Checking the debug output 
gave me more details about this problem:

The SW2HW_MPT command is issued while the UDAV table is being created. 
While the driver is waiting for the command to complete, it does many 
other things: it creates a send MAD packet, posts a send MAD request to 
the SQ, and posts another receive MAD request to the RQ. None of these 
actions reports an error. Afterwards, however, the HCA reports a command 
parameter error for SW2HW_MPT.

I've copied a snippet of the debug trace output from when this error 
happens; hopefully it will help to spot the cause.

139903841835 HCR CMD: op_code:  LE: d
139903861104 TRACE: mad.c:639/ib_mad_recv_done_handler
139903890876 HCR CMD: in_param_h:   LE: 0
139903942869 TRACE: mad.c:644/ib_mad_recv_done_handler
139903993296 HCR CMD: in_param_l:   LE: cf616000
139904038413 TRACE: verbs.c:182/ib_create_ah_from_wc
139904094753 HCR CMD: input_modifier:   LE: 1e
139904139150 TRACE: mthca_provider.c:447/mthca_ah_create
MTHCA DBG:  Created UDAV at 8075220/:
139904197065 HCR CMD: out_pram_h:   LE: 0
13990443   [ 0] 0105
139904384499 HCR CMD: out_pram_l:   LE: 0
139904428086   [ 4] 
139904478675 HCR CMD: token:LE: 
139904520156   [ 8] 3000
139904572059 HCR CMD: op_code_modifier: LE: 0
139904612802   [ c] 
139904667693 HCR CMD: event:LE: 0
139904708526   [10] 
139904758422 HCR CMD 0x18h: LE=8d, BE=d008000
139904799210   [14] 
139904904204   [18] 
139904946792MTHCA DBG:  HCR_STATUS 40100698= d008000 ? 
8000
   [1c] 0002
139905076860 TRACE: mthca_av.c:235/mthca_create_ah
139905112329 TRACE: mthca_av.c:243/mthca_create_ah
139905147672 TRACE: mthca_provider.c:460/mthca_ah_create
636959 DEBUG:  Start mthca_arbel_post_send. qp 0 wr 8d984b8 
139905324432 TRACE: mthca_qp.c:1911/mthca_arbel_post_send
139905359505 TRACE: mthca_qp.c:1939/mthca_arbel_post_send
139905418932 TRACE: mthca_qp.c:1949/mthca_arbel_post_send
636959 DEBUG:  qp is not direct access and wqe: 0x8d84400 

139905541467 TRACE: mthca_qp.c:1954/mthca_arbel_post_send
139905577647 TRACE: mthca_qp.c:1964/mthca_arbel_post_send
139905614565 TRACE: mthca_qp.c:2057/mthca_arbel_post_send
139905669411 TRACE: mthca_qp.c:2076/mthca_arbel_post_send
139905705726 TRACE: mthca_qp.c:2078/mthca_arbel_post_send
636959 DEBUG:  wr sg length 0x18, lkey 0x80001900, local addr 0xce2393b8
139905831060 TRACE: mthca_qp.c:2078/mthca_arbel_post_send
636959 DEBUG:  wr sg length 0xe8, lkey 0x80001900, local addr 0xce2393d0
139905956322 TRACE: mthca_qp.c:2092/mthca_arbel_post_send
636959 DEBUG:  wr id 148473016
139906069875 TRACE: mthca_qp.c:2120/mthca_arbel_post_send
139906106379 TRACE: mthca_qp.c:2128/mthca_arbel_post_send
139906142892 TRACE: mthca_qp.c:2131/mthca_arbel_post_send
139906178640 TRACE: mthca_qp.c:2135/mthca_arbel_post_send
139906214703 TRACE: mthca_qp.c:2158/mthca_arbel_post_send
139906250568 TRACE: mthca_qp.c:2160/mthca_arbel_post_send
636959 DEBUG:  End mthca_arbel_post_send. err 0
 139906369953 TRACE: mad.c:650/ib_mad_recv_done_handler
139906406295 TRACE: mad.c:669/ib_mad_recv_done_handler
139906441539 TRACE: mad.c:672/ib_mad_recv_done_handler
636959 QNX   DBG:  mad_priv->header.mad_list.mad_queue->list.prev  88b0a2c 
139906578384 TRACE: mthca_qp.c:2177/mthca_arbel_post_receive
139906614168 TRACE: mthca_qp.c:2194/mthca_arbel_post_receive
139906649295 TRACE: mthca_qp.c:2196/mthca_arbel_post_receive
139906689129 TRACE: mad.c:674/ib_mad_recv_done_handler
139906723068 TRACE: mad.c:676/ib_mad_recv_done_handler
636959 QNX   DBG:  kmem_cache 5 free object=88b0724
139906793007 HCR CMD: Status Return:  : 3
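
For readers decoding the trace above, here is a small sketch in C of how the last HCR word and the status byte could be put together and taken apart. The bit positions and the opcode and status values are assumptions recalled from the Linux mthca_cmd.c and mthca_cmd.h, so please check them against your own tree; the SK_ and sk_ names are hypothetical. Later versions of the driver also set a toggle bit, which this sketch leaves out.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values (recalled from the Linux mthca driver, not verified here). */
enum {
    SK_HCR_OPMOD_SHIFT = 12,    /* opcode modifier starts at bit 12 */
    SK_HCR_E_BIT       = 22,    /* "event" bit: completion via event queue */
    SK_HCR_GO_BIT      = 23,    /* "go" bit: hands the command to firmware */
    SK_CMD_SW2HW_MPT   = 0x0d,  /* opcode seen as "op_code: d" in the trace */
    SK_STAT_BAD_PARAM  = 0x03   /* status seen as "Status Return: 3" */
};

/* Reverse the byte order of a 32-bit word (little endian <-> big endian). */
uint32_t sk_swab32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Build the CPU-order value of the last HCR word. */
uint32_t sk_hcr_cmd_word(uint8_t op, uint8_t op_modifier, int event)
{
    assert(op_modifier < 16);   /* keep the modifier inside its assumed field */
    return (1u << SK_HCR_GO_BIT) |
           (event ? (1u << SK_HCR_E_BIT) : 0u) |
           ((uint32_t)op_modifier << SK_HCR_OPMOD_SHIFT) |
           op;
}

/* Extract the status byte from the CPU-order status word (top 8 bits). */
uint8_t sk_hcr_status(uint32_t status_word)
{
    return (uint8_t)(status_word >> 24);
}
```

Under these assumptions, sk_hcr_cmd_word(SK_CMD_SW2HW_MPT, 0, 0) gives 0x0080000d, and byte-swapping that gives 0x0d008000, which is consistent with the "BE=d008000" value in the trace (the trace output appears to have lost some digits, so the match on the LE side cannot be confirmed).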

Again, thanks for your help!

Best,
Yicheng




Jack Morgenstein <[EMAIL PROTECTED]> 
01/01/2008 01:03 AM

To
[email protected]
cc
Yicheng Jia <[EMAIL PROTECTED]>, Roland Dreier <[EMAIL PROTECTED]>
Subject
Re: [ofa-general] synchronize commands issued to MTHCA






On Tuesday 01 January 2008 03:02, Yicheng Jia wrote:

Does your HCA use on-board memory?
(Run: "lspci" and look at "Mellanox" lines.  You have on-board memory
 if you see either:
 PCI bridge: Mellanox Technologies MT23108 InfiniHost HCA 
bridge (rev a1)
 InfiniBand: Mellanox Technologies MT23108 InfiniHost HCA 
(rev a1)
 OR:
   InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor 
compatibility mode)
)

In that case, when you create an AH in kernel space
(file mthca_av.c, procedure mthca_create_ah() ), you will enter the 
following flow:
 if (ah->type == MTHCA_AH_ON_HCA) {
 memcpy_toio(dev->av_table.av_ma

Re: [ofa-general] synchronize commands issued to MTHCA

2007-12-31 Thread Jack Morgenstein
On Tuesday 01 January 2008 03:02, Yicheng Jia wrote:

Does your HCA use on-board memory?
(Run: "lspci" and look at "Mellanox" lines.  You have on-board memory
 if you see either:
PCI bridge: Mellanox Technologies MT23108 InfiniHost HCA bridge (rev a1)
InfiniBand: Mellanox Technologies MT23108 InfiniHost HCA (rev a1)
 OR:
   InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor 
compatibility mode)
)

In that case, when you create an AH in kernel space
(file mthca_av.c, procedure mthca_create_ah() ), you will enter the following 
flow:
        if (ah->type == MTHCA_AH_ON_HCA) {
                memcpy_toio(dev->av_table.av_map + index * MTHCA_AV_SIZE,
                            av, MTHCA_AV_SIZE);
                kfree(av);
        }

Roland, do you think that the memcpy_toio() call might mess things up?

Maybe we need "wmb()" or "mmiowb()" here as well?

- Jack

> Hi Roland,
> 
> Thanks for your reply!
> 
> Actually, I'm working on porting the IB driver to the QNX platform. I have 
> resumed the work started by a former colleague, and I just found that the 
> synchronization code (dev->cmd.poll_sem and dev->cmd.hcr_mutex) had been 
> deleted for an unknown reason. After adding this synchronization code 
> back, the driver runs much more smoothly. 
> 
> However, I still get a command execution error which I believe is related 
> to command synchronization. The problem is that when "Created UDAV" 
> happens while the SW2HW_MPT command is being executed, the SW2HW_MPT 
> command returns with a bad parameter error. Here is my debug trace output:
> 
> 139903841835 HCR CMD: op_code:  LE: d
> 139903861104 TRACE: mad.c:639/ib_mad_recv_done_handler
> 139903890876 HCR CMD: in_param_h:   LE: 0
> 139903942869 TRACE: mad.c:644/ib_mad_recv_done_handler
> 139903993296 HCR CMD: in_param_l:   LE: cf616000
> 139904038413 TRACE: verbs.c:182/ib_create_ah_from_wc
> 139904094753 HCR CMD: input_modifier:   LE: 1e
> 139904139150 TRACE: mthca_provider.c:447/mthca_ah_create
> MTHCA DBG:  Created UDAV at 8075220/:
> 139904197065 HCR CMD: out_pram_h:   LE: 0
> 13990443   [ 0] 0105
> 139904384499 HCR CMD: out_pram_l:   LE: 0
> 139904428086   [ 4] 
> 139904478675 HCR CMD: token:LE: 
> 139904520156   [ 8] 3000
> 139904572059 HCR CMD: op_code_modifier: LE: 0
> 139904612802   [ c] 
> 139904667693 HCR CMD: event:LE: 0
> 139904708526   [10] 
> 139904758422 HCR CMD 0x18h: LE=8d, BE=d008000
> 139904799210   [14] 
> 139904904204   [18] 
> 139904946792MTHCA DBG:  HCR_STATUS 40100698= d008000 ? 
> 8000
>[1c] 0002
> 139905076860 TRACE: mthca_av.c:235/mthca_create_ah
> 139905112329 TRACE: mthca_av.c:243/mthca_create_ah
> 139905147672 TRACE: mthca_provider.c:460/mthca_ah_create
> 
> 139906793007 HCR CMD: Status Return:  : 3
> 
> Do you have any idea?
> 
> Thanks and have a good new year!
> Yicheng
> 
> 
> 
> 
> Roland Dreier <[EMAIL PROTECTED]> 
> 12/28/2007 11:39 PM
> 
> To
> Yicheng Jia <[EMAIL PROTECTED]>
> cc
> [email protected]
> Subject
> Re: [ofa-general] synchronize commands issued to MTHCA
> 
> 
> 
> 
> 
> 
>  > I'm using OFED-1.0 and the problem I believe is related to command 
>  > synchronization of HCA. The host issues a MAD_INF command at first and 
>  > then a SW2HW_MTP command without waiting for the completion of the first 
>  > command. Both of commands return with bad parameters error.
> 
> I guess you mean the MAD_IFC and SW2HW_MPT commands?  I've never heard
> of a problem like that -- more details about your hardware/software
> config and the exact symptoms you see would be helpful in debugging.
> 
> Anyway OFED 1.0 is ancient by now -- you are much better off just
> using drivers from the standard kernel.  If you must use OFED, then
> OFED 1.2 or even a 1.3 prerelease would be better.
> 
>  > My question is why there's no synchronization mechanism for the command 
>  > execution on HCA, can I use "spin_lock" or "sem_wait" to synchronize 
>  > between every command?
> 
> The HCA firmware allows multiple commands to be queued.  The
> dev->cmd.event_sem semaphore is used to limit the number of
> outstanding commands to the HCA's capabilities, and the
> dev->cmd.hcr_mutex mutex is used to serialize the actual writing of
> commands to the HCA.
> 
> There was a mmiowb() added to mthca_cmd_post() fairly recently that
> might fix your problems if you are running on a large SGI Altix system.
> 
>  - R.
> 

Re: [ofa-general] synchronize commands issued to MTHCA

2007-12-31 Thread Yicheng Jia
Hi Roland,

Thanks for your reply!

Actually, I'm working on porting the IB driver to the QNX platform. I have 
resumed the work started by a former colleague, and I just found that the 
synchronization code (dev->cmd.poll_sem and dev->cmd.hcr_mutex) had been 
deleted for an unknown reason. After adding this synchronization code 
back, the driver runs much more smoothly. 

However, I still get a command execution error which I believe is related 
to command synchronization. The problem is that when "Created UDAV" 
happens while the SW2HW_MPT command is being executed, the SW2HW_MPT 
command returns with a bad parameter error. Here is my debug trace output:

139903841835 HCR CMD: op_code:  LE: d
139903861104 TRACE: mad.c:639/ib_mad_recv_done_handler
139903890876 HCR CMD: in_param_h:   LE: 0
139903942869 TRACE: mad.c:644/ib_mad_recv_done_handler
139903993296 HCR CMD: in_param_l:   LE: cf616000
139904038413 TRACE: verbs.c:182/ib_create_ah_from_wc
139904094753 HCR CMD: input_modifier:   LE: 1e
139904139150 TRACE: mthca_provider.c:447/mthca_ah_create
MTHCA DBG:  Created UDAV at 8075220/:
139904197065 HCR CMD: out_pram_h:   LE: 0
13990443   [ 0] 0105
139904384499 HCR CMD: out_pram_l:   LE: 0
139904428086   [ 4] 
139904478675 HCR CMD: token:LE: 
139904520156   [ 8] 3000
139904572059 HCR CMD: op_code_modifier: LE: 0
139904612802   [ c] 
139904667693 HCR CMD: event:LE: 0
139904708526   [10] 
139904758422 HCR CMD 0x18h: LE=8d, BE=d008000
139904799210   [14] 
139904904204   [18] 
139904946792MTHCA DBG:  HCR_STATUS 40100698= d008000 ? 
8000
   [1c] 0002
139905076860 TRACE: mthca_av.c:235/mthca_create_ah
139905112329 TRACE: mthca_av.c:243/mthca_create_ah
139905147672 TRACE: mthca_provider.c:460/mthca_ah_create

139906793007 HCR CMD: Status Return:  : 3

Do you have any idea?

Thanks and have a good new year!
Yicheng




Roland Dreier <[EMAIL PROTECTED]> 
12/28/2007 11:39 PM

To
Yicheng Jia <[EMAIL PROTECTED]>
cc
[email protected]
Subject
Re: [ofa-general] synchronize commands issued to MTHCA






 > I'm using OFED-1.0 and the problem I believe is related to command 
 > synchronization of HCA. The host issues a MAD_INF command at first and 
 > then a SW2HW_MTP command without waiting for the completion of the first 
 > command. Both of commands return with bad parameters error.

I guess you mean the MAD_IFC and SW2HW_MPT commands?  I've never heard
of a problem like that -- more details about your hardware/software
config and the exact symptoms you see would be helpful in debugging.

Anyway OFED 1.0 is ancient by now -- you are much better off just
using drivers from the standard kernel.  If you must use OFED, then
OFED 1.2 or even a 1.3 prerelease would be better.

 > My question is why there's no synchronization mechanism for the command 
 > execution on HCA, can I use "spin_lock" or "sem_wait" to synchronize 
 > between every command?

The HCA firmware allows multiple commands to be queued.  The
dev->cmd.event_sem semaphore is used to limit the number of
outstanding commands to the HCA's capabilities, and the
dev->cmd.hcr_mutex mutex is used to serialize the actual writing of
commands to the HCA.

There was a mmiowb() added to mthca_cmd_post() fairly recently that
might fix your problems if you are running on a large SGI Altix system.

 - R.


Re: [ofa-general] synchronize commands issued to MTHCA

2007-12-28 Thread Roland Dreier
 > I'm using OFED-1.0 and the problem I believe is related to command 
 > synchronization of HCA. The host issues a MAD_INF command at first and 
 > then a SW2HW_MTP command without waiting for the completion of the first 
 > command. Both of commands return with bad parameters error.

I guess you mean the MAD_IFC and SW2HW_MPT commands?  I've never heard
of a problem like that -- more details about your hardware/software
config and the exact symptoms you see would be helpful in debugging.

Anyway OFED 1.0 is ancient by now -- you are much better off just
using drivers from the standard kernel.  If you must use OFED, then
OFED 1.2 or even a 1.3 prerelease would be better.

 > My question is why there's no synchronization mechanism for the command 
 > execution on HCA, can I use "spin_lock" or "sem_wait" to synchronize 
 > between every command?

The HCA firmware allows multiple commands to be queued.  The
dev->cmd.event_sem semaphore is used to limit the number of
outstanding commands to the HCA's capabilities, and the
dev->cmd.hcr_mutex mutex is used to serialize the actual writing of
commands to the HCA.
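
As an illustration of the two mechanisms just described, here is a minimal user-space sketch in C using POSIX primitives. It is an assumption-laden model, not the driver: sk_cmd_ctx and the sk_cmd_* functions are hypothetical, a POSIX semaphore stands in for dev->cmd.event_sem, a pthread mutex stands in for dev->cmd.hcr_mutex, and the HCR is a plain array.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for dev->cmd.event_sem and dev->cmd.hcr_mutex. */
struct sk_cmd_ctx {
    sem_t           event_sem;  /* counts free command slots */
    pthread_mutex_t hcr_mutex;  /* serializes writes to the simulated HCR */
    uint32_t        hcr[2];     /* simulated register: parameter, opcode */
    int             posted;     /* how many commands have been written */
};

/* max_cmds models how many commands the HCA accepts at once. */
int sk_cmd_init(struct sk_cmd_ctx *c, unsigned max_cmds)
{
    assert(c != NULL);
    c->posted = 0;
    c->hcr[0] = c->hcr[1] = 0;
    if (sem_init(&c->event_sem, 0, max_cmds) != 0)
        return -1;
    if (pthread_mutex_init(&c->hcr_mutex, NULL) != 0) {
        sem_destroy(&c->event_sem);
        return -1;
    }
    return 0;
}

/* Blocks while max_cmds commands are outstanding, then writes under the mutex. */
int sk_cmd_post(struct sk_cmd_ctx *c, uint32_t op, uint32_t in_param)
{
    assert(c != NULL);
    if (sem_wait(&c->event_sem) != 0)   /* take one command slot */
        return -1;

    pthread_mutex_lock(&c->hcr_mutex);  /* only one writer at a time */
    c->hcr[0] = in_param;
    c->hcr[1] = op;
    c->posted++;
    pthread_mutex_unlock(&c->hcr_mutex);
    return 0;
}

/* Called when the completion for a command arrives: give the slot back. */
void sk_cmd_complete(struct sk_cmd_ctx *c)
{
    assert(c != NULL);
    sem_post(&c->event_sem);
}

/* Number of slots currently free (for inspection in this sketch only). */
int sk_cmd_free_slots(struct sk_cmd_ctx *c)
{
    int v = 0;
    sem_getvalue(&c->event_sem, &v);
    return v;
}
```

Note that the mutex is released as soon as the command has been written, while the slot is held until completion; that split is what lets several commands be queued at once without their register writes interleaving.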

There was a mmiowb() added to mthca_cmd_post() fairly recently that
might fix your problems if you are running on a large SGI Altix system.

 - R.


[ofa-general] synchronize commands issued to MTHCA

2007-12-24 Thread Yicheng Jia
Hi Folks,

I'm using a Mellanox HCA and I'm a newbie to the IB community. I've run 
into a problem with the HCA board and am looking forward to getting some 
help here.

I'm using OFED-1.0 and the problem I believe is related to command 
synchronization of HCA. The host issues a MAD_INF command at first and 
then a SW2HW_MTP command without waiting for the completion of the first 
command. Both of commands return with bad parameters error.

My question is why there's no synchronization mechanism for the command 
execution on HCA, can I use "spin_lock" or "sem_wait" to synchronize 
between every command?

Thanks!
Yicheng