Re: [ofa-general] synchronize commands issued to MTHCA
On Fri, Jan 04, 2008 at 02:43:57PM -0600, Yicheng Jia wrote:
> > The mmiowb() is definitely necessary, because without it then commands
> > were getting messed up on large Altix systems.
>
> I'm using Duo-core Xeon and I just grep the source of "mmiowb()" in kernel
> 2.6.23 include/asm-x86_64/io.h and found that this function does nothing
> on x86_64 platform, is it true?

Yes. It's a no-op for most architectures.

--
Arthur

___
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] synchronize commands issued to MTHCA
> I'm using Duo-core Xeon and I just grep the source of "mmiowb()" in kernel
> 2.6.23 include/asm-x86_64/io.h and found that this function does nothing
> on x86_64 platform, is it true?

Yes -- this is why I kept referring to large SGI Altix systems.

- R.
Re: [ofa-general] synchronize commands issued to MTHCA
> The mmiowb() is definitely necessary, because without it then commands
> were getting messed up on large Altix systems.

I'm using Duo-core Xeon and I just grep the source of "mmiowb()" in kernel
2.6.23 include/asm-x86_64/io.h and found that this function does nothing
on x86_64 platform, is it true?

Thanks!
Yicheng


Roland Dreier <[EMAIL PROTECTED]>
01/02/2008 02:52 PM

To
Yicheng Jia <[EMAIL PROTECTED]>
cc
[email protected], Jack Morgenstein <[EMAIL PROTECTED]>
Subject
Re: [ofa-general] synchronize commands issued to MTHCA

> Could you tell me what's the difference between "wmb()" and "mmiowb()". I
> notice that ofa-1.3 has added "mmiowb()" at the end of mthca_cmd_post,
> since "wmb()" is already called at the end of cmd_post, is "mmiowb()"
> really necessary?

wmb() orders writes from the same CPU -- it prevents highly
out-of-order architectures from making writes visible in an order
different from program order. mmiowb() orders MMIO writes between
different CPUs, and prevents systems (such as SGI Altix) where the CPU
fabric may reorder writes before they reach the IO bus.

The mmiowb() is definitely necessary, because without it then commands
were getting messed up on large Altix systems.

- R.

_
Scanned by IBM Email Security Management Services powered by MessageLabs.
For more information please visit http://www.ers.ibm.com
_
Re: [ofa-general] synchronize commands issued to MTHCA
> What is the call chain that calls SW2HW_MPT in this case?

SW2HW_MPT is called by the mthca_mr_alloc function. In this function, it
first calls "mthca_alloc" to get an MR key, then "mthca_table_get" to get
an MR ICM entry, then "mthca_alloc_mailbox" to allocate a mailbox for the
command. While this is going on, the MAD completion handler
"ib_mad_recv_done_handler" is also running; it processes the MAD_IFC
command and sends the response. All of these complete without any error
being reported. Also, for your information, I'm running the driver on two
dual-core Xeon CPUs.

> Also are you going through the mthca_cmd_post_dbell() or
> mthca_cmd_post_hcr() code to write the command params to the HCA?

Yes. I found there's a small difference between these two functions.
There are two "wmb()" calls in mthca_cmd_post_dbell() but only one in
mthca_cmd_post_hcr(). Is there any particular reason for that?

> I think the best way to debug this would be to work directly with
> Mellanox to get a debug build of the HCA firmware and get definite
> info on why the SW2HW_MPT command is failing.

Do you know whom I should contact?

Thanks!
Yicheng


Roland Dreier <[EMAIL PROTECTED]>
01/02/2008 02:55 PM

To
Yicheng Jia <[EMAIL PROTECTED]>
cc
Jack Morgenstein <[EMAIL PROTECTED]>, [email protected]
Subject
Re: [ofa-general] synchronize commands issued to MTHCA

> The SW2HW_MPT command is issued while UDAV table is been creating. During
> the time that the driver is waiting for the completion of the command, it
> does many other things: creating send mad package, posting send mad
> request to the SQ and posting another receive mad request to the RQ.
> There's no error report for all of these actions. However after it, the
> HCA report command parameter error for the SW2HW_MPT.

I doubt the problem is creating the UD address vector -- that is just
shuffling some things around in the CPU's memory. It seems more
likely that posting a send or receive request is messing things up
somehow.

What is the call chain that calls SW2HW_MPT in this case? Also are
you going through the mthca_cmd_post_dbell() or mthca_cmd_post_hcr()
code to write the command params to the HCA?

I think the best way to debug this would be to work directly with
Mellanox to get a debug build of the HCA firmware and get definite
info on why the SW2HW_MPT command is failing.

- R.
Re: [ofa-general] synchronize commands issued to MTHCA
> I wouldn't think so, although I don't have full details of how your
> hardware behaves to know for sure. I assume your PCI bus/memory
> controller is already smart enough to deal with HCR writes being
> interleaved with writes to a doorbell page from userspace, so it seems
> that writes to locally attached memory should be OK too, as long as
> the HCR writes are word-sized in the right order etc.

For the problem I've seen, most probably the HCR writes are interfering
with doorbell register rings. Is that possible?

The FW version I'm using is 1.1.0, without the debug trace function. This
problem is really hard to debug since it's real time and does not occur
very often. It's also hard to use PCIe bus analysis, since by the time
the error happens the PCIe transaction has already completed. All I get
from the HCA is a bad parameter error. Is there any way to get more info
from the HCA?

Thanks!
Yicheng


Roland Dreier <[EMAIL PROTECTED]>
01/02/2008 12:13 PM

To
Jack Morgenstein <[EMAIL PROTECTED]>
cc
[email protected], Yicheng Jia <[EMAIL PROTECTED]>
Subject
Re: [ofa-general] synchronize commands issued to MTHCA

> Roland, do you think that the memcpy_toio() call might mess things up?

I wouldn't think so, although I don't have full details of how your
hardware behaves to know for sure. I assume your PCI bus/memory
controller is already smart enough to deal with HCR writes being
interleaved with writes to a doorbell page from userspace, so it seems
that writes to locally attached memory should be OK too, as long as
the HCR writes are word-sized in the right order etc.

> Maybe we need "wmb()" or "mmiowb()" here as well?

I don't see any reason, although I often miss things. It seems that
the only thing that cares about the writes of the address info being
done would be posting a send WQE that uses it, and that should already
have sufficient ordering. What would we be ordering things against?

- R.
Re: [ofa-general] synchronize commands issued to MTHCA
> The SW2HW_MPT command is issued while UDAV table is been creating. During
> the time that the driver is waiting for the completion of the command, it
> does many other things: creating send mad package, posting send mad
> request to the SQ and posting another receive mad request to the RQ.
> There's no error report for all of these actions. However after it, the
> HCA report command parameter error for the SW2HW_MPT.

I doubt the problem is creating the UD address vector -- that is just
shuffling some things around in the CPU's memory. It seems more
likely that posting a send or receive request is messing things up
somehow.

What is the call chain that calls SW2HW_MPT in this case? Also are
you going through the mthca_cmd_post_dbell() or mthca_cmd_post_hcr()
code to write the command params to the HCA?

I think the best way to debug this would be to work directly with
Mellanox to get a debug build of the HCA firmware and get definite
info on why the SW2HW_MPT command is failing.

- R.
Re: [ofa-general] synchronize commands issued to MTHCA
> Could you tell me what's the difference between "wmb()" and "mmiowb()". I
> notice that ofa-1.3 has added "mmiowb()" at the end of mthca_cmd_post,
> since "wmb()" is already called at the end of cmd_post, is "mmiowb()"
> really necessary?

wmb() orders writes from the same CPU -- it prevents highly
out-of-order architectures from making writes visible in an order
different from program order. mmiowb() orders MMIO writes between
different CPUs, and prevents systems (such as SGI Altix) where the CPU
fabric may reorder writes before they reach the IO bus.

The mmiowb() is definitely necessary, because without it then commands
were getting messed up on large Altix systems.

- R.
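
The ordering described in this message can be sketched in user-space C. This is only a model under stated assumptions, not the mthca source: `fake_hcr`, `post_cmd` and `hcr_lock` are hypothetical names, `wmb()` is approximated with a C11 release fence, and `mmiowb()` is defined as a no-op, which is what the thread reports for x86_64. On a platform such as Altix it would be a real barrier issued before the lock is dropped.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel barriers (not the kernel's definitions). */
#define wmb()    atomic_thread_fence(memory_order_release)
#define mmiowb() do { } while (0)   /* no-op, as reported for x86_64 in this thread */

/* Hypothetical model of a command register block: parameters plus a "go" word. */
struct fake_hcr {
    volatile uint32_t param[6];
    volatile uint32_t go;
};

static pthread_mutex_t hcr_lock = PTHREAD_MUTEX_INITIALIZER;

/* Write the parameters first, then the go word, inside one critical section. */
static void post_cmd(struct fake_hcr *hcr, const uint32_t param[6], uint32_t go_word)
{
    pthread_mutex_lock(&hcr_lock);

    for (int i = 0; i < 6; i++)
        hcr->param[i] = param[i];

    wmb();              /* same CPU: parameters are visible before the go word */
    hcr->go = go_word;

    mmiowb();           /* across CPUs: these writes reach the device before
                         * those of whichever CPU takes the lock next */
    pthread_mutex_unlock(&hcr_lock);
}
```

Calling `post_cmd(&hcr, params, word)` from several threads keeps each command's parameter writes and go write together. The two barriers address different reordering sources, which is why one does not replace the other.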
Re: [ofa-general] synchronize commands issued to MTHCA
Hi Roland,

Could you tell me what's the difference between "wmb()" and "mmiowb()". I
notice that ofa-1.3 has added "mmiowb()" at the end of mthca_cmd_post,
since "wmb()" is already called at the end of cmd_post, is "mmiowb()"
really necessary?

Thanks!
Yicheng


Roland Dreier <[EMAIL PROTECTED]>
01/02/2008 12:13 PM

To
Jack Morgenstein <[EMAIL PROTECTED]>
cc
[email protected], Yicheng Jia <[EMAIL PROTECTED]>
Subject
Re: [ofa-general] synchronize commands issued to MTHCA

> Roland, do you think that the memcpy_toio() call might mess things up?

I wouldn't think so, although I don't have full details of how your
hardware behaves to know for sure. I assume your PCI bus/memory
controller is already smart enough to deal with HCR writes being
interleaved with writes to a doorbell page from userspace, so it seems
that writes to locally attached memory should be OK too, as long as
the HCR writes are word-sized in the right order etc.

> Maybe we need "wmb()" or "mmiowb()" here as well?

I don't see any reason, although I often miss things. It seems that
the only thing that cares about the writes of the address info being
done would be posting a send WQE that uses it, and that should already
have sufficient ordering. What would we be ordering things against?

- R.
Re: [ofa-general] synchronize commands issued to MTHCA
> Roland, do you think that the memcpy_toio() call might mess things up?

I wouldn't think so, although I don't have full details of how your
hardware behaves to know for sure. I assume your PCI bus/memory
controller is already smart enough to deal with HCR writes being
interleaved with writes to a doorbell page from userspace, so it seems
that writes to locally attached memory should be OK too, as long as
the HCR writes are word-sized in the right order etc.

> Maybe we need "wmb()" or "mmiowb()" here as well?

I don't see any reason, although I often miss things. It seems that
the only thing that cares about the writes of the address info being
done would be posting a send WQE that uses it, and that should already
have sufficient ordering. What would we be ordering things against?

- R.
Re: [ofa-general] synchronize commands issued to MTHCA
> Actually I'm working on porting IB driver to QNX platform.

I see. My opinion is that in the long term, you're better off writing
a "native" QNX driver rather than trying to port a driver from another
OS, although I understand that sometimes short-term issues make doing
the right thing impossible.

> However I still get a command exec error which I believe is relevant to
> command synchronization. The problem is when "Created UDAV" is called
> during SW2HW_MPT command is being executed, the SW2HW_MPT command would
> return with bad parameter error. Here are my debug trace output:

No idea really. Does the Linux mthca work on the same hardware? If
so I guess you would have to figure out how the behavior of your
driver is different. If you don't have Linux running on your platform
then you just need to debug the driver/hardware ... perhaps hardware
bus analysis would be helpful to understand what's happening.

- R.
Re: [ofa-general] synchronize commands issued to MTHCA
Hi Jack,

Thanks for your reply. The HCA I'm using is memory free, the chip is
MT25204 and the HCA type is arbel, so it doesn't go through the
"if (ah->type == MTHCA_AH_ON_HCA)" part of code.

By checking the debug output, I got more details about this problem:

The SW2HW_MPT command is issued while UDAV table is been creating. During
the time that the driver is waiting for the completion of the command, it
does many other things: creating send mad package, posting send mad
request to the SQ and posting another receive mad request to the RQ.
There's no error report for all of these actions. However after it, the
HCA report command parameter error for the SW2HW_MPT.

I've copied a snippet of the debug trace output from when this error
happens; hopefully it will help spot the reason.

139903841835 HCR CMD: op_code: LE: d
139903861104 TRACE: mad.c:639/ib_mad_recv_done_handler
139903890876 HCR CMD: in_param_h: LE: 0
139903942869 TRACE: mad.c:644/ib_mad_recv_done_handler
139903993296 HCR CMD: in_param_l: LE: cf616000
139904038413 TRACE: verbs.c:182/ib_create_ah_from_wc
139904094753 HCR CMD: input_modifier: LE: 1e
139904139150 TRACE: mthca_provider.c:447/mthca_ah_create
MTHCA DBG: Created UDAV at 8075220/:
139904197065 HCR CMD: out_pram_h: LE: 0
13990443 [ 0] 0105
139904384499 HCR CMD: out_pram_l: LE: 0
139904428086 [ 4]
139904478675 HCR CMD: token:LE:
139904520156 [ 8] 3000
139904572059 HCR CMD: op_code_modifier: LE: 0
139904612802 [ c]
139904667693 HCR CMD: event:LE: 0
139904708526 [10]
139904758422 HCR CMD 0x18h: LE=8d, BE=d008000
139904799210 [14]
139904904204 [18]
139904946792MTHCA DBG: HCR_STATUS 40100698= d008000 ?
8000
[1c] 0002
139905076860 TRACE: mthca_av.c:235/mthca_create_ah
139905112329 TRACE: mthca_av.c:243/mthca_create_ah
139905147672 TRACE: mthca_provider.c:460/mthca_ah_create
636959 DEBUG: Start mthca_arbel_post_send. qp 0 wr 8d984b8
139905324432 TRACE: mthca_qp.c:1911/mthca_arbel_post_send
139905359505 TRACE: mthca_qp.c:1939/mthca_arbel_post_send
139905418932 TRACE: mthca_qp.c:1949/mthca_arbel_post_send
636959 DEBUG: qp is not direct access and wqe: 0x8d84400
139905541467 TRACE: mthca_qp.c:1954/mthca_arbel_post_send
139905577647 TRACE: mthca_qp.c:1964/mthca_arbel_post_send
139905614565 TRACE: mthca_qp.c:2057/mthca_arbel_post_send
139905669411 TRACE: mthca_qp.c:2076/mthca_arbel_post_send
139905705726 TRACE: mthca_qp.c:2078/mthca_arbel_post_send
636959 DEBUG: wr sg length 0x18, lkey 0x80001900, local addr 0xce2393b8
139905831060 TRACE: mthca_qp.c:2078/mthca_arbel_post_send
636959 DEBUG: wr sg length 0xe8, lkey 0x80001900, local addr 0xce2393d0
139905956322 TRACE: mthca_qp.c:2092/mthca_arbel_post_send
636959 DEBUG: wr id 148473016
139906069875 TRACE: mthca_qp.c:2120/mthca_arbel_post_send
139906106379 TRACE: mthca_qp.c:2128/mthca_arbel_post_send
139906142892 TRACE: mthca_qp.c:2131/mthca_arbel_post_send
139906178640 TRACE: mthca_qp.c:2135/mthca_arbel_post_send
139906214703 TRACE: mthca_qp.c:2158/mthca_arbel_post_send
139906250568 TRACE: mthca_qp.c:2160/mthca_arbel_post_send
636959 DEBUG: End mthca_arbel_post_send. err 0
139906369953 TRACE: mad.c:650/ib_mad_recv_done_handler
139906406295 TRACE: mad.c:669/ib_mad_recv_done_handler
139906441539 TRACE: mad.c:672/ib_mad_recv_done_handler
636959 QNX DBG: mad_priv->header.mad_list.mad_queue->list.prev 88b0a2c
139906578384 TRACE: mthca_qp.c:2177/mthca_arbel_post_receive
139906614168 TRACE: mthca_qp.c:2194/mthca_arbel_post_receive
139906649295 TRACE: mthca_qp.c:2196/mthca_arbel_post_receive
139906689129 TRACE: mad.c:674/ib_mad_recv_done_handler
139906723068 TRACE: mad.c:676/ib_mad_recv_done_handler
636959 QNX DBG: kmem_cache 5 free object=88b0724
139906793007 HCR CMD: Status Return: : 3

Again, thanks for your help!

Best,
Yicheng


Jack Morgenstein <[EMAIL PROTECTED]>
01/01/2008 01:03 AM

To
[email protected]
cc
Yicheng Jia <[EMAIL PROTECTED]>, Roland Dreier <[EMAIL PROTECTED]>
Subject
Re: [ofa-general] synchronize commands issued to MTHCA

On Tuesday 01 January 2008 03:02, Yicheng Jia wrote:
Does your HCA use on-board memory?
(Run: "lspci" and look at "Mellanox" lines. You have on-board memory
if you see either:
PCI bridge: Mellanox Technologies MT23108 InfiniHost HCA bridge (rev a1)
InfiniBand: Mellanox Technologies MT23108 InfiniHost HCA (rev a1)
OR:
InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor
compatibility mode)
)
In that case, when you create an AH in kernel space
(file mthca_av.c, procedure mthca_create_ah() ), you will enter the following
flow:
if (ah->type == MTHCA_AH_ON_HCA) {
memcpy_toio(dev->av_table.av_ma
Re: [ofa-general] synchronize commands issued to MTHCA
On Tuesday 01 January 2008 03:02, Yicheng Jia wrote:
Does your HCA use on-board memory?
(Run: "lspci" and look at "Mellanox" lines. You have on-board memory
if you see either:
PCI bridge: Mellanox Technologies MT23108 InfiniHost HCA bridge (rev a1)
InfiniBand: Mellanox Technologies MT23108 InfiniHost HCA (rev a1)
OR:
InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor
compatibility mode)
)
In that case, when you create an AH in kernel space
(file mthca_av.c, procedure mthca_create_ah() ), you will enter the following
flow:
    if (ah->type == MTHCA_AH_ON_HCA) {
        memcpy_toio(dev->av_table.av_map + index * MTHCA_AV_SIZE,
                    av, MTHCA_AV_SIZE);
        kfree(av);
    }
Roland, do you think that the memcpy_toio() call might mess things up?
Maybe we need "wmb()" or "mmiowb()" here as well?
- Jack
> Hi Roland,
>
> Thanks for your reply!
>
> Actually I'm working on porting IB driver to QNX platform. I resume the
> work started by my former colleague, and I just found that the sync codes
> (dev->cmd.poll_sem and dev->cmd.hcr_mutex) were deleted for unknown
> reason. After adding back these sync codes, the driver runs much
> smoothlier.
>
> However I still get a command exec error which I believe is relevant to
> command synchronization. The problem is when "Created UDAV" is called
> during SW2HW_MPT command is being executed, the SW2HW_MPT command would
> return with bad parameter error. Here are my debug trace output:
>
> 139903841835 HCR CMD: op_code: LE: d
> 139903861104 TRACE: mad.c:639/ib_mad_recv_done_handler
> 139903890876 HCR CMD: in_param_h: LE: 0
> 139903942869 TRACE: mad.c:644/ib_mad_recv_done_handler
> 139903993296 HCR CMD: in_param_l: LE: cf616000
> 139904038413 TRACE: verbs.c:182/ib_create_ah_from_wc
> 139904094753 HCR CMD: input_modifier: LE: 1e
> 139904139150 TRACE: mthca_provider.c:447/mthca_ah_create
> MTHCA DBG: Created UDAV at 8075220/:
> 139904197065 HCR CMD: out_pram_h: LE: 0
> 13990443 [ 0] 0105
> 139904384499 HCR CMD: out_pram_l: LE: 0
> 139904428086 [ 4]
> 139904478675 HCR CMD: token:LE:
> 139904520156 [ 8] 3000
> 139904572059 HCR CMD: op_code_modifier: LE: 0
> 139904612802 [ c]
> 139904667693 HCR CMD: event:LE: 0
> 139904708526 [10]
> 139904758422 HCR CMD 0x18h: LE=8d, BE=d008000
> 139904799210 [14]
> 139904904204 [18]
> 139904946792MTHCA DBG: HCR_STATUS 40100698= d008000 ?
> 8000
>[1c] 0002
> 139905076860 TRACE: mthca_av.c:235/mthca_create_ah
> 139905112329 TRACE: mthca_av.c:243/mthca_create_ah
> 139905147672 TRACE: mthca_provider.c:460/mthca_ah_create
>
> 139906793007 HCR CMD: Status Return: : 3
>
> Do you have any idea?
>
> Thanks and have a good new year!
> Yicheng
>
>
>
>
> Roland Dreier <[EMAIL PROTECTED]>
> 12/28/2007 11:39 PM
>
> To
> Yicheng Jia <[EMAIL PROTECTED]>
> cc
> [email protected]
> Subject
> Re: [ofa-general] synchronize commands issued to MTHCA
>
>
>
>
>
>
> > I'm using OFED-1.0 and the problem I believe is related to command
> > synchronization of HCA. The host issues a MAD_INF command at first and
> > then a SW2HW_MTP command without waiting for the completion of the first
> > command. Both of commands return with bad parameters error.
>
> I guess you mean the MAD_IFC and SW2HW_MPT commands? I've never heard
> of a problem like that -- more details about your hardware/software
> config and the exact symptoms you see would be helpful in debugging.
>
> Anyway OFED 1.0 is ancient by now -- you are much better off just
> using drivers from the standard kernel. If you must use OFED, then
> OFED 1.2 or even a 1.3 prerelease would be better.
>
> > My question is why there's no synchronization mechanism for the command
> > execution on HCA, can I use "spin_lock" or "sem_wait" to synchronize
> > between every command?
>
> The HCA firmware allows multiple commands to be queued. The
> dev->cmd.event_sem semaphore is used to limit the number of
> outstanding commands to the HCA's capabilities, and the
> dev->cmd.hcr_mutex mutex is used to serialize the actual writing of
> commands to the HCA.
>
> There was a mmiowb() added to mthca_cmd_post() fairly recently that
> might fix your problems if you are running on a large SGI Altix system.
>
> - R.
>
Re: [ofa-general] synchronize commands issued to MTHCA
Hi Roland,

Thanks for your reply!

Actually I'm working on porting IB driver to QNX platform. I resume the
work started by my former colleague, and I just found that the sync codes
(dev->cmd.poll_sem and dev->cmd.hcr_mutex) were deleted for unknown
reason. After adding back these sync codes, the driver runs much
smoothlier.

However I still get a command exec error which I believe is relevant to
command synchronization. The problem is when "Created UDAV" is called
during SW2HW_MPT command is being executed, the SW2HW_MPT command would
return with bad parameter error. Here are my debug trace output:

139903841835 HCR CMD: op_code: LE: d
139903861104 TRACE: mad.c:639/ib_mad_recv_done_handler
139903890876 HCR CMD: in_param_h: LE: 0
139903942869 TRACE: mad.c:644/ib_mad_recv_done_handler
139903993296 HCR CMD: in_param_l: LE: cf616000
139904038413 TRACE: verbs.c:182/ib_create_ah_from_wc
139904094753 HCR CMD: input_modifier: LE: 1e
139904139150 TRACE: mthca_provider.c:447/mthca_ah_create
MTHCA DBG: Created UDAV at 8075220/:
139904197065 HCR CMD: out_pram_h: LE: 0
13990443 [ 0] 0105
139904384499 HCR CMD: out_pram_l: LE: 0
139904428086 [ 4]
139904478675 HCR CMD: token:LE:
139904520156 [ 8] 3000
139904572059 HCR CMD: op_code_modifier: LE: 0
139904612802 [ c]
139904667693 HCR CMD: event:LE: 0
139904708526 [10]
139904758422 HCR CMD 0x18h: LE=8d, BE=d008000
139904799210 [14]
139904904204 [18]
139904946792MTHCA DBG: HCR_STATUS 40100698= d008000 ?
8000
[1c] 0002
139905076860 TRACE: mthca_av.c:235/mthca_create_ah
139905112329 TRACE: mthca_av.c:243/mthca_create_ah
139905147672 TRACE: mthca_provider.c:460/mthca_ah_create

139906793007 HCR CMD: Status Return: : 3

Do you have any idea?

Thanks and have a good new year!
Yicheng


Roland Dreier <[EMAIL PROTECTED]>
12/28/2007 11:39 PM

To
Yicheng Jia <[EMAIL PROTECTED]>
cc
[email protected]
Subject
Re: [ofa-general] synchronize commands issued to MTHCA

> I'm using OFED-1.0 and the problem I believe is related to command
> synchronization of HCA. The host issues a MAD_INF command at first and
> then a SW2HW_MTP command without waiting for the completion of the first
> command. Both of commands return with bad parameters error.

I guess you mean the MAD_IFC and SW2HW_MPT commands? I've never heard
of a problem like that -- more details about your hardware/software
config and the exact symptoms you see would be helpful in debugging.

Anyway OFED 1.0 is ancient by now -- you are much better off just
using drivers from the standard kernel. If you must use OFED, then
OFED 1.2 or even a 1.3 prerelease would be better.

> My question is why there's no synchronization mechanism for the command
> execution on HCA, can I use "spin_lock" or "sem_wait" to synchronize
> between every command?

The HCA firmware allows multiple commands to be queued. The
dev->cmd.event_sem semaphore is used to limit the number of
outstanding commands to the HCA's capabilities, and the
dev->cmd.hcr_mutex mutex is used to serialize the actual writing of
commands to the HCA.

There was a mmiowb() added to mthca_cmd_post() fairly recently that
might fix your problems if you are running on a large SGI Altix system.

- R.
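
One way to sanity-check the trace in this message: the line "HCR CMD 0x18h: LE=8d, BE=d008000" is consistent with op code 0xd combined with a "go" bit and then byte-swapped for the device, if the log or the archive dropped zero digits. The sketch below assumes the go bit is bit 23, which is not stated anywhere in the thread, and the helper names `swab32` and `go_word` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define GO_BIT 23   /* assumption: position of the "go" bit, not stated in the thread */

/* Reverse the byte order of a 32-bit word. */
static uint32_t swab32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00U) |
           ((x << 8) & 0x00ff0000U) | (x << 24);
}

/* Build the final command word: op code in the low bits plus the go bit. */
static uint32_t go_word(uint32_t opcode)
{
    return (UINT32_C(1) << GO_BIT) | opcode;
}
```

Under that assumption, `go_word(0xd)` is 0x0080000d and `swab32(go_word(0xd))` is 0x0d008000, which lines up with the BE value in the trace once a leading zero is restored. If that reading is right, the command word itself looks well formed, which would point the search at the parameters or the mailbox contents instead.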
Re: [ofa-general] synchronize commands issued to MTHCA
> I'm using OFED-1.0 and the problem I believe is related to command
> synchronization of HCA. The host issues a MAD_INF command at first and
> then a SW2HW_MTP command without waiting for the completion of the first
> command. Both of commands return with bad parameters error.

I guess you mean the MAD_IFC and SW2HW_MPT commands? I've never heard
of a problem like that -- more details about your hardware/software
config and the exact symptoms you see would be helpful in debugging.

Anyway OFED 1.0 is ancient by now -- you are much better off just
using drivers from the standard kernel. If you must use OFED, then
OFED 1.2 or even a 1.3 prerelease would be better.

> My question is why there's no synchronization mechanism for the command
> execution on HCA, can I use "spin_lock" or "sem_wait" to synchronize
> between every command?

The HCA firmware allows multiple commands to be queued. The
dev->cmd.event_sem semaphore is used to limit the number of
outstanding commands to the HCA's capabilities, and the
dev->cmd.hcr_mutex mutex is used to serialize the actual writing of
commands to the HCA.

There was a mmiowb() added to mthca_cmd_post() fairly recently that
might fix your problems if you are running on a large SGI Altix system.

- R.
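
For anyone porting this scheme to another OS, the two-level locking described above can be modelled with POSIX primitives. This is a rough sketch, not the driver code: `struct cmd_if`, `cmd_post`, `cmd_complete` and `MAX_CMDS` are hypothetical, and it uses sem_trywait() so the example stays single-threaded and deterministic, whereas a real driver would block until a slot is free.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdint.h>

#define MAX_CMDS 2   /* hypothetical limit; the real value comes from the HCA */

/* Hypothetical model of the command interface state. */
struct cmd_if {
    sem_t event_sem;            /* one token per outstanding command */
    pthread_mutex_t hcr_mutex;  /* serializes writes to the command register */
    uint32_t hcr_opcode;        /* stands in for the HCR itself */
    int posted;                 /* number of commands written so far */
};

static int cmd_if_init(struct cmd_if *c)
{
    c->hcr_opcode = 0;
    c->posted = 0;
    if (pthread_mutex_init(&c->hcr_mutex, NULL) != 0)
        return -1;
    return sem_init(&c->event_sem, 0, MAX_CMDS);
}

/* Try to post a command; returns 0 on success, -1 if no slot is free. */
static int cmd_post(struct cmd_if *c, uint32_t opcode)
{
    if (sem_trywait(&c->event_sem) != 0)
        return -1;                  /* MAX_CMDS commands already outstanding */

    pthread_mutex_lock(&c->hcr_mutex);
    c->hcr_opcode = opcode;         /* only one writer touches the HCR at a time */
    c->posted++;
    pthread_mutex_unlock(&c->hcr_mutex);
    return 0;
}

/* Called when the device reports completion: hand the token back. */
static void cmd_complete(struct cmd_if *c)
{
    sem_post(&c->event_sem);
}
```

With MAX_CMDS set to 2, a third cmd_post() fails until cmd_complete() returns a token. The semaphore bounds how many commands are in flight, while the mutex only covers the short window in which the register is written, so commands can overlap in the firmware without their register writes interleaving.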
[ofa-general] synchronize commands issued to MTHCA
Hi Folks,

I'm using mellanox HCA and I'm a newbie to this IB community. I'm
encountering a problem with the HCA board and looking forward to getting
some help here.

I'm using OFED-1.0 and the problem I believe is related to command
synchronization of HCA. The host issues a MAD_INF command at first and
then a SW2HW_MTP command without waiting for the completion of the first
command. Both of commands return with bad parameters error.

My question is why there's no synchronization mechanism for the command
execution on HCA, can I use "spin_lock" or "sem_wait" to synchronize
between every command?

Thanks!
Yicheng
