On Tuesday 22 September 2009 21:58, Nathan Stratton wrote:
>
> Having an issue with getting verbs working on 2.6.31. I am running Fedora
> 11 with 2.6.31 and 1.1.2-0.1.gb00dc7d libibverbs. Everything looks great
> until I run ibv_srq_pingpong to the server. It shows local/remote address
> a bun
catas_reset() uses a pointer to mthca_dev, but mthca_dev may not be valid after
the call to __mthca_restart_one().
Based on a similar patch for mlx4 by Vitaliy Gusev
Signed-off-by: Jack Morgenstein
---
Roland,
Here is the equivalent patch for mthca catas error processing. Here, also, we
On Thursday 10 September 2009 23:00, Jeremy Enos wrote:
> Fails w/ ofa_kernel like the others have... I didn't test excluding this rpm
> with FC11, but the others also fail elsewhere w/ this rpm excluded- so I'm
> guessing FC11 would as well. I included the output (and last 50 lines of
> log)
dules/2.6.18-92.1.13.el5xen/kernel/drivers/infiniband/ulp/sdp
> /lib/modules/2.6.18-92.1.13.el5xen/kernel/drivers/infiniband/ulp/sdp/ib_
> sdp.ko
> /lib/modules/2.6.18-92.el5/kernel/drivers/infiniband/ulp/sdp
> /lib/modules/2.6.18-92.el5/kernel/drivers/infiniband/ulp/sdp/ib_
>
> Configured devices:
> ib0
>
> Currently active devices:
> ib0
>
> The following OFED modules are loaded:
>
> rdma_ucm
> rdma_cm
> ib_addr
> ib_ipoib
> mlx4_core
> mlx4_ib
> ib_mthca
> ib_uverbs
> ib_umad
> ib_sa
> ib_
On Tuesday 01 September 2009 13:44, Robert Dunkley wrote:
> Hi everyone,
>
> A DRBD release candidate with specific SDP/Infiniband support was
> released last week.
>
> I have an existing OFED 1.3.1 install without the SDP protocol loaded, I
> need to add it. I still have the original source I in
> >>> I think OFED 1.5 might work on it but not sure. Which kernel version
> >>> FC10 use?
> >>> In general OFED 1.5 supports FC11
> >>>
> >> Actually, it supports FC12 (kernel 2.6.29).
> >>
> > We had originally planned to support FC11 -- however, in the interim, FC12
> > was
> > rel
On Sunday 30 August 2009 18:47, Jack Morgenstein wrote:
> On Sunday 23 August 2009 11:16, Tziporet Koren wrote:
> > Jeremy Enos wrote:
> > > Coming up on a year of Fedora 10 GA... Fedora 9 no longer maintained.
> > > No OFED support for FC10 yet creates a t
On Sunday 23 August 2009 11:16, Tziporet Koren wrote:
> Jeremy Enos wrote:
> > Coming up on a year of Fedora 10 GA... Fedora 9 no longer maintained.
> > No OFED support for FC10 yet creates a tough spot if trying to stay
> > secure. Is there *any* version (1.5, etc) that will even build on FC10?
On Wednesday 26 August 2009 02:03, MANIKANTAN KALAIYA wrote:
> Resending to the mailing list...
>
> We have Ofed1.3.1 installed, one of the sub packages is libibverbs version
> 1.1.1. We have a small program that lists the number of IB cards available in
> the system through ibv_get_device_list(
On Wednesday 26 August 2009 13:17, Sneha Mistry wrote:
> Hi,
>
> I am a newbie to Infiniband and am trying to install OFED-1.5-alpha4 on
> opensuse 10.3.
> Kernel version is 2.6.26-2-686.
1. OFED 1.5 is not supported on OpenSuse 10.3 -- it is supported on OpenSuse 11.
2. You are correct in that the
device comes back up, thus preventing the above deadlock.
V2: move active flag from net to hw/mlx4, and use only for fatal event flow.
(per feedback from Roland).
V3: fixed checkpatch.pl warnings.
Signed-off-by: Jack Morgenstein
---
Roland,
Sorry about the checkpatch.pl oversight. No excuse
device comes back up, thus preventing the above deadlock.
Signed-off-by: Jack Morgenstein
---
Roland,
You are right, mthca also needs such a patch.
This will prevent user-level apps from allocating a device context following
a device internal catastrophic error.
BTW, if the administrator has d
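The idea in this changelog, refusing new user contexts once the device has reported a fatal error, can be sketched in plain userspace C. This is only an illustration and not the patch itself: demo_dev, demo_fatal_event(), demo_device_up(), demo_alloc_context(), the "active" field, and the choice of -EAGAIN as the error code are all assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device structure; not the real driver layout. */
struct demo_dev {
	bool active;        /* cleared when a fatal (catastrophic) error is reported */
	int  open_contexts; /* number of user contexts currently allocated */
};

/* Simulated fatal-event path: stop handing out new contexts. */
static void demo_fatal_event(struct demo_dev *dev)
{
	assert(dev != NULL);
	dev->active = false;
}

/* Simulated "device came back up" path: allow allocations again. */
static void demo_device_up(struct demo_dev *dev)
{
	assert(dev != NULL);
	dev->active = true;
}

/* Returns 0 on success, or -EAGAIN (an assumed error code) while inactive. */
static int demo_alloc_context(struct demo_dev *dev)
{
	assert(dev != NULL);
	if (!dev->active)
		return -EAGAIN;
	dev->open_contexts++;
	return 0;
}

/* Result of an allocation attempted after a fatal event,
 * optionally preceded by a successful device restart. */
static int demo_alloc_after_fatal(bool restarted)
{
	struct demo_dev dev = { .active = true, .open_contexts = 0 };

	demo_fatal_event(&dev);
	if (restarted)
		demo_device_up(&dev);
	return demo_alloc_context(&dev);
}
```

In this sketch, demo_alloc_after_fatal(false) fails with -EAGAIN, while demo_alloc_after_fatal(true) succeeds because the flag was set again once the simulated device came back up.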
On Tuesday 11 August 2009 19:23, Roland Dreier wrote:
>
> > this is a continuation of thread:
> > http://lists.openfabrics.org/pipermail/general/2009-July/060668.html
>
> I see you
> didn't answer the question about mthca -- does it suffer from this
> problem as well?
>
Sorry about that. Yes,
device comes back up, thus preventing the above deadlock.
V2: move active flag from net to hw/mlx4, and use only for fatal event flow.
(per feedback from Roland).
Signed-off-by: Jack Morgenstein
---
Roland,
this is a continuation of thread:
http://lists.openfabrics.org/pipermail/general/2009-Ju
On Monday 10 August 2009 20:42, Roland Dreier wrote:
>
> > I'm a bit nervous about this one.
> > printk_once will print once ONLY if CONFIG_PRINTK is set in
> > include/linux/autoconf.h
> > (i.e., when the kernel is configured). Otherwise, it gets defined to
> > printk --
> > and it will alwa
I'm a bit nervous about this one.
printk_once will print once ONLY if CONFIG_PRINTK is set in
include/linux/autoconf.h
(i.e., when the kernel is configured). Otherwise, it gets defined to printk --
and it will always print in this case.
(see 2.6.30.xx kernel include file "include/linux/kernel.h
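The concern raised above can be illustrated with a small userspace sketch. It is not the kernel's definition: demo_log, demo_log_once and demo_log_once_fallback are hypothetical macros that mimic a "print once" helper and a fallback that silently becomes "print always".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static int demo_emitted; /* counts messages actually printed */

/* Hypothetical logging primitive standing in for printk(). */
#define demo_log(...) (demo_emitted++, printf(__VA_ARGS__))

/* Print-once variant: one static flag per call site. */
#define demo_log_once(...)                      \
	do {                                    \
		static bool demo_first = true;  \
		if (demo_first) {               \
			demo_first = false;     \
			demo_log(__VA_ARGS__);  \
		}                               \
	} while (0)

/* Fallback variant: "once" degrades to "always". */
#define demo_log_once_fallback(...) demo_log(__VA_ARGS__)

/* Hits the same print-once call site n times; returns messages printed. */
static int demo_warn_once(int n)
{
	int before = demo_emitted;

	assert(n >= 0);
	for (int i = 0; i < n; i++)
		demo_log_once("feature not supported\n");
	return demo_emitted - before;
}

/* Same loop with the fallback macro; returns messages printed. */
static int demo_warn_fallback(int n)
{
	int before = demo_emitted;

	assert(n >= 0);
	for (int i = 0; i < n; i++)
		demo_log_once_fallback("feature not supported\n");
	return demo_emitted - before;
}
```

With the print-once macro, the first demo_warn_once(5) prints a single message and later calls print none, because the static flag belongs to the call site. With the fallback, demo_warn_fallback(5) prints all five, which is the log-flooding behaviour the message worries about.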
On Tuesday 04 August 2009 12:58, Robert Dunkley wrote:
> I'm a bit of a newbie to kernel building but work on my first custom
> kernel seems to be going well so far.
>
> The issue I have is the systems this kernel is destined for are using
> Mellanox infiniband cards, IPOIB (CM), RDMA and Subnet Ma
On Wednesday 29 July 2009 21:42, Hal Rosenstock wrote:
>
>
> I know I'm going to hear it but it's not under my control :-)
>
> It's whatever is in OFED 1.4.1. kernel is some 2.6.18 variant using
> mlx4.v1.0 (April 4, 2008) using x86_64 arch.
>
Hal,
1. Did you install userspace from OFED 1.4.1,
On Thursday 23 July 2009 21:17, Tziporet Koren wrote:
> OFED 1.5-alpha4 is available
>
> o Linux Operating Systems:
...
> - OpenSuSE 10.3:2.6.22.5-31 *
Correction: OpenSuSE 11: 2.6.25.5-1.1-default *
(OpenSuSE 10.3 is not supported under OFED 1.5)
Andy,
This snippet is from the EWG list, regarding the daily build of OFED 1.5 (which
is based on kernel 2.6.30).
Note the failure below (when compiling on kernel 2.6.26).
Please note that rds will fail in ALL backports (i.e., kernel 2.6.29 and
earlier), because the 'DECLARE_PER_CPU_SHARED_ALIGNED
On Thursday 16 July 2009 21:08, Doug Ledford wrote:
> On rhel4 and rhel5 machines, the kmalloc implementation does not
> automatically forward kmalloc requests > 128kb to __get_free_pages.
> Please include this patch in all rhel4 and rhel5 backport directories
> so that we do the right thing
On Monday 20 July 2009 21:53, Jason Gunthorpe wrote:
> I have also patches for mlx4 and mthca to suppress the compiler
> warning that results from this patch. ipath is OK as is, and I'm not
> sure where the iwarp stuff lives..
>
Is this change really necessary?
Seems to me that you are creating c
On Wednesday 15 July 2009 01:33, Roland Dreier wrote:
> It occurs to me that one change that makes sense and would help make
> this fix cleaner is the following -- since after all if a command # is
> out of range, that's really a different error than if a low-level driver
> just doesn't implement a
On Monday 13 July 2009 19:52, pandit ib wrote:
> > Looks like the OFED installation is faulty.
>
> Can we fix this issue in the next release of OFED?
This is not an OFED issue, you need to fix your compilation script.
> > For some reason, your compilation script is not taking directory
> > /usr/
device comes back up, thus preventing the above deadlock.
Signed-off-by: Jack Morgenstein
---
Roland,
For good measure, I also set the active flag to false at mlx4_ib_remove() -- to
give some measure of protection against opening a new userspace app while the
driver is in the process of bein
On Thursday 09 July 2009 18:10, Roland Dreier wrote:
> Or maybe it's cleaner to add a stub resize_cq method that just returns
> ENOSYS that drivers can set when they don't actually implement it...
>
Basically, that is what the patch I submitted to you does.
It's just that instead of having a differ
On Wednesday 08 July 2009 18:58, Roland Dreier wrote:
> This is kind of dopey, isn't it? Seems cleaner just to leave the
> resize_cq method unset if the hardware doesn't support it; then the core
> takes care of this check for us.
>
Not so (unfortunately). The problem is that doing it the "corre
The returned FW raw command status is invaluable in troubleshooting, and
if a FW command error status is returned, we need to be able to see it
(along with the command which caused the non-zero status).
Signed-off-by: Jack Morgenstein
diff --git a/drivers/net/mlx4/cmd.c b/drivers/net/mlx4/cmd.c
If a ConnectX card has a FW version installed which does not
support resize cq, the resize_cq command will return -ENOSYS.
Fixes Bugzilla 1415.
Signed-off-by: Jack Morgenstein
---
Roland,
I submitted this on 2008-12-03, and somehow it fell through the cracks.
I've regenerated it for you
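A minimal userspace sketch of the behaviour described in this changelog: the driver keeps its resize_cq method set, and the method itself reports -ENOSYS when the installed firmware lacks the feature. The names demo_cq_dev, fw_supports_resize, demo_resize_cq() and demo_try_resize() are hypothetical and do not come from the mlx4 sources.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device state; the capability flag would be derived from
 * what the firmware reports at initialization time. */
struct demo_cq_dev {
	bool fw_supports_resize;
	int  cqe; /* current number of completion queue entries */
};

/* The method is always present; it checks firmware support at call time,
 * so one driver binary can serve cards with different firmware versions. */
static int demo_resize_cq(struct demo_cq_dev *dev, int cqe)
{
	assert(dev != NULL);
	if (!dev->fw_supports_resize)
		return -ENOSYS;
	if (cqe < 1)
		return -EINVAL;
	dev->cqe = cqe;
	return 0;
}

/* Builds a device with the given capability and attempts a resize. */
static int demo_try_resize(bool fw_supports_resize, int cqe)
{
	struct demo_cq_dev dev = {
		.fw_supports_resize = fw_supports_resize,
		.cqe = 64,
	};

	return demo_resize_cq(&dev, cqe);
}
```

Here demo_try_resize(false, 256) yields -ENOSYS while demo_try_resize(true, 256) succeeds, so the caller sees the same error it would get if the method were not implemented at all.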
On Tuesday 07 July 2009 15:31, Lars Ellenberg wrote:
> but I was wondering about the status of the
> git://git.openfabrics.org/ofed_1_4/linux-2.6.git tree,
> whether that is supposed to be the "most uptodate official ofed_1_4"
> kernel, and whether or not it is going to be updated to either track
>
Adding Eli Cohen (author of the huge-page support patch).
Eli, what is missing on the PPC regarding huge page support?
-Jack
On Tuesday 07 July 2009 13:54, Or Gerlitz wrote:
> Jack Morgenstein wrote:
> > Yes, see kernel_patches/fixes/mlx4_0010_add_wc.patch. With OFED 1.5, I am
>
On Tuesday 07 July 2009 13:03, Or Gerlitz wrote:
> Jack Morgenstein wrote:
> > I've been looking at the write-combining support in the 2.6.30 kernel, and
> > it looks good [...] from the write-combining support in OFED 1.4:
> >
> Hi Jack, is there some WC related
Looks like the OFED installation is faulty.
The missing function declarations are all found in header files under directory
(on your system) /usr/src/ofa_kernel/kernel_addons/backport/2.6.16_sles10_sp2.
These header files must be taken before the regular kernel header files
(they "include_next" to
Hi Roland,
I've been looking at the write-combining support in the 2.6.30 kernel, and it
looks good.
There is also a good solution for PPC write combining support in the kernel
(adding #define pgprot_writecombine pgprot_noncached_wc to file
arch/powerpc/include/asm/pgtable.h,
per e-mail corres
Resending, adding Hoang-Nam Nguyen and Christoph Raisch of IBM.
Please see the questions below. Also, who is the person at IBM who does Linux
kernel development for the PPC?
Thanks!
-Jack
On Sunday 28 June 2009 12:21, Jack Morgenstein wrote:
> Hi Shirley,
>
> I was reviewing write-
On Tuesday 30 June 2009 16:27, Yossi Etigin wrote:
>
> What do you think about renaming the mcast_task to mcast_join_task
> and multicast_list to mcast_join_list? It will make the purpose and
> the analogy between the two more obvious.
I'll do that.
> > static void __exit ipoib_cleanup_module(v
To avoid race conditions (which may lead to a kernel Oops) between
multicast join and multicast leave, we transfer leave processing to the
workqueue (rather than do it in place).
This fixes Bugzilla 1666.
This fix was suggested by Yossi Etigin of Voltaire.
Signed-off-by: Jack Morgenstein
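The approach in this changelog, running leave on the same queue as join so the two can never interleave, can be sketched in userspace with a tiny FIFO of work items. This is an assumption-laden illustration, not the IPoIB code: demo_workqueue, demo_queue_work(), demo_run_queue() and the task functions are hypothetical, and the queue is drained by a single caller rather than a kernel worker thread.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_WORK 8

struct demo_mcast {
	int joined;     /* 1 while the group is joined */
	int bad_leaves; /* leaves that ran before any join had completed */
};

typedef void (*demo_work_fn)(struct demo_mcast *);

/* Fixed-size FIFO: items run strictly in the order they were queued. */
struct demo_workqueue {
	demo_work_fn fn[DEMO_MAX_WORK];
	size_t head;
	size_t tail;
};

static int demo_queue_work(struct demo_workqueue *wq, demo_work_fn fn)
{
	assert(wq != NULL && fn != NULL);
	if (wq->tail == DEMO_MAX_WORK)
		return -1;
	wq->fn[wq->tail++] = fn;
	return 0;
}

/* Single consumer: each item finishes before the next one starts. */
static void demo_run_queue(struct demo_workqueue *wq, struct demo_mcast *mc)
{
	while (wq->head < wq->tail)
		wq->fn[wq->head++](mc);
}

static void demo_join_task(struct demo_mcast *mc)
{
	mc->joined = 1;
}

static void demo_leave_task(struct demo_mcast *mc)
{
	if (!mc->joined)
		mc->bad_leaves++;
	mc->joined = 0;
}

/* Queues a join followed by a leave; returns 0 if they ran in order. */
static int demo_join_then_leave(void)
{
	struct demo_workqueue wq = { .head = 0, .tail = 0 };
	struct demo_mcast mc = { 0, 0 };

	if (demo_queue_work(&wq, demo_join_task) ||
	    demo_queue_work(&wq, demo_leave_task))
		return -1;
	demo_run_queue(&wq, &mc);
	return (mc.joined == 0 && mc.bad_leaves == 0) ? 0 : -1;
}
```

Because both operations go through the same queue, the leave in this sketch can only run after the join has completed, which is the property the fix relies on.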
On Monday 29 June 2009 18:06, Moni Shoua wrote:
> Jack Morgenstein wrote:
> > On Sunday 28 June 2009 19:09, Moni Shoua wrote:
> >> maybe synchronizing the race with a completion var (like IPoIB does in
> >> struct ipoib_path) will help. I think this will work. I can sen
On Monday 29 June 2009 07:14, Jack Morgenstein wrote:
> >
> On second thought, maybe it would be simpler to just create an
> ipoib_stop_task(), and do everything ipoib_stop() does in that workqueue
> task. leave would thus always be executed in the workqueue.
>
On Sunday 28 June 2009 23:04, Yossi Etigin wrote:
> How about making the leave/free mcast operation take place on the
> ipoib_workqueue, on which the join operation takes place? this way we can
> avoid this race, and more potential races of this kind.
>
On second thought, maybe it would be s
On Sunday 28 June 2009 23:04, Yossi Etigin wrote:
> Jack Morgenstein wrote:
> > in ipoib_mcast_leave():
> > *** NEED TO WAIT HERE BEFORE CONTINUING (so that BUSY is cleared
> > (mcast->mc is in error),
> > *** or BUSY flag is set and mcast->
On Sunday 28 June 2009 19:09, Moni Shoua wrote:
> maybe synchronizing the race with a completion var (like IPoIB does in struct
> ipoib_path) will help. I think this will work. I can send a patch if you want
> unless you see this idea doesn't work for this case.
>
> MoniS
I just looked at the ip
On Sunday 28 June 2009 19:09, Moni Shoua wrote:
> maybe synchronizing the race with a completion var
> (like IPoIB does in struct ipoib_path) will help. I think this will work.
> I can send a patch if you want unless you see this idea doesn't work for this
> case.
>
Please do send a patch.
Tha
Hi Shirley,
I was reviewing write-combining for the PPC on kernel 2.6.30, and noticed the
following
in file arch/powerpc/include/asm/pgtable.h:
#define _PAGE_CACHE_CTL (_PAGE_COHERENT | _PAGE_GUARDED | _PAGE_NO_CACHE | \
_PAGE_WRITETHRU)
...
#define pgprot_noncached_wc(p
We have seen the following kernel Oops on IPoIB:
ib0: multicast join failed for ff12:401b::::::, status -22
Unable to handle kernel paging request for data at address 0x0054
Faulting instruction address: 0xe60b43c4
Oops: Kernel access of bad area, sig: 11 [#1]
...
NIP
On Monday 08 June 2009 18:08, Nicolas Morey-Chaisemartin wrote:
> I'm still having difficulty understanding how mainstream code and OFED code
> interact.
>
The base kernel files (in this case, 2.6.30) are taken unmodified into OFED.
Adjustments (patches) to the base kernel are placed in direct
OFED 1.5 is still based on 2.6.30-rc2. If this patch is in 2.6.30-rc8,
we will grab it from the mainstream within the next couple of days (when we
rebase to that RC).
(For that reason, I'm not checking this in as a patch right now).
-Jack
On Monday 08 June 2009 11:52, Nicolas Morey-Chaisemartin
On Friday 05 June 2009 15:44, Tom Talpey wrote:
> If you want, I'll dig up the git change.
>
Thanks, but no need. I know about that one. This is a different bug.
-Jack
___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org
On Friday 05 June 2009 20:31, Roland Dreier wrote:
>
> > Link to message towards end of thread (with very specific problem
> > description):
> > http://lists.openfabrics.org/pipermail/general/2009-April/059253.html
>
> > This patch fixes the problem described in the thread.
>
> That is very us
On Friday 05 June 2009 02:47, Roland Dreier wrote:
> > Please try to get this patch into 2.6.30 -- it is an important fix for
> > nfsrdma.
>
> Would be easier to get it in if you had a pointer to the NFS/RDMA bug
> report. Not sure why you think this info isn't worth including in the
> changelog.
may begin execution
too early).
Signed-off-by: Jack Morgenstein
---
Roland,
Please try to get this patch into 2.6.30 -- it is an important fix for nfsrdma.
Thanks!
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 20724ae..c4a0264 100644
--- a/drivers/infiniba
cause of the crash was found by Vu Pham of Mellanox.
The fix is along the lines suggested by Steve Wise in comment #21 in Bugzilla
1571.
This patch fixes Bugzilla 1571.
Signed-off-by: Jack Morgenstein
---
Roland, please take this for kernel 2.6.30.
diff --git a/drivers/infiniband/hw/mlx4/ml
On Tuesday 05 May 2009 18:06, Jon Mason wrote:
> No, we currently duplicate all the scatterlist functionality. Including
> ncrypto.h would greatly simplify the backport headers, but it is a
> RHEL5.2/5.3 only solution. If this change is needed for all other
> backports, then a better solution wil
On Monday 04 May 2009 17:56, Jon Mason wrote:
> What's even worse is that sg_init_table is already defined in the
> RHEL5.3 headers. When coding up a header cleanup patch for RHEL5.3, I
> noticed it was already defined in linux/ncrypto.h. Also, it's there for
> RHEL5.2 (and a few older kernels).
On Saturday 02 May 2009 14:46, Bart Van Assche wrote:
> Hello,
>
> Yesterday I installed OFED-1.4.1-rc4 on a CentOS 5.3 system and started
> looking at the backported kernel headers. I found the following in the
> header file
> /usr/src/ofa_kernel-1.4.1/kernel_addons/backport/2.6.18-EL5.3/include/
On Saturday 02 May 2009 14:46, Bart Van Assche wrote:
> Does anyone know why sg_init_table() is defined such that it does nothing in
> the backported OFED headers ?
>
My mistake while doing backports.
Will be fixed in rc5.
- Jack
On Monday 27 April 2009 19:55, Jason Gunthorpe wrote:
> OFED tests the past - back ports to old distributions and a random
> non-upstream collection of patches on top of that. That is fine for end
> users, but..
>
That is not quite the case. We do test regression on the base kernel of
a given OFE
On Monday 27 April 2009 20:31, Bart Van Assche wrote:
> . As an example, commit
> 233e70f4228e78eb2f80dc6650f65d3ae3dbf17c was applied to Linus' tree on
> October 19, 2008. I could not find any trace of this
> patch in the OFED distribution -- not even in
> OFED-1.4.1-20090427-0600.
That is b
On Monday 27 April 2009 13:46, Moni Shoua wrote:
> So, Is there an easy way for upstream kernel users that want user space
> functionality?
>
Why can't they just install OFED? This affects ONLY the infiniband modules,
and has undergone
extensive QA on lots of platforms.
- Jack
On Sunday 26 April 2009 15:58, Sasha Khapyorsky wrote:
> On 14:31 Sun 26 Apr , Jack Morgenstein wrote:
> > >
> > > It should be compatible with the OFED 1.4 userspace.
> > >
> > Beware -- you should not use OFED userspace with a non-ofed kernel for
> >
On Sunday 26 April 2009 21:01, Jason Gunthorpe wrote:
> > In general, you should not use OFED userspace libraries with non-OFED
> > kernel distributions.
>
> That is hugely unfriendly and not really 'the linux way'..
>
> Jason
>
I know. I did A LOT of work to avoid incompatibilities. This part
On Friday 24 April 2009 02:48, Jason Gunthorpe wrote:
> AFAIK, Ubuntu does not do any work on their IB drivers, so the driver
> is stock 2.6.27.
>
> In principle OFED is supposed to start with an upstream kernel and
> backport those drivers to various distributions. OFED 1.3 was using
> 2.6.24, OF
On Monday 20 April 2009 15:05, Nicolas Morey-Chaisemartin wrote:
> Hi,
>
> I was wondering why in libibverbs XRC is implemented as patches and not
> directly in the code?
> Are there compatibility problems?
> The latest qperf can't be built even with the latest libibverbs as it requires
> XRC defin
mands.
This patch is an expansion of the INIT_HCA timeout patch submitted by A. Kepner.
Signed-off-by: Jack Morgenstein
Index: ofed_kernel/drivers/infiniband/hw/mthca/mthca_cmd.c
===
--- ofed_kernel.orig/drivers/infiniband/hw/
On Monday 13 April 2009 21:46, akep...@sgi.com wrote:
> Here's a little patch we've been carrying along for a while.
>
> If the num_qp module parameter is set higher than 2^19 or so,
> HCA initialization times out with EBUSY, e.g.:
>
> ib_mthca: probe of 0031:01:00.0 failed with error -16
>
> A
On Sunday 29 March 2009 20:06, Or Gerlitz wrote:
> On Sun, Mar 29, 2009 at 7:35 PM, Roland Dreier wrote:
> > This should bring mainline kernel small message latency to the same
> > level that OFED gets with the PAT support it hacks in.
>
> Interesting... so the ofed support for blue flame (we are
On Saturday 28 March 2009 01:15, Roland Dreier wrote:
> - vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> + vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
>
Roland, I notice in the 2.6.29 code that:
1. There is a function validate_pat_support()
You need to get in touch with Mellanox Support (FAE) at this point.
- Jack
On Monday 02 March 2009 17:07, lakshmana swamy wrote:
>
> Hi Jack,
>
>
> Yes, I connected the cable between two ports of same HCA. Without running
> opensmd.
>
> Now the State is " Initializing"
>
> I observed
.
The second schedule_work operation will then find a non-null port->ah_lock,
and will simply overwrite it in update_sm_ah -- resulting in an ah leak.
Signed-off-by: Jack Morgenstein
diff --git a/drivers/infiniband/core/sa_query.c
b/drivers/infiniband/core/sa_query.c
index 7863a50..1865049 1006
On Monday 02 March 2009 16:38, lakshmana swamy wrote:
>
> HI Jack
>
> I have updated the firmware of HCA in both the machines, but the status
> remains same.
> Please have a look at the following outputs.
>
> What may be the problem ?
>
Your physical connection is bad. Check your cable
On Monday 26 January 2009 22:44, Chuck Hartley wrote:
> Is there some IPoIB debug I can turn on somehow?
>
On each of the hosts, you can do the following:
in file
/etc/modprobe.conf
add the following line:
options ib_ipoib debug_level=15
Then, restart the infiniband driver on both hos
You are running VERY old firmware (from 2004), and moreover, on one host
you have 3.0.0, and on the other 3.1.0.
You need to upgrade your firmware.
Contact your Mellanox FAE (support engineer) for instructions.
- Jack
> Hi Jack,
>
> Please find the output of ibstat on both the nodes, .
>
> [r
On Thursday 26 February 2009 12:59, lakshmana swamy wrote:
Please send me the output of console command: ibstat
Maybe you have old FW.
- Jack
>
> Hi Jack and Mahesh
>
> Thank you for your response.
>
> I have changed the HCA card as well as the IB cables, but it was no use.
>
>
> How can I
Adds device IDs for Mellanox' MT25458
ConnectX+10-GBaseT 10GigE Ethernet devices.
Signed-off-by: Jack Morgenstein
diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index 6ef2490..84db33b 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -1230,6 +1230,8 @@ s
"DOWN" means that you do not have a physical link between the ports. Check
your cables -- they may be bad, or badly inserted.
- Jack
On Thursday 26 February 2009 08:38, lakshmana swamy wrote:
>
> Hi All
>
> I have been trying to enable the IPoIB communication between two machines.
> The
On Monday 23 February 2009 20:31, Roland Dreier wrote:
> > I'm not sure that it does. This does not make sysfs access atomic wrt
> > module unloading.
> > I think an app can still lose its timeslice while inside the sysfs
> > access, and module
> > unload can still occur while the app is waiting
On Monday 23 February 2009 06:40, Roland Dreier wrote:
> Oh I see... we leave the sysfs stuff around way too long, since we want
> to use it for tracking the lifetime of our class device. the patch
> below fixes things for me here... there's still room for substantial
> cleanup but I think this ge
On Friday 06 February 2009 21:39, Brian J. Murrell wrote:
> I get these warnings trying to build with RHEL4U6 and ofa_kernel from OFED
> 1.4:
>
> include/linux/jbd.h:1204:1: warning: "assert_spin_locked" redefined
> In file included from include/linux/wait.h:25,
> from include/li
x.
Signed-off-by: Jack Morgenstein
---
Roland,
I think this patch is a reasonable solution to the sysfs problem of a low-level
driver module being unloaded while sysfs is being accessed for the device.
ib_unregister_device() is always called before the device driver frees up its
resources.
On Sunday 22 February 2009 09:15, Roland Dreier wrote:
> > I ran on RHEL5.2 ...
>
> I suspect that at some point in the 2+ years since 2.6.18 more locking
> was added to sysfs so that this race no longer exists. You could try
> and see if my test (add a sleep to the show method and make sure you
On Friday 20 February 2009 08:50, Roland Dreier wrote:
> What test are you using to hit this race? Are you using a distro kernel
> with OFED?
>
I ran on RHEL5.2, with a ConnectX card, using the following test (source given
at the end of this post):
1. Start the driver.
2. In one console window,
On Wednesday 18 February 2009 00:54, Roland Dreier wrote:
> > Signed-off-by: Jack Morgenstein
> > Signed-off-by: Moni Shua
>
> This doesn't make any sense... Moni was not involved in sending this
> patch at all, and in any case since you are sending the patch your s
We have found a race condition in sysfs.c which occurs when unloading low-level
modules (e.g., mlx4_ib) in the driver.
Specifically:
Although the kernel takes reference counts on sysfs files, it does not take
such counts on modules which implement attribute reads.
For example, we have:
static s
not free an existing path -- just leave it in the
list as-is (i.e., with its valid flag cleared).
Thanks to Yossi Etigin of Voltaire for identifying the problem flow
which caused the kernel crash.
Signed-off-by: Jack Morgenstein
Signed-off-by: Moni Shua
---
Roland,
I ran checkpatch.pl on this
On Wednesday 04 February 2009 18:16, Moni Shoua wrote:
> This one looks good to me.
> Are you going to make a patch and submit it?
>
> I think it would be best if you run the same test on the patched IPoIB before
> submission.
> Do you agree?
>
I'll do a patch tomorrow.
We'll run the test over
On Wednesday 04 February 2009 17:45, Moni Shoua wrote:
> Besides the locking issue, which I haven't thought about yet, this fix
> looks like the right thing to do.
> But what if we leave the path without freeing it even if path_rec_start()
> fails?
> This would leave a path which is not valid in
On Wednesday 04 February 2009 15:33, Moni Shoua wrote:
> Isn't the fix just as simple as this?
>
> void ipoib_mark_paths_invalid(struct net_device *dev)
> {
> struct ipoib_dev_priv *priv = netdev_priv(dev);
> struct ipoib_path *path, *tp;
>
> spin_lock_irq(&priv->lock);
>
On Wednesday 04 February 2009 08:46, Jack Morgenstein wrote:
> On Tuesday 03 February 2009 19:56, Yossi Etigin wrote:
> > I think it comes from unicast_arp_send.
> > Consider this scenario:
> > - paths are flushed (opensm up/down).
> > - unicast_arp_send() is called wit
On Tuesday 03 February 2009 19:56, Yossi Etigin wrote:
> I think it comes from unicast_arp_send.
> Consider this scenario:
> - paths are flushed (opensm up/down).
> - unicast_arp_send() is called with a path in priv->path_list. path->valid is
> 0.
> - path_rec_start() fails with -EAGAIN (-11) beca
We saw the following kernel panic when testing ipoib stability intensively
by simultaneously (i.e., in separate processes, with random wait intervals)
doing:
- ifconfig up/down
- opensm up/down
- ipoib ping
- arp delete
- driver up/down
ib0: ib_sa_path_rec_get failed: -11
ib0: ib_sa_path_rec_get
On Wednesday 28 January 2009 20:53, Roland Dreier wrote:
> > - priv->mcast_mtu =
> IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu));
> > + spin_lock_irq(&priv->lock);
> > + if (priv->broadcast)
> > + priv->mcast_mtu =
> IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcas
in_task.
There is a race whereby the ipoib broadcast pointer may be set to NULL by flush
while the join
task is being started. This protects the broadcast pointer access via a
spinlock. If the
pointer is indeed NULL, we set the mcast_mtu value to the current admin_mtu
value -- since
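The locking pattern described in this changelog can be sketched in userspace as follows. It is a simplified illustration under stated assumptions: a pthread mutex stands in for the kernel spinlock, demo_priv and demo_group are hypothetical structures, and the MTU conversion done by the real code is left out.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the broadcast group. */
struct demo_group {
	int mtu;
};

/* Hypothetical stand-in for the per-device private data. */
struct demo_priv {
	pthread_mutex_t lock;         /* stands in for the kernel spinlock */
	struct demo_group *broadcast; /* may be set to NULL by a concurrent flush */
	int admin_mtu;
	int mcast_mtu;
};

/* Read the broadcast pointer only while holding the lock; if it has
 * already been cleared, fall back to the administratively set MTU. */
static void demo_update_mcast_mtu(struct demo_priv *priv)
{
	assert(priv != NULL);
	pthread_mutex_lock(&priv->lock);
	if (priv->broadcast)
		priv->mcast_mtu = priv->broadcast->mtu;
	else
		priv->mcast_mtu = priv->admin_mtu;
	pthread_mutex_unlock(&priv->lock);
}

/* Builds a private structure, runs the update once, returns the result. */
static int demo_mtu_after_update(int have_broadcast, int group_mtu,
				 int admin_mtu)
{
	struct demo_group grp = { .mtu = group_mtu };
	struct demo_priv priv = {
		.broadcast = have_broadcast ? &grp : NULL,
		.admin_mtu = admin_mtu,
		.mcast_mtu = 0,
	};
	int mtu;

	pthread_mutex_init(&priv.lock, NULL);
	demo_update_mcast_mtu(&priv);
	mtu = priv.mcast_mtu;
	pthread_mutex_destroy(&priv.lock);
	return mtu;
}
```

In the sketch, a present broadcast group supplies the MTU, while a NULL pointer leaves the caller with admin_mtu instead of dereferencing freed or cleared state.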
The following Oops occurred several times on an X86 host when unloading the
driver:
(console command sequence:
/etc/init.d/openibd start
opensm &
pkill -2 opensm
/etc/init.d/openibd stop
)
IP: [] :ib_ipoib:ipoib_mcast_join_ta
We saw the following kernel panic when testing ipoib stability intensively
by simultaneously (i.e., in separate processes, with random wait intervals)
doing:
- ifconfig up/down
- opensm up/down
- ipoib ping
- arp delete
- driver up/down
Does anyone have ideas as to what might have happened?
(the
On Friday 16 January 2009 22:10, Roland Dreier wrote:
> So I'll merge the patch with the wmb() there, and you can convince me to
> get rid of it later if my reasoning is wrong.
>
We did performance testing on your version of the patch, and my version,
and there was no statistically significant
On Friday 16 January 2009 22:02, Roland Dreier wrote:
> OK, I think I'm going to merge my version of the patch. If there really
> is a performance penalty I'd rather move the mlx transport stuff
> out-of-line first rather than make the code too unreadable with gotos and
> duplication etc.
>
Roland
On Wednesday 14 January 2009 12:21, Philip Frey1 wrote:
> Hello,
>
> I recently upgraded from OFED 1.3 to 1.4 and the behaviour of an STag of
> zero seems to have changed.
>
Did you try sending with
send_wr.sg_list = NULL;
send_wr.num_sge = 0;
?
(if this works, it should resul
rated, check for CLIENT REREG.
Thoughts?
-----Original Message-----
> From: Moni Shoua [mailto:mo...@voltaire.com]
> Sent: Wednesday, January 14, 2009 6:21 PM
> To: Roland Dreier
> Cc: Jack Morgenstein; Olga Stern; Yossi Etigin; OpenFabrics General
> Subject: Re: [ofa-general] [PATCH] m