Re: [ofa-general] ibpanic

2009-09-09 Thread Bernd Schubert
On Wednesday 09 September 2009, Tziporet Koren wrote: Seems that IB connection is lost completely (lustre not working, can't ping remote IPoIB addresses etc.). It seems HW issue but to ensure it can you see if it also happened with OFED 1.4? If yes please approach Mellanox support since

Re: [ofa-general] ib_mthca 0000:0d:00.0: Async event 16 for bogus QP 00da0407

2009-04-11 Thread Bernd Schubert
Hello Pawel, sorry for my late reply. On Monday 06 April 2009, Pawel Dziekonski wrote: On Thu, 02 Apr 2009 at 08:07:20PM +0200, Bernd Schubert wrote: Hello, I'm fighting (as usual) with some Lustre problems and I think this time it is IB related. In the logs of some systems I see

Re: [ofa-general] ib_mthca 0000:0d:00.0: Async event 16 for bogus QP 00da0407

2009-04-11 Thread Bernd Schubert
On Monday 06 April 2009, Tziporet Koren wrote: Bernd Schubert wrote: Hello, I'm fighting (as usual) with some Lustre problems and I think this time it is IB related. In the logs of some systems I see messages like these: ib_mthca 0000:0d:00.0: Async event 16 for bogus QP 00da0407

[ofa-general] mlx4: errors and failures on OOM

2009-04-11 Thread Bernd Schubert
Hello, last week I had issues with Lustre failures, which turned out to be failures of many clients, which ran out of memory due to bad user space jobs (and no protection against that by the queuing system). Anyway, I don't think IB is supposed to fail when the OOM killer activates.

[ofa-general] ib_mthca 0000:0d:00.0: Async event 16 for bogus QP 00da0407

2009-04-02 Thread Bernd Schubert
Hello, I'm fighting (as usual) with some Lustre problems and I think this time it is IB related. In the logs of some systems I see messages like these: ib_mthca 0000:0d:00.0: Async event 16 for bogus QP 00da0407 Anyone knows what is the meaning of that? The kernel modules are from

Re: [ofa-general] ConnectX IB HCA with Ubuntu 8.04

2008-09-05 Thread Bernd Schubert
/bernd/downloads/infiniband/etch ./ # Hardy packages, but not recently maintained (only for my workstation) deb http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/infiniband/hardy/ ./ Cheers, Bernd -- Bernd Schubert Q-Leap Networks GmbH

Re: [ofa-general] debian build instructions?

2008-07-11 Thread Bernd Schubert
On Fri, Jul 11, 2008 at 07:47:44PM +, John Marshall wrote: Roland Dreier wrote: There seem to be a lot of people on this list. Is there no one working with debian that might be able to give a pointer? I use Debian for pretty much all my development. However I haven't tried to use

Re: [ofa-general] debian build instructions?

2008-07-11 Thread Bernd Schubert
On Fri, Jul 11, 2008 at 01:09:37PM -0700, Roland Dreier wrote: Except mstflint we have already packaged these. Roland, would you sponsor our uploads (we first need to add proper package descriptions, though). Or could we create an alioth group? I am not a Debian developer either...

Re: [ofa-general] debian build instructions?

2008-07-10 Thread Bernd Schubert
a lot of work. Here is a sources.list entry for etch (amd64) and packages we already have (sources packages will follow later on). deb http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/infiniband/ ./ Cheers, Bernd -- Bernd Schubert Q-Leap Networks GmbH

Re: [ofa-general] ERR 0108: Unknown remote side

2008-04-09 Thread Bernd Schubert
Hello Yevgeny! On Tuesday 08 April 2008 22:22:38 Yevgeny Kliteynik wrote: Sasha Khapyorsky wrote: Hi Bernd, [adding Yevgeny..] On 11:35 Tue 08 Apr , Bernd Schubert wrote: On Tuesday 08 April 2008 03:44:06 Sasha Khapyorsky wrote: Hi Bernd, On 11:47 Fri 04 Apr , Bernd Schubert

Re: [ofa-general] Re: [PATCH] opensm/opensm/osm_subnet.c: add checks for HOQ and Leaf HOQ input values

2008-04-09 Thread Bernd Schubert
. What about # The maximum is 0x14, which will disable this mechanism.\n Thanks, Bernd -- Bernd Schubert Q-Leap Networks GmbH ___ general mailing list general@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
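The range check this thread discusses can be sketched as follows. This is a minimal illustration and not the opensm patch itself: the only fact taken from the thread is that 0x14 is the maximum and disables the mechanism. The function names are hypothetical, and the timeout formula (4.096 usec * 2^value) is an assumption based on how the option is usually documented.

```python
# Hypothetical sketch of validating HOQ lifetime input values.
# From the thread: the maximum is 0x14, which disables the mechanism.
HOQ_DISABLE = 0x14


def check_hoq_lifetime(value, name="head_of_queue_lifetime"):
    """Return value unchanged if it is in range, otherwise raise ValueError."""
    if not 0 <= value <= HOQ_DISABLE:
        raise ValueError(
            f"{name}=0x{value:x} is out of range (maximum is 0x{HOQ_DISABLE:x})"
        )
    return value


def hoq_timeout_usec(value):
    """Return the timeout in microseconds, or None when the mechanism is disabled.

    Assumption (not stated in the thread): timeout = 4.096 usec * 2^value.
    """
    check_hoq_lifetime(value)
    if value == HOQ_DISABLE:
        return None
    return 4.096 * (2 ** value)
```

With a check like this, a setting such as leaf_head_of_queue_lifetime 0x20 would be rejected instead of being silently accepted.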

Re: [ofa-general] ERR 0108: Unknown remote side

2008-04-08 Thread Bernd Schubert
Hello Sasha, On Tuesday 08 April 2008 03:44:06 Sasha Khapyorsky wrote: Hi Bernd, On 11:47 Fri 04 Apr , Bernd Schubert wrote: opensm-3.2.1 logs some error messages like this: Apr 04 00:00:08 325114 [4580A960] 0x01 - __osm_state_mgr_light_sweep_start: ERR 0108: Unknown remote side

Re: [ofa-general] XmtDiscards

2008-04-07 Thread Bernd Schubert
Hello Hal, On Monday 07 April 2008 15:35:10 Hal Rosenstock wrote: Hi Bernd, On Sun, 2008-04-06 at 18:05 +0200, Bernd Schubert wrote: Hello Hal, Searching for this error I find This is a symptom of congestion and may require tweaking either HOQ or switch lifetime values. Well, I

Re: [ofa-general] XmtDiscards

2008-04-06 Thread Bernd Schubert
Hello Hal, On Sat, Apr 05, 2008 at 06:19:43AM -0700, Hal Rosenstock wrote: Hi Bernd, On Sat, 2008-04-05 at 00:12 +0200, Bernd Schubert wrote: Hello, after I upgraded one of our clusters to opensm-3.2.1 it seems to have gotten much better there, at least no further

Re: [ofa-general] XmtDiscards

2008-04-06 Thread Bernd Schubert
Hello Sasha, On Sun, Apr 06, 2008 at 06:53:14AM +, Sasha Khapyorsky wrote: On 01:45 Sat 05 Apr , Bernd Schubert wrote: Hmm, I first increased head_of_queue_lifetime to 0x13 and leaf_head_of_queue_lifetime to 0x20, but this didn't make the error go away. So I increased

[ofa-general] [PATCH] parse_node_map: print parse errors

2008-04-04 Thread Bernd Schubert
error in line: %s\n, +__func__, line); return -1; } Thanks, Bernd -- Bernd Schubert Q-Leap Networks GmbH

[ofa-general] ERR 0108: Unknown remote side

2008-04-04 Thread Bernd Schubert
the other interconnections seem to be fine. Thanks, Bernd -- Bernd Schubert Q-Leap Networks GmbH

Re: [ofa-general] ERR 0108: Unknown remote side

2008-04-04 Thread Bernd Schubert
On Fri, Apr 04, 2008 at 10:55:21AM -0700, Hal Rosenstock wrote: On Fri, 2008-04-04 at 11:47 +0200, Bernd Schubert wrote: Hello, opensm-3.2.1 logs some error messages like this: Apr 04 00:00:08 325114 [4580A960] 0x01 - __osm_state_mgr_light_sweep_start: ERR 0108: Unknown remote

[ofa-general] XmtDiscards

2008-04-04 Thread Bernd Schubert
on Flextronics switches? Thanks for any help, Bernd -- Bernd Schubert Q-Leap Networks GmbH

Re: [ofa-general] XmtDiscards

2008-04-04 Thread Bernd Schubert
Hello Boris, On Fri, Apr 04, 2008 at 03:28:46PM -0700, Boris Shpolyansky wrote: Hi Bernd, You can configure the HOQ (Head-Of-Queue-Lifetime) value programmed in any switch in the fabric managed by OpenSM following these simple steps: 1. Stop the SM /etc/init.d/opensmd stop 2. Run the

Re: [ofa-general] XmtDiscards

2008-04-04 Thread Bernd Schubert
On Fri, Apr 04, 2008 at 03:29:32PM -0700, Ira Weiny wrote: On Sat, 5 Apr 2008 00:12:39 +0200 Bernd Schubert [EMAIL PROTECTED] wrote: Hello, after I upgraded one of our clusters to opensm-3.2.1 it seems to have gotten much better there, at least no further RcvSwRelayErrors, even

[ofa-general] RcvSwRelayErrors

2008-03-20 Thread Bernd Schubert
() 1: [RcvSwRelayErrors == 196] 3: [RcvSwRelayErrors == 242] [...] -- Bernd Schubert Q-Leap Networks GmbH
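Counter lines of the shape quoted in this message can be collected into a table with a short script. This is only a sketch: it assumes the leading number is a port number (the truncated snippet does not say so), and the function name is hypothetical.

```python
import re

# Matches fragments such as "1: [RcvSwRelayErrors == 196]".
# Assumption: the leading number identifies a port.
COUNTER_RE = re.compile(r"(\d+):\s*\[(\w+)\s*==\s*(\d+)\]")


def parse_port_counters(text):
    """Return {port: {counter_name: value}} for every match found in text."""
    result = {}
    for port, name, value in COUNTER_RE.findall(text):
        result.setdefault(int(port), {})[name] = int(value)
    return result


sample = "1: [RcvSwRelayErrors == 196] 3: [RcvSwRelayErrors == 242]"
counters = parse_port_counters(sample)
```

Comparing two such tables taken some time apart shows whether a counter is still increasing or only holds old values.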

Re: [ofa-general] RcvSwRelayErrors

2008-03-20 Thread Bernd Schubert
On Thursday 20 March 2008 13:27:36 Hal Rosenstock wrote: On Thu, 2008-03-20 at 12:30 +0100, Bernd Schubert wrote: Hello, on one of our systems we get a rather huge number of RcvSwRelayErrors. All I find about RcvSwRelayErrors is This counter can increase due to a valid network event

Re: [ofa-general] RcvSwRelayErrors

2008-03-20 Thread Bernd Schubert
On Thursday 20 March 2008 15:12:03 Hal Rosenstock wrote: On Thu, 2008-03-20 at 13:54 +0100, Bernd Schubert wrote: On Thursday 20 March 2008 13:27:36 Hal Rosenstock wrote: On Thu, 2008-03-20 at 12:30 +0100, Bernd Schubert wrote: Hello, on one of our systems we get a rather huge

Re: [ofa-general] RcvSwRelayErrors

2008-03-20 Thread Bernd Schubert
On Thursday 20 March 2008 15:29:35 Hal Rosenstock wrote: On Thu, 2008-03-20 at 15:27 +0100, Bernd Schubert wrote: On Thursday 20 March 2008 15:12:03 Hal Rosenstock wrote: On Thu, 2008-03-20 at 13:54 +0100, Bernd Schubert wrote: On Thursday 20 March 2008 13:27:36 Hal Rosenstock wrote

Re: [ofa-general] RcvSwRelayErrors

2008-03-20 Thread Bernd Schubert
On Thursday 20 March 2008 15:41:40 Hal Rosenstock wrote: On Thu, 2008-03-20 at 15:33 +0100, Bernd Schubert wrote: On Thursday 20 March 2008 15:29:35 Hal Rosenstock wrote: On Thu, 2008-03-20 at 15:27 +0100, Bernd Schubert wrote: On Thursday 20 March 2008 15:12:03 Hal Rosenstock wrote

[ofa-general] page allocation failure

2008-02-28 Thread Bernd Schubert
+0x47/0x75 [54277.944032] [8020a2f8] child_rip+0xa/0x12 [54277.950581] Any ideas? Thanks, Bernd -- Bernd Schubert Q-Leap Networks GmbH

Re: [ofa-general] page allocation failure

2008-02-28 Thread Bernd Schubert
On Thursday 28 February 2008 18:42:19 Bernd Schubert wrote: Hello, on several of our Lustre servers we can see page allocation failures. This is with 2.6.22 + kernel modules from ofed 1.2.5 Er, correction, it's 1.2.5.5 [44464.764559] Lustre: 24052:0:(ldlm_lib.c:698:target_handle_connect

Re: [ofa-general] MT25418

2007-11-29 Thread Bernd Schubert
batch of HCAs or old firmware or something. So we know it's either a problem with 2.6.22 or a hardware problem. I will test 2.6.24-rc1 during the next days and keep you posted. Thanks again, Bernd -- Bernd Schubert Q-Leap Networks GmbH

[ofa-general] opensm --lcm

2007-11-12 Thread Bernd Schubert
to the mts2400 switch, don't we? Furthermore, from manpage I would think we even need this option for proper inter-communication between the switch modules? Thanks in advance, Bernd -- Bernd Schubert Q-Leap Networks GmbH

Re: [ofa-general] MT25418

2007-11-09 Thread Bernd Schubert
for userspace QPs from query QP Cheers, Bernd -- Bernd Schubert Q-Leap Networks GmbH

Re: [ofa-general] MT25418

2007-11-09 Thread Bernd Schubert
On Thursday 08 November 2007 23:39:44 Bernd Schubert wrote: At least defining the ib0 interface works now, too. Presently I can't test it further, since it's connected to a flaky mts2400 switch, which needs a reset. In principle the cards do work, but only port-1. There's a problem with port

Re: [ofa-general] MT25418

2007-11-09 Thread Bernd Schubert
at once, with both ports connected to the same fabric? If so you're probably running into the standard ARP filtering issue... No, only one port connected. Cheers, Bernd -- Bernd Schubert Q-Leap Networks GmbH

Re: [ofa-general] MT25418

2007-11-09 Thread Bernd Schubert
:) 6e694ea3 (IB/mlx4: Fix data corruption triggered by wrong headroom marking order) Hmm, sometimes I'm simply blind ;) Thanks for the hint! Cheers, Bernd -- Bernd Schubert Q-Leap Networks GmbH

Re: [ofa-general] MT25418

2007-11-09 Thread Bernd Schubert
exactly and reproducible on all 6 nodes with ConnectX presently here in our test lab. Just by accident I first always had connected port 2. Shortly before I already thought it doesn't work at all, I tried the other port... Cheers, Bernd -- Bernd Schubert Q-Leap Networks GmbH

Re: [ofa-general] MT25418

2007-11-08 Thread Bernd Schubert
On Thu, Nov 08, 2007 at 02:27:03PM -0800, Roland Dreier wrote: Roland, thanks for your help. On modprobing this module simply nothing does happen, absolutely nothing in dmesg. Do you know if mlx4_core was already loaded? It probably would be loaded by driver autoloading based on

[ofa-general] MT25418

2007-11-08 Thread Bernd Schubert
Capabilities: [48] Vital Product Data Capabilities: [84] MSI-X: Enable- Mask- TabSize=256 Capabilities: [60] Express Endpoint IRQ 0 I tried to grep for mlx4 devices, but don't find any definition at all. Any help is appreciated. Thanks, Bernd -- Bernd Schubert Q-Leap Networks

Re: [ofa-general] MT25418

2007-11-08 Thread Bernd Schubert
On Thu, Nov 08, 2007 at 01:34:08PM -0800, Roland Dreier wrote: we have a card here that is not supported by 2.6.22, I haven't tested 2.6.23 yet, but if I'm not mistaken this pci-id is not defined in 2.6.23 too. 09:00.0 InfiniBand: Mellanox Technologies Unknown device 634a (rev a0)

Re: [ofa-general] MPI IB Errors

2007-08-23 Thread Bernd Schubert
Yesterday I also had this problem, can you try modprobe rdma_ucm and/or modprobe rdma_cm? Cheers, Bernd -- Bernd Schubert Q-Leap Networks GmbH
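Before running modprobe, it can help to see which of the two modules named in this reply are already loaded. The sketch below parses text in the format of /proc/modules, where the first field of each line is the module name. The function name and the sample text are illustrative only.

```python
def missing_modules(proc_modules_text, wanted=("rdma_cm", "rdma_ucm")):
    """Return the wanted module names that do not appear in /proc/modules text."""
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in wanted if name not in loaded]


# Illustrative sample in /proc/modules format (name size refcount deps state address).
sample = (
    "rdma_cm 32768 1 rdma_ucm, Live 0x0000000000000000\n"
    "ib_core 65536 2 rdma_cm, Live 0x0000000000000000\n"
)
to_load = missing_modules(sample)
```

On a live system the input would come from `open("/proc/modules").read()`, and each returned name is a candidate for `modprobe`.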

Re: [ofa-general] MPI IB Errors

2007-08-23 Thread Bernd Schubert
-- Bernd Schubert Q-Leap Networks GmbH

Re: [ofa-general] UPDN algorithm

2007-08-23 Thread Bernd Schubert
Hello Sasha, On Thursday 23 August 2007 16:46:16 you wrote: Hi, On 15:46 Thu 23 Aug , Bernd Schubert wrote: I can't make opensm use the UPDN algorithm. Presently we have 3 switches connected with each other and one is supposed to be the root switch. I created an /etc/osm.root

Re: [ofa-general] UPDN algorithm

2007-08-23 Thread Bernd Schubert
to the next week. Thanks for your help, Bernd -- Bernd Schubert Q-Leap Networks GmbH

Re: [ofa-general] ibnetdiscover

2007-08-08 Thread Bernd Schubert
0002c9010c7341f0 I will post the patch shortly (still need to add an optional different filename than /etc/ib.id). Anyway, do you already wish to have something different when you look at the new output? Cheers, Bernd -- Bernd Schubert Q-Leap Networks GmbH

Re: [ofa-general] librdmacm_to_2_6_20.patch

2007-06-25 Thread Bernd Schubert
On Friday 22 June 2007 20:57:49 you wrote: Bernd Schubert wrote: Hi, there are patches to make rdma of ofed-1.1 compatible with 2.6.20 (https://svn.openfabrics.org/svn/openib/gen2/trunk/ofed/patches/user_fixes/librdmacm_to_2_6_20.patch and perftest_to_2_6_20.patch). The entire

[ofa-general] librdmacm_to_2_6_20.patch

2007-06-22 Thread Bernd Schubert
`private_data_len': This is easy to fix. Is there a more recent working version of the patch available or can you give me at least some hints what to do with the rdma_get_option() call? Thanks in advance, Bernd -- Bernd Schubert Q-Leap Networks GmbH

Re: [ofa-general] possible irq lock inversion dependency detected

2007-05-16 Thread Bernd Schubert
patch and a couple of hours uptime no warnings so far, so I guess your patch fixed it. If you shouldn't hear anything from me until Thursday, it definitely fixed it. Thanks, Bernd -- Bernd Schubert Q-Leap Networks GmbH

Re: [ofa-general] [PATCH] IB/ipath - Don't call spin_lock_irq() from interrupt context

2007-05-07 Thread Bernd Schubert
On Friday 27 April 2007 20:11:11 Ralph Campbell wrote: This patch fixes the problem reported by Bernd Schubert [EMAIL PROTECTED] with kernel debug options enabled. BUG: at kernel/lockdep.c:1860 trace_hardirqs_on() Hopefully, this can be included in OFED 1.2 as well as going upstream

[ofa-general] ipath irq bug

2007-04-20 Thread Bernd Schubert
+0x37/0x51 [ 2651.410972] [80207cb8] default_idle+0x35/0x51 [ 2651.416824] [80207e4e] cpu_idle+0x5b/0x7a [ 2651.422270] [8063566c] start_secondary+0x470/0x47f [ 2651.428267] Kernel is 2.6.20.4. Any ideas? Thanks, Bernd -- Bernd Schubert Q-Leap Networks GmbH

[ofa-general] ipath oops

2007-03-30 Thread Bernd Schubert
->s_cur_sge = qp->s_sge and qp->s_sge.sge = wqe->sg_list[0]; If I see it right wqe->sg_list[0] or wqe->sg_list[0].vaddr is wrong, but so far I haven't tracked down where this is set. Any help to solve the problem is appreciated. Thanks in advance, Bernd -- Bernd Schubert Q-Leap Networks GmbH