On Wednesday 09 September 2009, Tziporet Koren wrote:
Seems that the IB connection is lost completely (Lustre not working, can't
ping remote IPoIB addresses, etc.).
It seems to be a HW issue, but to make sure, can you check whether it also happened with
OFED 1.4?
If yes, please approach Mellanox support since
Hello Pawel,
sorry for my late reply.
On Monday 06 April 2009, Pawel Dziekonski wrote:
On Thu, 02 Apr 2009 at 08:07:20PM +0200, Bernd Schubert wrote:
Hello,
I'm fighting (as usual) with some Lustre problems and I think this time
it is IB related. In the logs of some systems I see
On Monday 06 April 2009, Tziporet Koren wrote:
Bernd Schubert wrote:
Hello,
I'm fighting (as usual) with some Lustre problems and I think this time
it is IB related. In the logs of some systems I see messages like these:
ib_mthca :0d:00.0: Async event 16 for bogus QP 00da0407
Hello,
last week I had issues with Lustre failures, which turned out to be
failures of many clients that ran out of memory due to bad user-space
jobs
(and no protection against that by the queuing system).
Anyway, I don't think IB is supposed to fail when the OOM killer activates.
Hello,
I'm fighting (as usual) with some Lustre problems and I think this time it is
IB related. In the logs of some systems I see messages like these:
ib_mthca :0d:00.0: Async event 16 for bogus QP 00da0407
Does anyone know what that means? The kernel modules are from
/bernd/downloads/infiniband/etch ./
# Hardy packages, but not recently maintained (only for my workstation)
deb http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/infiniband/hardy/ ./
Cheers,
Bernd
--
Bernd Schubert
Q-Leap Networks GmbH
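For anyone decoding the "Async event 16" message above: a minimal sketch, assuming the printed number is a value of the kernel's enum ib_event_type and that its order matches enum ibv_event_type in libibverbs of that period (check the headers of your own kernel before relying on it). Under that assumption, 16 would be QP_LAST_WQE_REACHED, an event raised for QPs attached to an SRQ. The sketch only names the event; it does not explain why the driver reported the QP as bogus. The helper names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Assumption: mirrors the order of enum ib_event_type (kernel) and
 * enum ibv_event_type (libibverbs) of that era; verify locally. */
static const char *const ib_event_names[] = {
    "CQ_ERR", "QP_FATAL", "QP_REQ_ERR", "QP_ACCESS_ERR",
    "COMM_EST", "SQ_DRAINED", "PATH_MIG", "PATH_MIG_ERR",
    "DEVICE_FATAL", "PORT_ACTIVE", "PORT_ERR", "LID_CHANGE",
    "PKEY_CHANGE", "SM_CHANGE", "SRQ_ERR", "SRQ_LIMIT_REACHED",
    "QP_LAST_WQE_REACHED", "CLIENT_REREGISTER",
};

/* Hypothetical helper: symbolic name for an event number, or "UNKNOWN". */
static const char *ib_event_name(int ev)
{
    size_t n = sizeof(ib_event_names) / sizeof(ib_event_names[0]);

    if (ev < 0 || (size_t)ev >= n)
        return "UNKNOWN";
    return ib_event_names[ev];
}

/* Hypothetical helper: extract the event number and QP number from a
 * "... Async event <n> for bogus QP <hex>" log line.
 * Returns 1 on success, 0 if the line does not match. */
static int parse_bogus_qp_event(const char *line, int *ev, unsigned int *qpn)
{
    const char *p = strstr(line, "Async event ");

    if (p == NULL)
        return 0;
    return sscanf(p, "Async event %d for bogus QP %x", ev, qpn) == 2;
}
```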
On Fri, Jul 11, 2008 at 07:47:44PM +, John Marshall wrote:
Roland Dreier wrote:
There seem to be a lot of people on this list. Is there
no one working with Debian who might be able to
give a pointer?
I use Debian for pretty much all my development. However, I haven't
tried to use
On Fri, Jul 11, 2008 at 01:09:37PM -0700, Roland Dreier wrote:
Except for mstflint, we have already packaged these. Roland, would you sponsor
our
uploads (we first need to add proper package descriptions, though)? Or
could we create an Alioth group?
I am not a Debian developer either...
a lot of work.
Here is a sources.list entry for etch (amd64) and the packages we already have
(source packages will follow later).
deb http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/infiniband/ ./
Cheers,
Bernd
Hello Yevgeny!
On Tuesday 08 April 2008 22:22:38 Yevgeny Kliteynik wrote:
Sasha Copyist wrote:
Hi Bernd,
[adding Yevgeny..]
On 11:35 Tue 08 Apr , Bernd Schubert wrote:
On Tuesday 08 April 2008 03:44:06 Sasha Copyist wrote:
Hi Bernd,
On 11:47 Fri 04 Apr , Bernd Schubert
.
What about
# The maximum is 0x14, which will disable this mechanism.\n
Thanks,
Bernd
___
general mailing list
general@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
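To put numbers on the HOQ values discussed in this thread: a minimal sketch, assuming the InfiniBand encoding for the 5-bit HOQLife and switch LifeTimeValue fields, where the lifetime is 4.096 µs × 2^value and values of 0x14 and above disable the timer (consistent with the patch comment above). hoq_lifetime_us() and HOQ_INFINITE_THRESHOLD are hypothetical names, not opensm code.

```c
#include <stdint.h>

/* Assumed encoding: lifetime = 4.096 us * 2^value; values >= 0x14
 * mean "infinite", i.e. the head-of-queue timer is disabled. */
#define HOQ_INFINITE_THRESHOLD 0x14u

/* Hypothetical helper: lifetime in microseconds, or -1.0 for infinite. */
static double hoq_lifetime_us(uint8_t value)
{
    if (value >= HOQ_INFINITE_THRESHOLD)
        return -1.0;
    /* value <= 0x13 here, so the shift cannot overflow */
    return 4.096 * (double)(1u << value);
}
```

Under this encoding, 0x13 corresponds to 4.096 µs × 2^19, roughly 2.15 s, while 0x20 (tried for leaf_head_of_queue_lifetime later in the thread) already lies above the 0x14 threshold.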
Hello Sasha,
On Tuesday 08 April 2008 03:44:06 Sasha Copyist wrote:
Hi Bernd,
On 11:47 Fri 04 Apr , Bernd Schubert wrote:
opensm-3.2.1 logs some error messages like this:
Apr 04 00:00:08 325114 [4580A960] 0x01 -
__osm_state_mgr_light_sweep_start: ERR 0108: Unknown remote side
Hello Hal,
On Monday 07 April 2008 15:35:10 Hal Rosenstock wrote:
Hi Bernd,
On Sun, 2008-04-06 at 18:05 +0200, Bernd Schubert wrote:
Hello Hal,
Searching for this error, I find "This is a symptom of congestion and
may require tweaking either HOQ or switch lifetime values."
Well, I
Hello Hal,
On Sat, Apr 05, 2008 at 06:19:43AM -0700, Hal Rosenstock wrote:
Hi Bernd,
On Sat, 2008-04-05 at 00:12 +0200, Bernd Schubert wrote:
Hello,
after I upgraded one of our clusters to opensm-3.2.1, it seems to have
gotten
much better there, at least no further
Hello Sasha,
On Sun, Apr 06, 2008 at 06:53:14AM +, Sasha Khapyorsky wrote:
On 01:45 Sat 05 Apr , Bernd Schubert wrote:
Hmm, I first increased head_of_queue_lifetime to 0x13 and
leaf_head_of_queue_lifetime to 0x20, but this didn't make the error
go away. So I increased
error in line: %s\n,
+__func__, line);
return -1;
}
Thanks,
Bernd
the other interconnections seem to be fine.
Thanks,
Bernd
On Fri, Apr 04, 2008 at 10:55:21AM -0700, Hal Rosenstock wrote:
On Fri, 2008-04-04 at 11:47 +0200, Bernd Schubert wrote:
Hello,
opensm-3.2.1 logs some error messages like this:
Apr 04 00:00:08 325114 [4580A960] 0x01 -
__osm_state_mgr_light_sweep_start:
ERR 0108: Unknown remote
on Flextronics switches?
Thanks for any help,
Bernd
Hello Boris,
On Fri, Apr 04, 2008 at 03:28:46PM -0700, Boris Shpolyansky wrote:
Hi Bernd,
You can configure the HOQ (Head-Of-Queue-Lifetime) value programmed in
any switch in the fabric managed by OpenSM following these simple steps:
1. Stop the SM
/etc/init.d/opensmd stop
2. Run the
On Fri, Apr 04, 2008 at 03:29:32PM -0700, Ira Weiny wrote:
On Sat, 5 Apr 2008 00:12:39 +0200
Bernd Schubert [EMAIL PROTECTED] wrote:
Hello,
after I upgraded one of our clusters to opensm-3.2.1, it seems to have
gotten
much better there, at least no further RcvSwRelayErrors, even
()
1: [RcvSwRelayErrors == 196]
3: [RcvSwRelayErrors == 242]
[...]
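The counter dump above has a regular shape, so totals across ports are easy to script. A minimal sketch, assuming every counter line looks like "<port>: [RcvSwRelayErrors == <n>]"; parse_relay_errors(), sum_relay_errors() and the threshold are hypothetical. As the description quoted later in this thread says, this counter can also increase due to valid network events, so a large total alone does not prove a fault.

```c
#include <stdio.h>

/* Hypothetical helper: parse one "<port>: [RcvSwRelayErrors == <n>]" line.
 * Returns 1 on success, 0 for lines of any other shape. */
static int parse_relay_errors(const char *line, int *port, unsigned long *count)
{
    return sscanf(line, " %d: [RcvSwRelayErrors == %lu]", port, count) == 2;
}

/* Sum the counters of all parseable lines and count how many ports exceed
 * a caller-chosen threshold (an assumption, not a value from the thread). */
static unsigned long sum_relay_errors(const char *const *lines, size_t n,
                                      unsigned long threshold, size_t *noisy)
{
    unsigned long total = 0;

    *noisy = 0;
    for (size_t i = 0; i < n; i++) {
        int port;
        unsigned long count;

        if (!parse_relay_errors(lines[i], &port, &count))
            continue;               /* skip "[...]" and other non-counter lines */
        total += count;
        if (count > threshold)
            (*noisy)++;
    }
    return total;
}
```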
On Thursday 20 March 2008 13:27:36 Hal Rosenstock wrote:
On Thu, 2008-03-20 at 12:30 +0100, Bernd Schubert wrote:
Hello,
on one of our systems we get a rather huge number of RcvSwRelayErrors.
All I find about RcvSwRelayErrors is
This counter can increase due to a valid network event
On Thursday 20 March 2008 15:12:03 Hal Rosenstock wrote:
On Thu, 2008-03-20 at 13:54 +0100, Bernd Schubert wrote:
On Thursday 20 March 2008 13:27:36 Hal Rosenstock wrote:
On Thu, 2008-03-20 at 12:30 +0100, Bernd Schubert wrote:
Hello,
on one of our systems we get a rather huge
On Thursday 20 March 2008 15:29:35 Hal Rosenstock wrote:
On Thu, 2008-03-20 at 15:27 +0100, Bernd Schubert wrote:
On Thursday 20 March 2008 15:12:03 Hal Rosenstock wrote:
On Thu, 2008-03-20 at 13:54 +0100, Bernd Schubert wrote:
On Thursday 20 March 2008 13:27:36 Hal Rosenstock wrote
On Thursday 20 March 2008 15:41:40 Hal Rosenstock wrote:
On Thu, 2008-03-20 at 15:33 +0100, Bernd Schubert wrote:
On Thursday 20 March 2008 15:29:35 Hal Rosenstock wrote:
On Thu, 2008-03-20 at 15:27 +0100, Bernd Schubert wrote:
On Thursday 20 March 2008 15:12:03 Hal Rosenstock wrote
+0x47/0x75
[54277.944032] [8020a2f8] child_rip+0xa/0x12
[54277.950581]
Any ideas?
Thanks,
Bernd
On Thursday 28 February 2008 18:42:19 Bernd Schubert wrote:
Hello,
on several of our Lustre servers we see page allocation failures.
This is with 2.6.22 + kernel modules from OFED 1.2.5.
Er, correction, it's 1.2.5.5.
[44464.764559] Lustre: 24052:0:(ldlm_lib.c:698:target_handle_connect
batch of HCAs or old firmware or something.
So we know it's either a problem with 2.6.22 or a hardware problem. I will test
2.6.24-rc1 in the next few days and keep you posted.
Thanks again,
Bernd
to the mts2400
switch, don't we?
Furthermore, from the manpage I would think we even need this
option for proper communication between the switch modules?
Thanks in advance,
Bernd
for userspace QPs from query QP
Cheers,
Bernd
On Thursday 08 November 2007 23:39:44 Bernd Schubert wrote:
At least defining the ib0 interface works now, too. I can't test it
further at the moment, since it's connected to a flaky mts2400 switch,
which needs a reset.
In principle, the cards do work, but only port 1. There's a problem with
port
at once, with both ports
connected to the same fabric? If so you're probably running into the
standard ARP filtering issue...
No, only one port connected.
Cheers,
Bernd
:)
6e694ea3 (IB/mlx4: Fix data corruption triggered by wrong headroom marking
order)
Hmm, sometimes I'm simply blind ;)
Thanks for the hint!
Cheers,
Bernd
exactly, and reproducible on all 6 nodes with ConnectX presently here in
our test lab.
Just by accident, I had always connected port 2 first. Just when I was about to
conclude it doesn't work at all, I tried the other port...
Cheers,
Bernd
On Thu, Nov 08, 2007 at 02:27:03PM -0800, Roland Dreier wrote:
Roland, thanks for your help. When I modprobe this module, simply nothing
happens; there is absolutely nothing in dmesg.
Do you know if mlx4_core was already loaded? It probably would be
loaded by driver autoloading based on
Capabilities: [48] Vital Product Data
Capabilities: [84] MSI-X: Enable- Mask- TabSize=256
Capabilities: [60] Express Endpoint IRQ 0
I tried to grep for mlx4 devices, but can't find any definition at all.
Any help is appreciated.
Thanks,
Bernd
On Thu, Nov 08, 2007 at 01:34:08PM -0800, Roland Dreier wrote:
we have a card here that is not supported by 2.6.22. I haven't tested
2.6.23 yet, but if I'm not mistaken, this PCI ID is not defined in 2.6.23 either.
09:00.0 InfiniBand: Mellanox Technologies Unknown device 634a (rev a0)
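The device ID in the lspci line above can be pulled out and compared against a driver's ID table. A minimal sketch; both helpers are hypothetical, and the table used in the test is illustrative, not the real mlx4_core table. Note that lspci prints "Unknown device" when its own pci.ids name database lacks an entry, which by itself says nothing about whether a kernel driver lists that ID.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: extract the 16-bit device ID from an lspci line
 * containing "Unknown device XXXX". Returns 1 on success, 0 otherwise. */
static int lspci_unknown_device_id(const char *line, unsigned int *id)
{
    static const char marker[] = "Unknown device ";
    const char *p = strstr(line, marker);

    if (p == NULL)
        return 0;
    return sscanf(p + strlen(marker), "%4x", id) == 1;
}

/* Hypothetical check of an ID against a caller-supplied driver ID table. */
static int id_in_table(unsigned int id, const unsigned int *table, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (table[i] == id)
            return 1;
    return 0;
}
```

A real check would take the driver's table from its source or from the alias: lines that modinfo prints for the module, rather than hard-coding it.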
Yesterday I also had this problem, can you try modprobe rdma_ucm
and/or modprobe rdma_cm?
Cheers,
Bernd
Hello Sasha,
On Thursday 23 August 2007 16:46:16 you wrote:
Hi,
On 15:46 Thu 23 Aug , Bernd Schubert wrote:
I can't make opensm use the UPDN algorithm. Presently we have 3
switches connected with each other, and one is supposed to be the root
switch.
I created an /etc/osm.root
to the next week.
Thanks for your help,
Bernd
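For the UPDN setup above, the root switches are given to opensm as GUIDs in a file such as /etc/osm.root. A minimal sketch of validating such a file line by line, assuming one hexadecimal GUID per line with or without a 0x prefix; parse_guid_line() is hypothetical, and its tolerance for surrounding blanks is a choice of this sketch, not necessarily opensm's own parsing. The GUID used in the test appears elsewhere in this thread and serves only as sample input.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical validator for one line of a root-GUID file.
 * Returns 1 and stores the GUID on success; 0 for blank or invalid lines. */
static int parse_guid_line(const char *line, uint64_t *guid)
{
    char *end;
    unsigned long long v;

    while (*line == ' ' || *line == '\t')
        line++;
    if (*line == '\0' || *line == '\n' || *line == '\r')
        return 0;                    /* blank line: nothing to parse */
    if (*line == '-' || *line == '+')
        return 0;                    /* strtoull would silently accept a sign */

    errno = 0;
    v = strtoull(line, &end, 16);    /* base 16 also accepts a 0x prefix */
    if (errno != 0 || end == line)
        return 0;                    /* overflow or no hex digits */

    while (*end == ' ' || *end == '\t' || *end == '\n' || *end == '\r')
        end++;
    if (*end != '\0')
        return 0;                    /* trailing garbage: reject, don't guess */

    *guid = (uint64_t)v;
    return 1;
}
```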
0002c9010c7341f0
I will post the patch shortly (I still need to add an option for a filename
other than /etc/ib.id). Anyway, would you already like anything to be
different when you look at the new output?
Cheers,
Bernd
On Friday 22 June 2007 20:57:49 you wrote:
Bernd Schubert wrote:
Hi,
there are patches to make rdma of ofed-1.1 compatible with 2.6.20
(https://svn.openfabrics.org/svn/openib/gen2/trunk/ofed/patches/user_fixes/librdmacm_to_2_6_20.patch
and perftest_to_2_6_20.patch).
The entire
`private_data_len': This is easy to fix.
Is there a more recent working version of the patch available, or can you
at least give me some hints on what to do with the rdma_get_option() call?
Thanks in advance,
Bernd
patch and a couple of hours of uptime, no warnings so far, so
I guess your patch fixed it.
If you don't hear anything from me by Thursday, it definitely fixed it.
Thanks,
Bernd
On Friday 27 April 2007 20:11:11 Ralph Campbell wrote:
This patch fixes the problem reported by Bernd Schubert [EMAIL PROTECTED]
with kernel debug options enabled.
BUG: at kernel/lockdep.c:1860 trace_hardirqs_on()
Hopefully, this can be included in OFED 1.2 as well as
going upstream
+0x37/0x51
[ 2651.410972] [80207cb8] default_idle+0x35/0x51
[ 2651.416824] [80207e4e] cpu_idle+0x5b/0x7a
[ 2651.422270] [8063566c] start_secondary+0x470/0x47f
[ 2651.428267]
Kernel is 2.6.20.4.
Any ideas?
Thanks,
Bernd
->s_cur_sge = qp->s_sge and qp->s_sge.sge = wqe->sg_list[0];
If I see it right, wqe->sg_list[0] or wqe->sg_list[0].vaddr is wrong, but so
far I haven't tracked down where this is set.
Any help to solve the problem is appreciated.
Thanks in advance,
Bernd