Re: [Lustre-discuss] socknal_sd00 100% lower?
On Fri, Mar 07, 2008 at 03:26:23PM -0700, Andreas Dilger wrote:
> Maxim, Isaac, what are your thoughts about disabling IRQ affinity by default? In the past this was important for maximizing performance with N CPUs and N ethernet NICs, but the CPUs have gotten much faster and have more cores, and I believe other customers have found better performance with irq_affinity disabled.

Agree, and with the ksocklnd bonding feature deprecated it's now more common to configure lnet with a single NIC. I've committed the change and filed a documentation bug to update the manuals accordingly.

Thanks,
Isaac
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] lustre dstat plugin
On Mar 9, 2008, at 10:03 PM, Aaron Knister wrote:
> Just wondering if either of you have used collectl, and if so which you prefer: dstat or collectl.

Never used it. Looks like they solve the same problem. I like dstat for the simple plugins (if you're a better Python programmer than me), and for how you can pull out results. For example, I use the following on our Lustre OSS with two OSTs, sda and sdb:

    dstat -D sda,sdb,total

That gives me per-disk stats and a total. Similar tools could be made for collectl, I'm sure.

Brock

> -Aaron
>
> On Mar 7, 2008, at 7:03 PM, Brock Palen wrote:
>> On Mar 7, 2008, at 6:58 PM, Kilian CAVALOTTI wrote:
>>> Hi Brock,
>>>
>>> On Wednesday 05 March 2008 05:21:51 pm Brock Palen wrote:
>>>> I have written a lustre dstat plugin. You can find it on my blog:
>>>
>>> That's cool! Very useful for my daily work, thanks!
>>
>> Thanks! It's the first Python I ever wrote.
>>
>>>> It only works on clients, and has not been tested on multiple mounts. It's very simple, it just reads /proc/
>>>
>>> It indeed doesn't read stats for multiple mounts. I slightly modified it so it can display read/write numbers for all the mounts it finds (see the attached patch).
>>
>> This is a great idea.
>>
>>> Here's typical output for an rsync transfer from scratch to home:
>>>
>>> -- 8< ---
>>> $ dstat -M lustre
>>> Module dstat_lustre is still experimental.
>>> --scratch-- ---home----
>>>  read write: read write
>>>  110M    0 :   0   110M
>>>  183M    0 :   0   183M
>>>  184M    0 :   0   184M
>>> -- 8< ---
>>>
>>> Maybe it could be useful to also add the other metrics from the stat file, but I'm not sure which ones would be the most relevant. And it would probably be wise to do that in a separate module, like lustre_stats, to avoid clutter.
>>
>> Yes, dstat comes with plugins for nfsv3 and has two modules, dstat_nfs3 and dstat_nfs3op, the latter with extended details. So I think it would be a good idea to follow that model.
>>
>>> Anyway, great job, and thanks for sharing it!
>>
>> Thanks again.
>>
>>> Cheers,
>>> --
>>> Kilian
>>> dstat_lustre.diff
>
> Aaron Knister
> Associate Systems Analyst
> Center for Ocean-Land-Atmosphere Studies
> (301) 595-7000
> [EMAIL PROTECTED]
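As an illustration of the "just reads /proc/" approach discussed above, here is a minimal Python sketch of how a plugin might pull byte totals out of a client stats file. It is not Brock's plugin or Kilian's patch. The line layout (`<name> <count> samples [bytes] <min> <max> <sum>`), the function name `parse_llite_stats` and the sample numbers are all assumptions made for the example.

```python
def parse_llite_stats(text):
    """Return {counter_name: byte_total} for the byte counters in a stats dump.

    Assumed line layout (hypothetical, not confirmed by the thread):
        read_bytes <count> samples [bytes] <min> <max> <sum>
    """
    totals = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep only counters whose unit column is [bytes]; the last column
        # is taken to be the running sum.
        if len(fields) >= 7 and fields[3] == "[bytes]":
            totals[fields[0]] = int(fields[-1])
    return totals


# Made-up sample text standing in for one mount's stats file.
SAMPLE = """\
snapshot_time 1205100000.000000 secs.usecs
read_bytes 110 samples [bytes] 4096 1048576 115343360
write_bytes 2 samples [bytes] 4096 8192 12288
open 12 samples [regs]
"""

stats = parse_llite_stats(SAMPLE)
print(stats)
```

Supporting several mounts, as in the patched plugin, would then amount to running the same parser once per mount directory and showing one read/write pair per mount.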
Re: [Lustre-discuss] yet another lustre error
On Mar 9, 2008, at 10:01 PM, Aaron Knister wrote:
> Hi! I have a few questions for you-
> 1. How many nodes was his job running on?

Around 64 serial jobs accessing the same directory (not the same files).

> 2. What version of lustre and linux kernel are you running on your servers/clients?

Lustre servers: 2.6.9-55.0.9.EL_lustre.1.6.4.1smp
Clients: 2.6.9-67.0.1.ELsmp

> 3. What ethernet module are you using on the servers/clients?

Most use the tg3, some use e1000.

> I honestly am not sure what the RPC errors mean but I've had similar issues caused by ethernet-level errors.

Over the weekend the MDS/MGS went into an unhealthy state, which forced a reboot+fsck. When it came back up the directory was accessible again and jobs started working again.

> -Aaron
>
> On Mar 7, 2008, at 6:45 PM, Brock Palen wrote:
>> On a file system that's been up for only 57 days, I have: 505 lustre-log. dumps.
>>
>> The problem at hand is that a user has many jobs that are now hung trying to create a directory from his pbs script. On the clients I see:
>>
>>     LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The mds_connect operation failed with -16
>>     LustreError: Skipped 2 previous similar messages
>>
>> This is on every client his jobs are on. In the most recent /tmp/lustre-log. on the MDS/MGS I see this message:
>>
>>     @@@ processing error (-16) [EMAIL PROTECTED] x12808293/t0 o38- [EMAIL PROTECTED]:-1 lens 304/200 ref 0 fl Interpret:/0/0 rc -16/0
>>     ldlm_lib.c target_handle_reconnect nobackup-MDT: 34b4fbea-200b-1f7c-dac0-516b8ce786fc reconnecting
>>     ldlm_lib.c target_handle_connect nobackup-MDT: refuse reconnection from 34b4fbea-200b-1f7c- [EMAIL PROTECTED]@tcp to 0x0100069a7000; still busy with 2 active RPCs
>>     ldlm_lib.c target_send_reply_msg @@@ processing error (-16) [EMAIL PROTECTED] x11199816/t0 o38- [EMAIL PROTECTED]:-1 lens 304/200 ref 0 fl Interpret:/0/0 rc -16/0
>>
>> I see messages about active RPCs in other logs. What would this mean? Is something stuck someplace?
>>
>> Brock Palen
>> Center for Advanced Computing
>> [EMAIL PROTECTED]
>> (734)936-1985
>
> Aaron Knister
> Associate Systems Analyst
> Center for Ocean-Land-Atmosphere Studies
> (301) 595-7000
> [EMAIL PROTECTED]
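For readers decoding the "-16" in the logs above: assuming these return codes are negated Linux errno values, which fits the "still busy with 2 active RPCs" wording in the MDS log, the number can be looked up with Python's standard `errno` module. The helper name `describe_rc` is made up for this sketch.

```python
import errno
import os


def describe_rc(rc):
    """Map a negative kernel-style return code such as -16 to its errno name."""
    code = abs(rc)
    # errno.errorcode maps numeric values to symbolic names on this platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


name, text = describe_rc(-16)
print(name, "-", text)
```

On Linux this reports EBUSY, so the client's failed mds_connect and the server's refused reconnection would be two views of the same condition rather than two separate faults.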
Re: [Lustre-discuss] Modprobe lustre fails
Isaac,

Thanks for the quick response. A quick Google search didn't tell me how I can check the module parameters. What command or file should I check for this?

And as you requested:

    [EMAIL PROTECTED] ~]# ls /lib/modules/2.6.18-53.1.14.el5.lustre/kernel/net/lustre
    ksocklnd.ko libcfs.ko lnet.ko lnet_selftest.ko
    [EMAIL PROTECTED] ~]# rpm -ql lustre-modules
    /lib/modules/2.6.18-53.1.14.el5.lustre/kernel/fs/lustre/llite_lloop.ko
    /lib/modules/2.6.18-53.1.14.el5.lustre/kernel/fs/lustre/lov.ko
    /lib/modules/2.6.18-53.1.14.el5.lustre/kernel/fs/lustre/lquota.ko
    /lib/modules/2.6.18-53.1.14.el5.lustre/kernel/fs/lustre/lustre.ko
    /lib/modules/2.6.18-53.1.14.el5.lustre/kernel/fs/lustre/lvfs.ko
    /lib/modules/2.6.18-53.1.14.el5.lustre/kernel/fs/lustre/mdc.ko
    /lib/modules/2.6.18-53.1.14.el5.lustre/kernel/fs/lustre/mgc.ko
    /lib/modules/2.6.18-53.1.14.el5.lustre/kernel/fs/lustre/obdclass.ko
    /lib/modules/2.6.18-53.1.14.el5.lustre/kernel/fs/lustre/obdecho.ko
    /lib/modules/2.6.18-53.1.14.el5.lustre/kernel/fs/lustre/osc.ko
    /lib/modules/2.6.18-53.1.14.el5.lustre/kernel/fs/lustre/ptlrpc.ko
    /lib/modules/2.6.18-53.1.14.el5.lustre/kernel/net/lustre/ksocklnd.ko
    /lib/modules/2.6.18-53.1.14.el5.lustre/kernel/net/lustre/libcfs.ko
    /lib/modules/2.6.18-53.1.14.el5.lustre/kernel/net/lustre/lnet.ko
    /lib/modules/2.6.18-53.1.14.el5.lustre/kernel/net/lustre/lnet_selftest.ko
    /usr/share/doc/lustre-modules-1.6.4.3
    /usr/share/doc/lustre-modules-1.6.4.3/COPYING

Thank you!

On Mon, Mar 10, 2008 at 11:02 AM, Isaac Huang [EMAIL PROTECTED] wrote:
> On Mon, Mar 10, 2008 at 10:04:50AM -0500, mitcheloc wrote:
>> [EMAIL PROTECTED] ~]# dmesg
>> Lustre: OBD class driver, [EMAIL PROTECTED]
>> Lustre Version: 1.6.4.3
>> Build Version: 1.6.4.3-1969123116-PRISTINE-.usr.src.linux-2.6.18-53.1.14.el5.lustre
>> Lustre: Added LNI [EMAIL PROTECTED] [8/256]
>> LustreError: 2359:0:(api-ni.c:1025:lnet_startup_lndnis()) Can't load LND elan, module kqswlnd, rc=256
>
> LNet couldn't load the driver module (kqswlnd) for elan. What are your lnet module parameters?
>
> Please also run:
>
>     ls /lib/modules/2.6.18-53.1.14.el5.lustre/kernel/net/lustre
>     rpm -ql lustre-modules
>
> Thanks,
> Isaac
>
>> Lustre: Removed LNI [EMAIL PROTECTED]
>> LustreError: 2359:0:(events.c:654:ptlrpc_init_portals()) network initialisation failed
Re: [Lustre-discuss] Modprobe lustre fails
On Mon, Mar 10, 2008 at 11:19:54AM -0500, mitcheloc wrote:
> Isaac, Thanks for the quick response. A quick google search didn't tell me how I can check the module parameters. What command or file should I check for this?

It should be in /etc/modprobe.conf or some file under /etc/modprobe.d. The exact location depends on your distribution. Look for a line that starts with "options lnet".

> And as you requested:
> [EMAIL PROTECTED] ~]# ls /lib/modules/2.6.18-53.1.14.el5.lustre/kernel/net/lustre
> ksocklnd.ko libcfs.ko lnet.ko lnet_selftest.ko

The kqswlnd.ko is missing.

Isaac
Re: [Lustre-discuss] Modprobe lustre fails
From modprobe.conf:

    options lnet networks=tcp0,elan0

Where should kqswlnd.ko be coming from?
Re: [Lustre-discuss] Modprobe lustre fails
On Mon, Mar 10, 2008 at 11:38:33AM -0500, mitcheloc wrote:
> From modprobe.conf: options lnet networks=tcp0,elan0

If you don't have Quadrics Elan hardware, you can change it to:

    options lnet networks=tcp0

Otherwise,

> Where should kqswlnd.ko be coming from?

you need to compile lustre with proper QsNet support.

Isaac
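The diagnosis in this exchange (a network named in "options lnet networks=" whose LND module is not installed) can be checked mechanically. The Python sketch below is only an illustration: the helper names are made up, and the lookup table holds just the two network-to-module pairs that appear in this thread (tcp uses ksocklnd, elan uses kqswlnd).

```python
import re

# Network type -> LND kernel module. Only the two pairs named in the thread;
# any other network type is simply not checked by this sketch.
LND_MODULES = {"tcp": "ksocklnd", "elan": "kqswlnd"}


def networks_from_options(conf_text):
    """Return the networks listed on the first 'options lnet' line, if any."""
    for line in conf_text.splitlines():
        m = re.match(r"\s*options\s+lnet\s+.*\bnetworks=(\S+)", line)
        if m:
            # Simplification: does not handle interface lists that
            # themselves contain commas.
            return m.group(1).strip("'\"").split(",")
    return []


def missing_lnds(conf_text, installed_modules):
    """List (network, module) pairs whose LND .ko file is not installed."""
    missing = []
    for net in networks_from_options(conf_text):
        # "elan0" -> "elan": drop any interface suffix and trailing digits.
        net_type = re.sub(r"\d+$", "", net.split("(")[0])
        module = LND_MODULES.get(net_type)
        if module and module + ".ko" not in installed_modules:
            missing.append((net, module))
    return missing


# File names taken from the ls output posted in the thread.
installed = ["ksocklnd.ko", "libcfs.ko", "lnet.ko", "lnet_selftest.ko"]
print(missing_lnds("options lnet networks=tcp0,elan0", installed))
```

With the file list from the thread, the original options line flags elan0/kqswlnd, while the suggested "networks=tcp0" line flags nothing.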
Re: [Lustre-discuss] Modprobe lustre fails
Isaac,

I checked my ethernet card and it didn't look like Quadrics hardware.

    [EMAIL PROTECTED] ~]# lspci | grep Ethernet
    00:19.0 Ethernet controller: Intel Corporation 82566DM Gigabit Network Connection (rev 02)

So I removed the parameter, rebooted and it worked like a charm! I wonder how that setting got into my modules.conf file. I checked on another CentOS system I set up and it is not there. It was probably inserted by some other DFS I was trying out.

After changing modules.conf and rebooting:

    [EMAIL PROTECTED] ~]# modprobe lustre
    [EMAIL PROTECTED] ~]# dmesg
    Lustre: OBD class driver, [EMAIL PROTECTED]
    Lustre Version: 1.6.4.3
    Build Version: 1.6.4.3-1969123116-PRISTINE-.usr.src.linux-2.6.18-53.1.14.el5.lustre
    Lustre: Added LNI [EMAIL PROTECTED] [8/256]
    Lustre: Accept secure, port 988
    Lustre: Lustre Client File System; [EMAIL PROTECTED]

Thanks hopefully I don't run into any other issues.

Cheers,
Mitchel
Re: [Lustre-discuss] Modprobe lustre fails
Hmm. I did run into this while trying llmount.sh.

    [EMAIL PROTECTED] tests]# pwd
    /usr/src/lustre-1.6.4.3/lustre/tests
    [EMAIL PROTECTED] tests]# sh llmount.sh
    Loading modules from /usr/src/lustre-1.6.4.3/lustre/tests/..
    lnet options: 'networks=tcp0'
    FATAL: Module mgs not found.
    [EMAIL PROTECTED] tests]# dmesg -c
    [EMAIL PROTECTED] tests]#

Does this mean I should add a ,mgs to networks=tcp0?
Re: [Lustre-discuss] Modprobe lustre fails
> Hmm. I did run into this while trying llmount.sh.
>
> [EMAIL PROTECTED] tests]# pwd
> /usr/src/lustre-1.6.4.3/lustre/tests
> [EMAIL PROTECTED] tests]# sh llmount.sh
> Loading modules from /usr/src/lustre-1.6.4.3/lustre/tests/..
> lnet options: 'networks=tcp0'
> FATAL: Module mgs not found.
> [EMAIL PROTECTED] tests]# dmesg -c
> [EMAIL PROTECTED] tests]#
>
> Does this mean I should add a ,mgs to networks=tcp0?

Can you verify that the mgs module exists? Run the commands Isaac mentioned:

    ls /lib/modules/2.6.18-53.1.14.el5.lustre/kernel/fs/lustre
    rpm -ql lustre-modules

If it does, please try to modprobe mgs manually to see whether any messages are displayed.

Jack
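The check Jack asks for can also be done against a saved file listing. The sketch below is a hypothetical Python helper (`module_names` is not a Lustre tool) run on an excerpt of the rpm -ql output posted earlier in the thread. That listing shows mgc.ko but no mgs.ko under fs/lustre, which would be consistent with the "Module mgs not found" message, though the thread itself does not confirm the cause.

```python
import posixpath


def module_names(paths):
    """Return the set of module names (without the .ko suffix) in a listing."""
    return {
        posixpath.basename(p)[:-3]
        for p in paths
        if p.endswith(".ko")
    }


# Excerpt of the rpm -ql output quoted in the thread.
LISTING = """\
/lib/modules/2.6.18-53.1.14.el5.lustre/kernel/fs/lustre/lustre.ko
/lib/modules/2.6.18-53.1.14.el5.lustre/kernel/fs/lustre/mdc.ko
/lib/modules/2.6.18-53.1.14.el5.lustre/kernel/fs/lustre/mgc.ko
/lib/modules/2.6.18-53.1.14.el5.lustre/kernel/net/lustre/lnet.ko
/usr/share/doc/lustre-modules-1.6.4.3/COPYING
""".splitlines()

present = module_names(LISTING)
print("mgs present:", "mgs" in present)
print("mgc present:", "mgc" in present)
```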
Re: [Lustre-discuss] Lustre SNMP module
Hi Klaus,

On Friday 07 March 2008 05:52:51 pm Klaus Steden wrote:
> I was asking that same question a few months ago.

Yes, I remember you weren't overwhelmed by answers. :\

> I can send you my 1.6.2 spec file for reference ... That version also did not bundle the SNMP library, so I ended up building it by recompiling the whole set of Lustre RPMs to get what I needed, and then just dropped the DSO in place.

That's exactly what I did, finally.

> I'm curious as to what metrics you see to be useful -- I wasn't sure what to look for, so while I installed the module, I haven't yet thought of good things to ask of it.

So, from what I've seen in the MIB, the current SNMP module mainly reports version numbers and free space information. I think it would also be useful to get activity metrics, the same kind of information which is in /proc/fs/lustre/llite/*/stats on clients (so we can see read/write and fs operation rates), in /proc/fs/lustre/obdfilter/*/stats on OSSes and in /proc/fs/lustre/mds/*/stats on MDSes.

Actually, all the /proc/fs/lustre/*/**/stats could be useful, but I guess which precise metric is the most useful depends heavily on what you want to see. :)

Cheers,
--
Kilian
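To make the idea of "operation rates" concrete, here is a small Python sketch that turns two snapshots of a stats file into per-second rates. It is not part of the SNMP module. The line layout (`<name> <count> samples ...`), the function names and the sample numbers are assumptions for illustration; a real collector would read the /proc paths mentioned above at a fixed interval.

```python
def op_counts(text):
    """Return {counter: sample_count}, assuming '<name> <count> samples ...' lines."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "samples":
            counts[fields[0]] = int(fields[1])
    return counts


def rates(before, after, seconds):
    """Per-second rate of each counter between two snapshots."""
    return {
        name: (after[name] - before.get(name, 0)) / seconds
        for name in after
    }


# Two made-up snapshots taken 10 seconds apart.
FIRST = "open 10 samples [regs]\nclose 8 samples [regs]\n"
SECOND = "open 40 samples [regs]\nclose 38 samples [regs]\n"

per_second = rates(op_counts(FIRST), op_counts(SECOND), 10)
print(per_second)
```

Because the counters only ever increase, exporting the raw counts and letting the monitoring side compute the difference would work equally well; the choice depends on where the sampling interval is known.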