Re: [Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
On Tue, 2008-03-04 at 15:55 -0500, Aaron S. Knister wrote:
> I think I tried that before and it didn't help, but I will try it again. Thanks for the suggestion.

Just so you guys know, 1000 seconds for the obd_timeout is very, very large! As you could probably guess, we have some very, very big Lustre installations and to the best of my knowledge none of them are using anywhere near that. AFAIK (and perhaps a Sun engineer with closer experience to some of these very large clusters might correct me) the largest value that the largest clusters are using is in the neighbourhood of 300s. There has to be some other problem at play here that you need 1000s.

Can you both please report your Lustre and kernel versions? I know you said "latest", Aaron, but some version numbers might be more solid to go on.

b.
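For reference, this is roughly how the obd timeout is inspected and changed on a 1.6 system. The commands are only a minimal sketch; the file system name "testfs" is an example, not something from this thread:

  cat /proc/sys/lustre/timeout              # current obd_timeout on a client or server
  echo 300 > /proc/sys/lustre/timeout       # try a more moderate value on every node before resorting to 1000s
  lctl conf_param testfs.sys.timeout=300    # make it persistent from the MGS (fsname is hypothetical)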
Re: [Lustre-discuss] ko2iblnd panics in kiblnd_map_tx_descs
Hi Chris,

To resolve your problem, please:
1. Apply this patch to your lnet: https://bugzilla.lustre.org/attachment.cgi?id=15733
2. Make sure you use this option when you configure: --with-o2ib=/path/to/ofed
3. Copy /path/to/ofed/Module.symvers to your $LUSTRE before building

Regards
Liang

Chris Worley wrote:
> I'm trying to port Lustre 1.6.4.2 to OFED 1.2.5.5 with the RHEL kernel 2.6.9-67.0.4. ksocklnd-based mounts work fine, but when I try to mount over IB, I get a panic in ko2iblnd in the transmit descriptor mapping routine:
>
> general protection fault: [1] SMP
> CPU 1
> Modules linked in: ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) nfs(U) lockd(U) nfs_acl(U) sunrpc(U) rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U) dm_mod(U) ib_ipoib(U) md5(U) ipv6(U) ib_umad(U) ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) aic79xx(U) e1000(U) ext3(U) jbd(U) raid0(U) mptscsih(U) mptsas(U) mptspi(U) mptscsi(U) mptbase(U) sd_mod(U) ata_piix(U) libata(U) scsi_mod(U)
> Pid: 5141, comm: modprobe Not tainted 2.6.9-67.0.4.EL-Lustre-1.6.4.2
> RIP: 0010:[a04659d1] a04659d1{:ko2iblnd:kiblnd_map_tx_descs+225}
> RSP: :0102105d7cd8 EFLAGS: 00010286
> RAX: a01e6b4e RBX: ff0010028000 RCX: 0001 RDX: 1000
> RSI: 01020e705000 RDI: 0102154e2000 RBP: 0102102c4200
> R08: R09: R10: R11: R12: R13: R14: R15: 0102102c4228
> FS: 002a958a0b00() GS:8046ac00() knlGS:
> CS: 0010 DS: ES: CR0: 8005003b
> CR2: 002a9598200f CR3: 9fa08000 CR4: 06e0
> Process modprobe (pid: 5141, threadinfo 0102105d6000, task 0102175e0030)
> Stack: 0102102c4080 0102102c4100 0102102c4200 0102179c2b86 0102177df400 010215548ac0 a0466fdf 0102179c2b85
> Call Trace: a0466fdf{:ko2iblnd:kiblnd_startup+2239} a03043dc{:lnet:lnet_startup_lndnis+332} a02d2f38{:libcfs:cfs_alloc+40} a0305206{:lnet:LNetNIInit+278} a03fcb0a{:ptlrpc:ptlrpc_ni_init+106} 8012f9cd{default_wake_function+0} a03fcbfa{:ptlrpc:ptlrpc_init_portals+10} 8012f9cd{default_wake_function+0} a045f22b{:ptlrpc:init_module+267} 8014bc0a{sys_init_module+278} 8010f23e{system_call+126}
> Code: ff 50 08 eb 12 48 8b 3f b9 01 00 00 00 ba 00 10 00 00 e8 30
> RIP a04659d1{:ko2iblnd:kiblnd_map_tx_descs+225} RSP 0102105d7cd8
>
> Does this ring any bells? Otherwise, any debugging tips? Shane said that they get an oops if they compile with the version-specific OFA tree. Is this the Oops?
>
> Thanks,
> Chris
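A rough command-line sketch of those three steps. The patch file name, the OFED install path, and the kernel source path below are placeholders (and the patch -p level may differ), so adjust them for the local tree:

  cd $LUSTRE
  # step 1: apply the lnet patch (file name here is hypothetical)
  wget -O lnet-o2ib.patch "https://bugzilla.lustre.org/attachment.cgi?id=15733"
  patch -p0 < lnet-o2ib.patch
  # step 3: copy the symbol versions from the OFED build before building Lustre
  cp /path/to/ofed/Module.symvers $LUSTRE/
  # step 2: point configure at the OFED tree (and at the patched kernel source)
  ./configure --with-linux=/path/to/kernel-source --with-o2ib=/path/to/ofed
  make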
Re: [Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
Sure, we will provide you with more details of our installation, but let me first say that, if recollection serves, we did not pull that number out of a hat. I believe there is a formula in one of the Lustre tuning manuals for calculating the recommended timeout value. I'll have to take a moment to go back and find it. Anyway, if you use that formula for our cluster, the recommended timeout value, I think, comes out to be *much* larger than 1000. Later this morning we will go back, find that formula, and share with the list how we came up with our timeout. Perhaps you can show us where we are going wrong.

One more comment: we just brought up our second large Lustre file system. It is 80+ TB served by 24 OSTs on two (pretty beefy) OSSs. We just achieved over 2 GB/sec of sustained (large-block, sequential) I/O from an aggregate of 20 clients. Our design target was 1.0 GB/sec/OSS and we hit that pretty comfortably.

That said, when we first mounted the new (1.6.4.2) file system across all 400 nodes in our cluster, we immediately started getting transport endpoint failures and evictions. We looked rather intensively for network/fabric problems (we have both o2ib and tcp nids) and could find none. All of our MPI apps are/were running just fine. The only way we could get rid of the evictions and transport endpoint failures was by increasing the timeout. Also, we knew to do this based on our experience with our first Lustre file system (1.6.3 + patches), where we had to do the same thing.

Like I said, a little bit later Craig or I will post more details about our implementation. If we are doing something wrong with regard to this timeout business, I would love to know what it is.

Thanks,

Charlie Taylor
UF HPC Center

On Mar 4, 2008, at 4:04 PM, Brian J. Murrell wrote:
> On Tue, 2008-03-04 at 15:55 -0500, Aaron S. Knister wrote:
>> I think I tried that before and it didn't help, but I will try it again. Thanks for the suggestion.
> Just so you guys know, 1000 seconds for the obd_timeout is very, very large! As you could probably guess, we have some very, very big Lustre installations and to the best of my knowledge none of them are using anywhere near that. AFAIK (and perhaps a Sun engineer with closer experience to some of these very large clusters might correct me) the largest value that the largest clusters are using is in the neighbourhood of 300s. There has to be some other problem at play here that you need 1000s. Can you both please report your lustre and kernel versions? I know you said latest Aaron, but some version numbers might be more solid to go on.
> b.
Re: [Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
Well, go figure. We are running...

Lustre: 1.6.4.2 on clients and servers
Kernel: 2.6.18-8.1.14.el5Lustre (clients and servers)
Platform: x86_64 (Opteron 275s, mostly)
Interconnect: IB, Ethernet
IB Stack: OFED 1.2

We already posted our procedure for patching the kernel, building OFED, and building Lustre, so I don't think I'll go into that again. Like I said, we just brought a new file system online. Everything looked fine at first with just a few clients mounted. Once we mounted all 408 (or so), we started getting all kinds of transport endpoint failures, and the MGSs and OSTs were evicting clients left and right. We looked for network problems and could not find any of any substance. Once we increased the obd/lustre/system timeout setting as previously discussed, the errors vanished. This was consistent with our experience with 1.6.3 as well. That file system has been online since early December. Both file systems appear to be working well.

I'm not sure what to make of it. Perhaps we are just masking another problem. Perhaps there are some other, related values that need to be tuned. We've done the best we could, but I'm sure there is still much about Lustre we don't know. We'll try to get someone out to the next class, but until then, we're on our own, so to speak.

Charlie Taylor
UF HPC Center

>> Just so you guys know, 1000 seconds for the obd_timeout is very, very large! As you could probably guess, we have some very, very big Lustre installations and to the best of my knowledge none of them are using anywhere near that. AFAIK (and perhaps a Sun engineer with closer experience to some of these very large clusters might correct me) the largest value that the largest clusters are using is in the neighbourhood of 300s. There has to be some other problem at play here that you need 1000s.
>
> I can confirm that at a recent large installation with several thousand clients, the default of 100 is in effect.
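For anyone gathering the version numbers Brian asked for, they are typically collected per node with something along these lines (a minimal sketch; run on both clients and servers):

  uname -r                        # kernel version
  cat /proc/fs/lustre/version     # Lustre version string on a node with the modules loaded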
[Lustre-discuss] Lustre MPI-IO performance on CNL
Hi,

The I/O performance of CNL (as measured with IOR) seems quite different for a shared file compared to the same test with separate files. Here are some numbers on a smaller file system on an XT system at ORNL. All files are striped to 72 OSTs. I deliberately use a block size of 8512m.

1. sample tests with separate files
# aprun -n 32 -N 1 ~/benchmarks/IOR-2.9.1/src/C/IOR -a MPIIO -b 8512m -t 64m -d 1 -i 2 -w -r -g -F -o iortes
Max Write: 9978.18 MiB/sec (10462.88 MB/sec)
Max Read: 5612.78 MiB/sec (5885.43 MB/sec)

2. sample shared-file performance
# aprun -n 32 -N 1 ~/benchmarks/IOR-2.9.1/src/C/IOR -a MPIIO -b 8512m -t 64m -d 1 -i 2 -w -r -g -o iortes
Max Write: 6817.31 MiB/sec (7148.47 MB/sec)
Max Read: 5591.98 MiB/sec (5863.62 MB/sec)

In addition, using my experimental MPI-IO library, I noticed that enabling direct I/O can have various effects for I/O on CNL.

3. sample separate files with direct I/O
# export MPIO_DIRECT_WRITE=true; export MPIO_DIRECT_READ=true; aprun -n 32 -N 1 ~/benchmarks/IOR-2.10.1/src/C/IOR -a MPIIO -b 8512m -t 64m -d 1 -i 2 -w -r -g -F -k -o lustre:iortest
Max Write: 9353.66 MiB/sec (9808.03 MB/sec)
Max Read: 8269.28 MiB/sec (8670.97 MB/sec)

4. sample shared-file performance with direct I/O
# export MPIO_DIRECT_WRITE=true; export MPIO_DIRECT_READ=true; aprun -n 32 -N 1 ~/benchmarks/IOR-2.10.1/src/C/IOR -a MPIIO -b 8512m -t 64m -d 1 -i 2 -w -r -g -k -o lustre:iortes
Max Write: 9484.11 MiB/sec (9944.81 MB/sec)
Max Read: 7929.63 MiB/sec (8314.81 MB/sec)

It seems direct I/O helps quite a bit with the performance of parallel reads, but not with writes. The shared-file mode appears to benefit more from direct writes. While it is understandable that the client cache can play a big role here, I am not sure why it would help the shared-file mode so much more. Can anybody help with some explanations of the comparison between reads and writes, and likewise for shared-file versus separate files? Also let me know if I am not clear in my descriptions.

--
Weikuan Yu
+1-865-574-7990
http://ft.ornl.gov/~wyu/

P.S.: The numbers shown are the good numbers from several runs, so you may consider them consistent results.
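For context, a 72-OST layout like this is normally set on the output directory before the run and can be checked on the resulting file afterwards. A minimal sketch, using the old positional setstripe syntax (<dir> <stripe_size> <stripe_start> <stripe_count>); the directory path and the 4 MB stripe size are assumptions, not values from this post:

  lfs setstripe /lustre/scratch/iodir 4194304 -1 72    # 4 MB stripes over 72 OSTs
  lfs getstripe /lustre/scratch/iodir/iortest          # verify the layout IOR actually got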
Re: [Lustre-discuss] lustre official flock support
Hello!

On Mar 5, 2008, at 11:33 AM, Joe Barjo wrote:
> While making my tests, I saw that the flock system call was not working. Googling around, I found the flock option for the mount command, and it seems to work just fine. However, I've read in the documentation that flock will only be supported in the 1.8 version of Lustre. What is the current status of this? Is flock usable in production for 1.6.4.2?

flock has a major flaw in the sense that it is not fd-attached, so once you open a file, take an flock lock, fork, and try to release the lock from the child, the lock won't actually go away. POSIX locking (through fcntl), on the other hand, should work just fine.

Note that right now there are some assertions in the code that will kill the client if you issue a locking call with unknown parameters (like the command); I think Samba does that. That code needs to be changed to just return an error (in ll_file_flock, 2 occurrences). There is a separate patch somewhere in bugzilla, but I cannot find it immediately, and it would be included with some changes I am preparing anyway.

Bye,
Oleg
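As Joe found, flock support is switched on per mount with the flock mount option. A minimal sketch of such a mount; the MGS nid and file system name here are placeholders:

  mount -t lustre -o flock mgs01@o2ib0:/testfs /mnt/testfs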
Re: [Lustre-discuss] Installing Lustre on PowerPC (IBM pSeries)
Hello!

On Mar 4, 2008, at 4:44 AM, gas5x1 wrote:
> Could you please advise me how, if at all possible, to install Lustre on IBM PPC64? I already have a Lustre 1.6 installation working for Intel i386 and AMD Opteron nodes, and now I would like to access it from IBM clients.

You just compile as normal and it should technically work. For the missing segment.h problem you saw earlier, please apply the patch from https://bugzilla.lustre.org/show_bug.cgi?id=14844

Bye,
Oleg
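A sketch of a client-only build on the ppc64 node, assuming the patch from bug 14844 has been saved locally. The patch file name and kernel source path are placeholders, and --disable-server (client-only build) is assumed to be available in this release:

  cd lustre-1.6.4.2
  patch -p0 < bug14844-segment.patch     # attachment from the bugzilla entry above
  ./configure --with-linux=/usr/src/linux --disable-server
  make && make install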
Re: [Lustre-discuss] Lustre MPI-IO performance on CNL
Marty Barnaby wrote:
> My, perhaps, misunderstanding was that a Lustre FS had a maximum lfs stripe-count of 160. Is this not a constant set in the LFS, but just some local configuration? Could you be more specific about the actual lfs stripe-count of the file or files you wrote?

You're right about the maximum stripe count of 160; 72 is simply the local value I chose for this test. The stripe count can have an effect, but probably only a small one, on the relative comparison between the runs with and without direct I/O.

--Weikuan
Re: [Lustre-discuss] Lustre MPI-IO performance on CNL
WangDi wrote:
> What is the stripe_size of this test? 4M? If it is 4M, then the transfer_size is a little bigger (64M). And we have seen this situation before; finally it seems to be because the client holds too many locks in each write (because of Lustre's down-forward extent lock policy), which can block other clients' writes, and so hurts the parallelism of the whole system. Maybe you could try decreasing the transfer size to the stripe_size, or increasing the stripe_size to 64M, and see how it goes?

Yes, the situation with shared file versus separate files has been seen before, but I have never seen an explanation specific to CNL. BTW, this performance difference between shared/separate stays the same regardless of the transfer size. Anybody want to post a reason regarding direct I/O too?

--Weikuan
Re: [Lustre-discuss] Lustre MPI-IO performance on CNL
Hi,

Weikuan Yu wrote:
> The I/O performance of CNL (as measured with IOR) seems quite different for a shared file compared to the same test with separate files. Here are some numbers on a smaller file system on an XT system at ORNL. All files are striped to 72 OSTs. I deliberately use a block size of 8512m.
> In addition, using my experimental MPI-IO library, I noticed that enabling direct I/O can have various effects for I/O on CNL.

What is the stripe_size of this test? 4M? If it is 4M, then the transfer_size is a little bigger (64M). And we have seen this situation before; finally it seems to be because the client holds too many locks in each write (because of Lustre's down-forward extent lock policy), which can block other clients' writes, and so hurts the parallelism of the whole system. Maybe you could try decreasing the transfer size to the stripe_size, or increasing the stripe_size to 64M, and see how it goes?

Thanks
WangDi
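A quick sketch of the second suggestion (making stripe_size match the 64m transfer size); the directory path is a placeholder and the positional setstripe arguments are <dir> <stripe_size> <stripe_start> <stripe_count>:

  lfs setstripe /lustre/scratch/iodir64 67108864 -1 72    # 64 MB stripes over 72 OSTs
  aprun -n 32 -N 1 ~/benchmarks/IOR-2.9.1/src/C/IOR -a MPIIO -b 8512m -t 64m -d 1 -i 2 -w -r -g -o /lustre/scratch/iodir64/iortest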
Re: [Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
On Wed, 2008-03-05 at 13:37 -0500, Aaron Knister wrote:
Could you tell me what version of OFED was being used? Was it the version that ships with the kernel?

OFED version is 1.2.5.4

-Aaron

On Mar 5, 2008, at 11:33 AM, Frank Leers wrote:
On Wed, 2008-03-05 at 11:08 -0500, Aaron Knister wrote:
That's very strange. What interconnect is that site using?

Not really strange, but - SDR IB/OFED, lustre 1.6.4.2, 2.6.18.8 clients, 2.6.9-55.0.9 servers

My versions are -
Lustre - 1.6.4.2
Kernel (servers) - 2.6.18-8.1.14.el5_lustre.1.6.4.2smp
Kernel (clients) - 2.6.18-53.1.13.el5

On Mar 5, 2008, at 11:03 AM, Frank Leers wrote:
On Tue, 2008-03-04 at 22:04 +0100, Brian J. Murrell wrote:
On Tue, 2008-03-04 at 15:55 -0500, Aaron S. Knister wrote:
I think I tried that before and it didn't help, but I will try it again. Thanks for the suggestion.

Just so you guys know, 1000 seconds for the obd_timeout is very, very large! As you could probably guess, we have some very, very big Lustre installations and to the best of my knowledge none of them are using anywhere near that. AFAIK (and perhaps a Sun engineer with closer experience to some of these very large clusters might correct me) the largest value that the largest clusters are using is in the neighbourhood of 300s. There has to be some other problem at play here that you need 1000s.

I can confirm that at a recent large installation with several thousand clients, the default of 100 is in effect.

Can you both please report your lustre and kernel versions? I know you said latest Aaron, but some version numbers might be more solid to go on.

b.

Aaron Knister
Associate Systems Analyst
Center for Ocean-Land-Atmosphere Studies
(301) 595-7000
[EMAIL PROTECTED]
Re: [Lustre-discuss] Cannot send after transport endpoint shutdown (-108)
Are the clients SuSE, Red Hat, or another distro? I can't get OFED 1.2.5.4 to build with RHEL5, but I'm working on that.

On Mar 5, 2008, at 2:03 PM, Frank Leers wrote:
> On Wed, 2008-03-05 at 13:37 -0500, Aaron Knister wrote:
>> Could you tell me what version of OFED was being used? Was it the version that ships with the kernel?
>
> OFED version is 1.2.5.4

Aaron Knister
Associate Systems Analyst
Center for Ocean-Land-Atmosphere Studies
(301) 595-7000
[EMAIL PROTECTED]
[Lustre-discuss] lustre dstat plugin
I have written a Lustre dstat plugin. You can find it on my blog: http://www.mlds-networks.com/index.php/component/option,com_mojo/Itemid,29/p,31/

It only works on clients, and has not been tested with multiple mounts. It's very simple; it just reads /proc/.

Example: dstat -a -M lustre

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system-- lustre-1.6-
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw | read  writ
 23  53   1  21   0   0|   0     0 |3340k 4383k|   0     0 |3476   198 | 16M   22M
 13  69  16   2   0   1|   0     0 |1586k   16M|   0     0 |3523   424 | 24M   14M
 69  30   0   0   0   1|   0  8192B|1029k   18M|   0     0 |3029    88 |   0     0

Patches/comments welcome.

Brock Palen
Center for Advanced Computing
[EMAIL PROTECTED]
(734) 936-1985
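For anyone curious what the plugin is presumably reading, the per-mount client I/O counters on a 1.6 client live under /proc/fs/lustre/llite/. A quick way to eyeball the same numbers by hand (the wildcard path is a sketch and assumes at least one mounted file system):

  grep -E 'read_bytes|write_bytes' /proc/fs/lustre/llite/*/stats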