Re: [lustre-discuss] (LFSCK) LBUG: ASSERTION( get_current()->journal_info == ((void *)0) ) failed
Hi Cédric,

I'm by no means familiar with the Lustre code anymore, but based on the stack trace and function names, it seems to be a problem with the journal. Maybe try an 'e2fsck -f', which would replay the journal and possibly clean up the file it has a problem with.

Cheers,
Bernd

On Wednesday, September 14, 2016 9:28:38 AM CEST Cédric Dufour - Idiap Research Institute wrote:
> Hello,
>
> Last Friday, during normal operations, our MDS froze with the following
> LBUG, which happens again as soon as one mounts the MDT again:
>
> Sep 13 15:10:28 n00a kernel: [ 8414.600584] LustreError: 11696:0:(osd_handler.c:936:osd_trans_start()) ASSERTION( get_current()->journal_info == ((void *)0) ) failed:
> Sep 13 15:10:28 n00a kernel: [ 8414.612825] LustreError: 11696:0:(osd_handler.c:936:osd_trans_start()) LBUG
> Sep 13 15:10:28 n00a kernel: [ 8414.619833] Pid: 11696, comm: lfsck
> Sep 13 15:10:28 n00a kernel: [ 8414.619835] Call Trace:
> Sep 13 15:10:28 n00a kernel: [ 8414.619850] [] libcfs_debug_dumpstack+0x52/0x80 [libcfs]
> Sep 13 15:10:28 n00a kernel: [ 8414.619857] [] lbug_with_loc+0x42/0xa0 [libcfs]
> Sep 13 15:10:28 n00a kernel: [ 8414.619864] [] osd_trans_start+0x250/0x630 [osd_ldiskfs]
> Sep 13 15:10:28 n00a kernel: [ 8414.619870] [] ? osd_declare_xattr_set+0x58/0x230 [osd_ldiskfs]
> Sep 13 15:10:28 n00a kernel: [ 8414.619876] [] lod_trans_start+0x177/0x200 [lod]
> Sep 13 15:10:28 n00a kernel: [ 8414.619881] [] lfsck_namespace_double_scan+0x1122/0x1e50 [lfsck]
> Sep 13 15:10:28 n00a kernel: [ 8414.619888] [] ? thread_return+0x3e/0x10c
> Sep 13 15:10:28 n00a kernel: [ 8414.619894] [] ? enqueue_task_fair+0x58/0x5d
> Sep 13 15:10:28 n00a kernel: [ 8414.619899] [] lfsck_double_scan+0x5a/0x70 [lfsck]
> Sep 13 15:10:28 n00a kernel: [ 8414.619904] [] lfsck_master_engine+0x50d/0x650 [lfsck]
> Sep 13 15:10:28 n00a kernel: [ 8414.619909] [] ?
> lfsck_master_engine+0x0/0x650 [lfsck]
> Sep 13 15:10:28 n00a kernel: [ 8414.619915] [] kthread+0x7b/0x83
> Sep 13 15:10:28 n00a kernel: [ 8414.619918] [] ? finish_task_switch+0x48/0xb9
> Sep 13 15:10:28 n00a kernel: [ 8414.619924] [] child_rip+0xa/0x20
> Sep 13 15:10:28 n00a kernel: [ 8414.619928] [] ? kthread+0x0/0x83
> Sep 13 15:10:28 n00a kernel: [ 8414.619931] [] ? child_rip+0x0/0x20
>
> I originally had the LFSCK launched in "dry-run" mode:
>
> lctl lfsck_start --device lustre-1-MDT --dryrun on --type namespace
>
> The LFSCK was reported completed (I was 'watch[ing] -n 1' on a terminal)
> before the LBUG popped up; now, I can't even get any output:
>
> cat /proc/fs/lustre/mdd/lustre-1-MDT/lfsck_namespace # just hangs
> there indefinitely
>
> I remember seeing a lfsck_namespace file in the MDT's underlying LDISKFS;
> is there anything sensible I can do with it (e.g. would deleting it
> solve the situation)?
> What else could I do?
>
> Thanks for your answers and best regards,
>
> Cédric D.
>
> PS: I had this message originally posted on the HPDD-discuss mailing list
> and just realized it was the wrong place; sorry for any crossposting
>
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
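For readers landing on this thread: a minimal sketch of the forced check suggested above. The device path is purely hypothetical (the thread does not name the MDT device); never run this on a mounted MDT, and take a device-level backup first.

```shell
# Hypothetical MDT block device; adjust to your setup.
MDT_DEV=/dev/mapper/mdt_device

# Read-only trial pass first, to see what e2fsck would change:
e2fsck -fn "$MDT_DEV"

# If that output looks sane, the real forced check, which replays the
# journal and repairs inconsistencies:
e2fsck -f "$MDT_DEV"
```

Only after a clean check would one try mounting the MDT again.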
Re: [Lustre-discuss] Patchless kernel support?
Hi all,

I think Ashley means patchless server support. That is already tracked in bug #21524.

Ashley, while patchless server support certainly is a good idea, it might not always be as helpful as you believe. Updating the presently existing patches is usually rather straightforward. Far more difficult is when the VFS changes and new methods and configure checks have to be implemented in Lustre. That is what made it so difficult to update Lustre to 2.6.24, and now again the limit has been 2.6.32 (maybe the VFS changes had already gone in before, but I didn't track linux-git that closely recently). And those changes in the VFS are often also completely unrelated to kernel patches...

I also planned to work on the sd_iostats. But I think that patch simply should be dropped in favour of blktrace. The current blkiomon does almost the same as sd_iostats, but IMHO neither approach is really helpful. So I have a modified blkiomon version (not ready for patch submission yet) that produces stats similar to those of the DDN S2A controllers, and IMHO only such detailed stats are really helpful for analyzing IO patterns. If you ask me, neither sd_iostats, nor DDN SFA, nor upstream blkiomon has sufficiently detailed information to see where the problem is. I understand that blktrace has some overhead compared to sd_iostats. However, if sd_iostats is ever supposed to land upstream, it needs to be rewritten from procfs to debugfs; I think even sysfs is not suitable for it.

Cheers,
Bernd

On Thursday, November 25, 2010, Alexey Lyashkov wrote:
> Ashley,
> I don't clearly understand what you want. If you mean patchless support on the client: the typical size of adding support for one new kernel to the patchless client is ~40kB of patch for Lustre. Sometimes it is more work, sometimes less. As the last kernel supported by Lustre is 2.6.32, you should plan on a ~150kB patch for 2.6.37 kernel support.
> If you mean patchless kernel support: yes, that is possible, but it needs more work and submitting lots of patches to the upstream kernel.
>
> On Nov 25, 2010, at 15:18, Ashley Pittman wrote:
>> Picking up from something that was said at SC last week, I believe it was Andreas who mentioned the possibility of patch-less kernel support. This is something that would be immensely useful to us for a variety of reasons. Has there been any recent work into investigating how much work would be involved in implementing this, and what's the feeling on whether it could be done through changes to Lustre only, or is it a case of submitting a number of patches upstream?
>>
>> Ashley.

-- 
Bernd Schubert
DataDirect Networks

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] NFS problem after upgrade to 1.8.3
Hello Tina,

On Friday, November 12, 2010, Tina Friedrich wrote:
> Hello List,
> we re-export the file system via NFS for a couple of things. All the
> re-exporters are Red Hat 5.5 servers running kernel 2.6.18-194.17.1.el5
> (patchless clients).

That is your problem. You MUST use a patched version, or at least a kernel with an 8kB stack size. RHEL5 has 4kB by default, which is not sufficient, and therefore in early 1.8 versions a patch landed that disallowed NFS exports.

Cheers,
Bernd

-- 
Bernd Schubert
DataDirect Networks

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
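For what it's worth, here is a quick sketch (mine, not from the thread) of how to check a client kernel for the stack-size problem. It assumes a standard /boot layout; note that CONFIG_4KSTACKS exists only on 32-bit x86 kernels, and x86_64 kernels already use 8kB stacks.

```shell
# If this prints CONFIG_4KSTACKS=y, the running kernel uses 4kB stacks
# and is unsafe for NFS re-export of Lustre. If the option is absent
# (e.g. on x86_64), the kernel already has 8kB stacks.
grep CONFIG_4KSTACKS "/boot/config-$(uname -r)" 2>/dev/null \
    || echo "CONFIG_4KSTACKS not set"
```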
Re: [Lustre-discuss] NFS problem after upgrade to 1.8.3
Hello Tina,

On 11/12/2010 03:44 PM, Tina Friedrich wrote:
> Hello again,
> nope, running with / exporting from a server with the patched kernel
> running does not change this behaviour at all. mountvers=3 works, 1 and
> 2 don't.

I can reproduce it, so NFSv2 support got broken. Which issue has higher priority, tar or NFSv2?

Cheers,
Bernd

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Serious error: objid already exists; is this filesystem corrupt?
Hello Christopher, hello Alex,

the alternative is to let e2fsck correct LAST_ID. Patches are here:
https://bugzilla.lustre.org/show_bug.cgi?id=22734
and included in our e2fsprogs releases:
http://eu.ddn.com:8080/lustre/lustre/RHEL5/tools/e2fsprogs/
Unfortunately, the patches are not yet in the Oracle e2fsprogs version.

In order to let e2fsck correct it, you will need to create an mdsdb file (the hdr part is sufficient) and then run:

e2fsck --mdsdb mdsdb.hdr --ostdb some_irrelevant_file /dev/device

The procedure is similar to the lfsck preparations, although one usually runs that with -n. To let e2fsck (pass6, the db part) correct the LAST_ID, it must *not* run in read-only mode, though.

Cheers,
Bernd

On Thursday, November 04, 2010, Alexey Lyashkov wrote:
> Hi Christopher,
> you need to kill the lov_objid file on the MDS and set LAST_ID on the OST to 870397. In that case the MDS will reread the last_id from the OSTs and refill the lov_objid file, to avoid possible file corruption.
>
> On Nov 4, 2010, at 04:22, Christopher Walker wrote:
>> We recently had a hardware failure on one of our OSTs, which has caused some major problems for our 1.6.6-based array. We're now getting the error:
>>
>> Serious error: objid 517386 already exists; is this filesystem corrupt?
>>
>> on one of our OSTs. If I mount this OST as ldiskfs and look in O/0/d*, the highest objid I see is 870397, considerably higher than 517386. We've taken this OST through a round of e2fsck and ll_recover_lost_found_objs, during which it restored a lot of lost files, and e2fsck on this OST and on the MDT don't currently show any problems.
>>
>> Can I simply edit O/0/LAST_ID, set it to 870397, and expect files with objids between 517386 and 870397 to come back? Also, I could be wrong, but it looks like ll_recover_lost_found_objs.c only looks for lost files up to LAST_ID; if I reset LAST_ID to 870397, should I rerun ll_recover_lost_found_objs?
>> Many thanks in advance,
>> Chris

-- 
Bernd Schubert
DataDirect Networks

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
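As a sketch of the two-step procedure Bernd describes (device paths below are placeholders, and this assumes the patched DDN e2fsprogs linked above):

```shell
# Step 1: build the MDS database in a read-only pass over the MDT; only
# the header part of the resulting file is needed for the LAST_ID fix.
e2fsck -n -v --mdsdb /tmp/mdsdb /dev/mdt_device

# Step 2: check the OST against that database. Note: no -n here,
# otherwise pass6 (the db part) cannot correct LAST_ID.
e2fsck -v --mdsdb /tmp/mdsdb.hdr --ostdb /tmp/ostdb /dev/ost_device
```

Both targets must be unmounted while the checks run.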
Re: [Lustre-discuss] recovering formatted OST
Hello Wojciech,

I think both would work, but why not just create a small OST with mkfs.lustre on a loopback device, and then copy those files over to your recovered filesystem? Hmm, well, e2fsck might not have fixed all issues, and then a reformat indeed might be helpful. Also note: EAs on OST objects are nice to have, but not absolutely required.

Cheers,
Bernd

On Tuesday, October 26, 2010, Wojciech Turek wrote:
> Bernd,
> I would like to clarify if I understood your suggestion correctly:
> 1) create a new OST but using the old index and old label
> 2) mount it as ldiskfs and copy the recovered objects (using tar or rsync with xattrs support) from the old OST to the new OST
> 3) run --writeconf on the MDT and OSTs of that filesystem
> 4) mount the MDT and all OSTs
>
> I guess I could also do it this way:
> 1) back up the restored objects using tar or rsync with xattrs support
> 2) format the old OST with the old index and old label
> 3) restore the objects from the backup
>
> Do you think that would work?
> Best regards,
> Wojciech
>
> On 22 October 2010 18:52, Bernd Schubert bernd.schub...@fastmail.fm wrote:
>> Hmm, I would probably format a small fake device on a ramdisk and copy the files over, run tunefs --writeconf /mdt and then start everything (including all OSTs) again.
>> Cheers,
>>
>> On Friday, October 22, 2010, Wojciech Turek wrote:
>>> I have tried Bernd's suggestion and it seems to have worked; after running e2fsck -D, ll_recover_lost_found_objs didn't cause a kernel panic but moved a number of objects to the O directory. The problem is that I do not have the last_rcvd file, so the OST has no index at the moment. What would be the next step to enable access to those files in the filesystem?
>>> Best regards,
>>> Wojciech
>>>
>>> On 22 October 2010 17:15, Andreas Dilger andreas.dil...@oracle.com wrote:
>>>> On 2010-10-22, at 5:42, Bernd Schubert bernd.schub...@fastmail.fm wrote:
>>>>> Hmm, e2fsck didn't catch that? rec_len is the length of a directory entry, i.e. after how many bytes the next entry follows.
>>>> I agree that e2fsck should have caught that.
You can try to force e2fsck to do something about that: e2fsck -D No, I would recommend against using -D at this point. That will cause it to re-write the directory contents, and given that the filesystem was previously corrupted I would prefer making as few changes as possible before the data is estranged. Wojciech, note that if you are able to mount the filesystem you could just copy all of the objects (with xattrs!) from lost+found on the bad filesystem, along with the last_rcvd file (if you can find it) into a new ldiskfs filesystem and then run ll_recover_lost_found_objs on that. On Friday, October 22, 2010, Wojciech Turek wrote: Ok, removing and recreating the journal fixed that problem and I am able to mount device as ldiskfs filesystem. Now I hit another wall when trying to run ll_recover_lost_found_objs When I first time run ll_recover_lost_found_objs -d /mnt/ost/lost+found it only creates the O dir and exits. When I repeat this command again kernel panics. Any idea what could be the problem here? LDISKFS-fs error (device dm-4): ldiskfs_readdir: bad entry in directory #6831: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0 Aborting journal on device dm-4. 
Unable to handle kernel NULL pointer dereference at RIP: [88033448] :jbd:journal_commit_transaction+0xc5b/0x12db PGD 1a118d067 PUD 1ce7e7067 PMD 0 Oops: 0002 [1] SMP last sysfs file: /class/infiniband_mad/umad0/port CPU 3 Modules linked in: ldiskfs(U) crc16(U) autofs4(U) hidp(U) l2cap(U) bluetooth(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U) mlx4_vnic(U) mlx4_vnic_helper(U) ib_sa(U) ib_mthca(U) mptctl(U) dm_mirror(U) video(U) backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) i2c_core(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sr_mod(U) cdrom(U) mlx4_ib(U) ib_mad(U) ib_core(U) joydev(U) mlx4_core(U) usb_storage(U) pcspkr(U) shpchp(U) serio_raw(U) i5000_edac(U) edac_mc(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) sunrpc(U) mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) mppVhba(U) megaraid_sas(U) mppUpper(U) sg(U) sd_mod(U) scsi_mod(U) bnx2(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U) Pid: 11360, comm
Re: [Lustre-discuss] recovering formatted OST
On Tuesday, October 26, 2010, Wojciech Turek wrote:
> Hi,
> There is a LAST_ID file on the OST and indeed it equals the highest object number:
>
> [r...@oss09 ~]# od -Ax -td8 /tmp/LAST_ID
> 00 2490599
> 08
> [r...@oss09 ~]# ls -1s /mnt/ost/O/0/d* | grep -v [a-z] | sort -k2 -n | tail -1
> 8 2490599
>
> However the MDS seems to think differently:
>
> r...@mds03 ~]# lctl get_param osc.*.prealloc_last_id | grep OST0010
> osc.scratch2-OST0010-osc.prealloc_last_id=1

Yeah. Is this caused by deactivating the OST on the MDS?

> I have deactivated the OST on the MDS using this command:
> lctl --device 19 conf_param scratch2-OST0010.osc.active=0
>
> I looked into the lov_objid reported by the MDS but I am not sure how to interpret the output correctly:
>
> [r...@mds03 ~]# od -Ax -td8 /tmp/lov_objid
> 00 2073842 2100049
> 10 2115247 2038471
> 20 2119821 2190996
> 30 2029234 2354424
> 40 2160856 2167105
> 50 1970351 2059045
> 60 2706486 2571655
> 70 2662262 2628346
> 80 2490688 2668926
> 90 2631587 2643791
> a0
>
> So my question is how I can find out if my LAST_ID is fine?

Above you deactivated OST0010 (hex), so OST-16 in decimal (counting starts with zero). That should be 2490688 then.

I still wonder if we could convince e2fsck to set that last_id value on the OST itself. It already can correct a wrong last_id value, but it sets that to the last_id it finds on disk (https://bugzilla.lustre.org/show_bug.cgi?id=22734). Setting it to the MDS value should also work, but firstly, for sanity reasons it falls back to the on-disk value if the values differ too much (1), and secondly, I figured out with those patches there that using the MDS value is broken (and it did not get broken by the patches, but my patches revealed it...).

Cheers,
Bernd

-- 
Bernd Schubert
DataDirect Networks

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
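To make the od interpretation above concrete, here is a small self-contained sketch with synthetic values: lov_objid is an array of little-endian 64-bit integers, one per OST in index order, so the entry for OST index N starts at byte offset N*8 (OST0010 hex = index 16 decimal = offset 0x80, which is why the 2490688 entry shows up on the "80" row of the od output).

```shell
# Byte offset of the entry for OST index 16 in a lov_objid-style file:
printf 'offset of index 16: 0x%x\n' $((16 * 8))   # prints: offset of index 16: 0x80

# Build a two-entry sample file (values 1 and 2490688, little-endian;
# 2490688 = 0x260140) and decode it the same way as on a real MDS.
# od -td8 uses the host byte order, assumed little-endian here.
printf '\001\000\000\000\000\000\000\000\100\001\046\000\000\000\000\000' \
    > /tmp/lov_objid.sample
od -Ax -td8 /tmp/lov_objid.sample
```

The file name and values are made up for illustration; a real lov_objid lives on the MDT and should only ever be read this way, never written.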
Re: [Lustre-discuss] recovering formatted OST
Hello Lisa,

the OST index and the fsname identify the OST to the MGS, MDS and clients. If you reformat an OST and do not re-use the old index, it will leave a hole, as the new OST gets another index. And OST holes are an uncommon scenario that often triggers some bugs...

Cheers,
Bernd

On Tuesday, October 26, 2010, Lisa Giacchetti wrote:
> Wojciech,
> since you have successfully done step #4, can you tell me what you used in the reformat for the old index id? I tried to do this a few weeks ago and was not successful at reformatting an OST with the old index, because I am not clear on what the index is. I asked on this list at that time for input and did not get much. If you could provide the exact command you used, that would be good too.
> lisa
>
> On 10/26/10 10:31 AM, Wojciech Turek wrote:
>> Since some of our users started to recover their data from backups or by other means (rerunning jobs etc.) into the original locations, I don't think it would be a good idea to put the recovered OST back in service as it is, as that may cause some of the users' new files to be overwritten by the recovered files. To avoid that scenario I decided to reformat the old OST and put it back into the filesystem as empty.
>> 1) First I created a backup of the recovered object files
>> 2) then, using lfs find and lfs getstripe on the client, I created a list of files and object ids from the formatted OST
>> 3) using the backup from point 1 and the information from point 2, I copied the objects to a new location on the filesystem and renamed them to their original names. Now users can interrogate those files and choose which they want to keep.
>> 4) I reformatted the old OST with the old index id and old label
>>
>> Before I mount that OST into the filesystem I want to make sure that the MDS detects it as an empty OST and does not try to recreate missing objects. Would it be enough to remove lov_objid from the MDT and let it create a new lov_objid based on information from the OSTs, or do I need to first unlink all missing files from the client?
Best regards, Wojciech On 26 October 2010 05:36, Wojciech Turek wj...@cam.ac.uk mailto:wj...@cam.ac.uk wrote: Bernd, I would like to clarify if I understood you suggestion correctly: 1) create a new OST but using old index and old label 2) mount it as ldiskfs and copy recovered objects (using tar or rsync with xattrs support) from the old OST to the new OST 3) run --writeconf on MDT and OST of that filesystem 4) mount MDT and all OSTs I guess I could do it also that way: 1) backup restored object using tar or rsync with xattrs support 2) format old OST with old index and old label 3) restore Objects from the backup Do you think that would work? Best regards, Wojciech On 22 October 2010 18:52, Bernd Schubert bernd.schub...@fastmail.fm mailto:bernd.schub...@fastmail.fm wrote: Hmm, I would probably format a small fake device on a ramdisk and copy files over, run tunefs --writeconf /mdt and then start everything (inlcuding all OSTs) again. Cheers, On Friday, October 22, 2010, Wojciech Turek wrote: I have tried Bernd's suggestion and it seem to have worked, after running e2fsck -D ll_recover_lost_found_objs didn't cause kernel panic but moved a number of objects to O directory. Problem is that I do not have last_rcvd file so the OST has no index at the moment. What would be the next step to enable access to those files in the filesystem? Best regards, Wojciech On 22 October 2010 17:15, Andreas Dilger andreas.dil...@oracle.com mailto:andreas.dil...@oracle.com wrote: On 2010-10-22, at 5:42, Bernd Schubert bernd.schub...@fastmail.fm mailto:bernd.schub...@fastmail.fm wrote: Hmm, e2fsck didn't catch that? rec_len is the length of a directory entry, so after how many bytes the next entry follows. I agree that e2fsck should have caught that. You can try to force e2fsck to do something about that: e2fsck -D No, I would recommend against using -D at this point. 
That will cause it to re-write the directory contents, and given that the filesystem was previously corrupted I would prefer making as few changes
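A hedged sketch of how step 4 (reformatting with the old index) plus the subsequent writeconf might look; every name, NID and device path below is a placeholder, not taken from the thread:

```shell
# Recreate the OST with its old index (e.g. OST0010 hex = --index=16)
# and the old fsname, so no index hole is left behind:
mkfs.lustre --ost --fsname=scratch2 --index=16 \
    --mgsnode=mgs_host@o2ib /dev/ost_device

# With all targets unmounted, regenerate the configuration logs so
# everything re-registers cleanly:
tunefs.lustre --writeconf /dev/mdt_device   # on the MDS
tunefs.lustre --writeconf /dev/ost_device   # on each OSS, for each OST

# Then mount the MDT first, followed by all OSTs.
```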
Re: [Lustre-discuss] 1.8 quotas
Hello Jason,

please note that it is also possible to enable quotas using lctl, and that would not be visible via tunefs.lustre. I think the only real option to check whether quotas are enabled is to check if the quota files exist. For an online filesystem, 'debugfs -c /dev/device' is probably the safest way (there is also a 'secret' way to bind-mount the underlying ldiskfs to another directory, but I only use that for test filesystems and never in production, as I have not verified the kernel code path yet). Either way, you should check for lquota files, such as:

r...@rhel5-nfs@phys-oss0:~# mount -t ldiskfs /dev/mapper/ost_demofs_2 /mnt
r...@rhel5-nfs@phys-oss0:~# ll /mnt
[...]
-rw-r--r-- 1 root root  7168 Oct 23 09:48 lquota_v2.group
-rw-r--r-- 1 root root 71680 Oct 23 09:48 lquota_v2.user

(Of course, you should check that on those OSTs which have reported the slow quota messages.)

I just poked around a bit in the code, and above the fsfilt_check_slow() check there is also a loop that calls filter_range_is_mapped(). Now this function calls fs_bmap(), and when that eventually goes down to ext3, it might get a bit slow if another thread should modify that file (check out linux/fs/inode.c):

/*
 * bmap() is special. It gets used by applications such as lilo and by
 * the swapper to find the on-disk block of a specific piece of data.
 *
 * Naturally, this is dangerous if the block concerned is still in the
 * journal. If somebody makes a swapfile on an ext3 data-journaling
 * filesystem and enables swap, then they may get a nasty shock when the
 * data getting swapped to that swapfile suddenly gets overwritten by
 * the original zero's written out previously to the journal and
 * awaiting writeback in the kernel's buffer cache.
 *
 * So, if we see any bmap calls here on a modified, data-journaled file,
 * take extra steps to flush any blocks which might be in the cache.
 */

I don't know, though, if it can happen that several threads write to the same file.
But if it happens, it gets slow. I wonder if a possible swap file is worth the effort here... In fact, the reason to call filter_range_is_mapped() certainly does not require a journal flush in that loop. I will check myself next week whether journal flushes are ever made due to that, and open a Lustre bugzilla then. Avoiding all of that should not be difficult.

Cheers,
Bernd

On Saturday, October 23, 2010, Jason Hill wrote:
> Kevin/Dave/(and Dave from DDN):
> Thanks for your replies. From tunefs.lustre --dryrun it is very apparent that we are not running quotas. Thanks for your assistance.
>
>> That message, from lustre/obdfilter/filter_io_26.c, is the result of the thread taking 35 seconds from when it entered filter_commitrw_write() until after it called lquota_chkquota() to check the quota. However, it is certainly plausible that the thread was delayed because of something other than quotas, such as an allocation (e.g., it could have been stuck in filter_iobuf_get).
>> Kevin

-- 
Bernd Schubert
DataDirect Networks

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
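A sketch of the debugfs check Bernd describes above; the device path is a placeholder. The -c flag opens the device in catastrophic (read-only) mode, which is what makes it reasonably safe to use while the OST is mounted.

```shell
# Hypothetical OST device; -c opens it read-only without loading bitmaps.
debugfs -c -R 'ls -l /' /dev/mapper/ost_device 2>/dev/null | grep lquota \
    || echo "no lquota files found, so quotas are apparently not enabled"
```

If quotas are enabled, the listing should show lquota_v2.user and lquota_v2.group as in Bernd's example.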
Re: [Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K
Hello Michael,

On Saturday, October 23, 2010, Michael Kluge wrote:
> Hi Bernd,
> I get the same message with your kernel RPMs:
>
> In file included from include/linux/list.h:6,
>   from include/linux/mutex.h:13,
>   from /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/core/addr.c:36:
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_addons/backport/2.6.18_FC6/include/linux/stddef.h:9: error: redeclaration of enumerator 'false'
> include/linux/stddef.h:16: error: previous definition of 'false' was here
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_addons/backport/2.6.18_FC6/include/linux/stddef.h:11: error: redeclaration of enumerator 'true'
> include/linux/stddef.h:18: error: previous definition of 'true' was here
>
> Could it be that this '2.6.18 being almost a 2.6.28/29' confuses the OFED backports and the 2.6.18 backport does not work anymore? Is that solvable? I found nothing in the OFED bugzilla.

Somewhere there is a support matrix showing which OFED version supports which RHEL version, but I would also need to search for it. Anyway, OFED-1.4 is already included in 2.6.18-164, so there is no need for any additional compilation. 2.6.18-194 (RHEL5.5) also still mostly has OFED-1.4, but with an important Mellanox driver backport (you will still additionally need a beta version to get reliable QDR with recent chips). So if you have Mellanox QDR HCAs and your connection is flaky between SDR and QDR, just compile OFED-1.5; it works fine with Lustre (fortunately there have been no interface changes recently). But still make sure you compile Lustre against that stack...

I also just updated our download page a bit and uploaded sources for the kernel, Lustre, tar and e2fsprogs.

Cheers,
Bernd

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K
On Friday, October 22, 2010, Michael Kluge wrote:
> Hi list,
>
>> DID_BUS_BUSY means that the controller is unable to handle the SCSI command and is basically asking the host to send it again later.
>
> I had, I think, just one concurrent region and 32 threads running. What would be the appropriate action in this case? Reducing the queue depth on the HBA? We have Qlogic here; there is an option for the kernel module for this.

I think you ran into a known issue with the QLogic driver and the SFA10K. You will need at least qla2xxx version 8.03.01.06.05.06-k. And the optimal number of commands is likely to be 16 (with 4 OSSes connected).

Hope it helps,
Bernd

-- 
Bernd Schubert
DataDirect Networks

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] recovering formatted OST
Hmm, e2fsck didn't catch that? rec_len is the length of a directory entry, so after how many bytes the next entry follows. You can try to force e2fsck to do something about that: e2fsck -D Cheers, Bernd On Friday, October 22, 2010, Wojciech Turek wrote: Ok, removing and recreating the journal fixed that problem and I am able to mount device as ldiskfs filesystem. Now I hit another wall when trying to run ll_recover_lost_found_objs When I first time run ll_recover_lost_found_objs -d /mnt/ost/lost+found it only creates the O dir and exits. When I repeat this command again kernel panics. Any idea what could be the problem here? LDISKFS-fs error (device dm-4): ldiskfs_readdir: bad entry in directory #6831: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0 Aborting journal on device dm-4. Unable to handle kernel NULL pointer dereference at RIP: [88033448] :jbd:journal_commit_transaction+0xc5b/0x12db PGD 1a118d067 PUD 1ce7e7067 PMD 0 Oops: 0002 [1] SMP last sysfs file: /class/infiniband_mad/umad0/port CPU 3 Modules linked in: ldiskfs(U) crc16(U) autofs4(U) hidp(U) l2cap(U) bluetooth(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U) mlx4_vnic(U) mlx4_vnic_helper(U) ib_sa(U) ib_mthca(U) mptctl(U) dm_mirror(U) video(U) backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) i2c_core(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sr_mod(U) cdrom(U) mlx4_ib(U) ib_mad(U) ib_core(U) joydev(U) mlx4_core(U) usb_storage(U) pcspkr(U) shpchp(U) serio_raw(U) i5000_edac(U) edac_mc(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) sunrpc(U) mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) mppVhba(U) megaraid_sas(U) mppUpper(U) sg(U) sd_mod(U) scsi_mod(U) bnx2(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U) Pid: 11360, comm: 
kjournald Tainted: G 2.6.18-194.3.1.el5_lustre.1.8.4 #1 RIP: 0010:[88033448] [88033448] :jbd:journal_commit_transaction+0xc5b/0x12db RSP: 0018:8101c6481d90 EFLAGS: 00010246 RAX: RBX: RCX: RDX: RSI: 8101e9dab0c0 RDI: 81022fa46000 RBP: 81022fa46000 R08: 81022fa46068 R09: R10: 810105925b20 R11: fffa R12: R13: R14: 8101e9dab0c0 R15: FS: () GS:810107b9a4c0() knlGS: CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b CR2: CR3: 0001eaffb000 CR4: 06e0 Process kjournald (pid: 11360, threadinfo 8101c648, task 81021c14c0c0) Stack: 8101a61b9000 2b8263c0 113b0001 0013 0111 01282dd7 20dd Call Trace: [8003da91] lock_timer_base+0x1b/0x3c [8004b347] try_to_del_timer_sync+0x7f/0x88 [88037386] :jbd:kjournald+0xc1/0x213 [800a0ab2] autoremove_wake_function+0x0/0x2e [800a089a] keventd_create_kthread+0x0/0xc4 [880372c5] :jbd:kjournald+0x0/0x213 [800a089a] keventd_create_kthread+0x0/0xc4 [80032890] kthread+0xfe/0x132 [8005dfb1] child_rip+0xa/0x11 [800a089a] keventd_create_kthread+0x0/0xc4 [8014bcf4] deadline_queue_empty+0x0/0x23 [80032792] kthread+0x0/0x132 [8005dfa7] child_rip+0x0/0x11 Code: f0 0f ba 33 01 e8 42 fc 02 f8 8b 03 a8 04 75 07 8b 43 58 85 RIP [88033448] :jbd:journal_commit_transaction+0xc5b/0x12db RSP 8101c6481d90 CR2: 0Kernel panic - not syncing: Fatal exception On 22 October 2010 03:09, Andreas Dilger andreas.dil...@oracle.com wrote: On 2010-10-21, at 18:44, Wojciech Turek wj...@cam.ac.uk wrote: fsck has finished and does not find any more errors to correct. However when I try to mount the device as ldiskfs kernel panics with following message: Assertion failure in cleanup_journal_tail() at fs/jbd/checkpoint.c:459: blocknr != 0 Hmm, not sure, maybe your journal is broken? You can delete it with tune2fs -O ^has_journal (maybe after running e2fsck again to clear the journal), then re-create it with tune2fs -j. 
--- [cut here ] - [please bite here ] - Kernel BUG at fs/jbd/checkpoint.c:459 invalid opcode: [1] SMP last sysfs file: /class/infiniband_mad/umad0/ port CPU 2 Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) autofs4(U) hidp(U) l2cap(U)
Re: [Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K
Hello Michael,

I'm sorry to hear that. Unfortunately, I really do not have the time to port this version to your kernel version. I remember that you use Debian. But I guess you are still using a SLES kernel then? You could ask SuSE about it, although I guess they only care about SP1 with the 2.6.32 SLES kernel now. If you use Debian Lenny, the RHEL5 kernel should work (despite its name, it is internally more or less a 2.6.29 to 2.6.32 kernel). Later Debian and Ubuntu releases have a more recent udev, which requires at least 2.6.27. You could also ask our support department if they have any news for 2.6.27. I'm in Lustre engineering, and as we only support RHEL5 right now, I so far did not care about other kernel versions too much.

If nothing else helps, you will need to set the queue depth to 1, but that will also impose a big performance hit :(

Cheers,
Bernd

On Friday, October 22, 2010, Michael Kluge wrote:
> Hi Bernd,
> I have found a RHEL-only release for this version. It does not compile on a 2.6.27 kernel :( I actually don't want to go back to 2.6.18 just to get a new driver.
>
> Michael
>
> On Friday, 22.10.2010, 13:34 +0200, Bernd Schubert wrote:
>> On Friday, October 22, 2010, Michael Kluge wrote:
>>> Hi list,
>>>> DID_BUS_BUSY means that the controller is unable to handle the SCSI command and is basically asking the host to send it again later.
>>> I had, I think, just one concurrent region and 32 threads running. What would be the appropriate action in this case? Reducing the queue depth on the HBA? We have Qlogic here; there is an option for the kernel module for this.
>> I think you ran into a known issue with the QLogic driver and the SFA10K. You will need at least qla2xxx version 8.03.01.06.05.06-k. And the optimal number of commands is likely to be 16 (with 4 OSSes connected).
>> Hope it helps,
>> Bernd

-- 
Bernd Schubert
DataDirect Networks

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] recovering formatted OST
Hmm, I would probably format a small fake device on a ramdisk and copy the files over, run tunefs --writeconf /mdt and then start everything (including all OSTs) again. Cheers, On Friday, October 22, 2010, Wojciech Turek wrote: I have tried Bernd's suggestion and it seems to have worked; after running e2fsck -D, ll_recover_lost_found_objs didn't cause a kernel panic but moved a number of objects to the O directory. The problem is that I do not have the last_rcvd file, so the OST has no index at the moment. What would be the next step to enable access to those files in the filesystem? Best regards, Wojciech On 22 October 2010 17:15, Andreas Dilger andreas.dil...@oracle.com wrote: On 2010-10-22, at 5:42, Bernd Schubert bernd.schub...@fastmail.fm wrote: Hmm, e2fsck didn't catch that? rec_len is the length of a directory entry, i.e. after how many bytes the next entry follows. I agree that e2fsck should have caught that. You can try to force e2fsck to do something about it: e2fsck -D No, I would recommend against using -D at this point. That will cause it to re-write the directory contents, and given that the filesystem was previously corrupted I would prefer making as few changes as possible before the data is extracted. Wojciech, note that if you are able to mount the filesystem you could just copy all of the objects (with xattrs!) from lost+found on the bad filesystem, along with the last_rcvd file (if you can find it), into a new ldiskfs filesystem and then run ll_recover_lost_found_objs on that. On Friday, October 22, 2010, Wojciech Turek wrote: Ok, removing and recreating the journal fixed that problem and I am able to mount the device as an ldiskfs filesystem. Now I hit another wall when trying to run ll_recover_lost_found_objs. When I first run ll_recover_lost_found_objs -d /mnt/ost/lost+found it only creates the O dir and exits. When I repeat this command the kernel panics. Any idea what could be the problem here?
LDISKFS-fs error (device dm-4): ldiskfs_readdir: bad entry in directory #6831: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0 Aborting journal on device dm-4. Unable to handle kernel NULL pointer dereference at RIP: [88033448] :jbd:journal_commit_transaction+0xc5b/0x12db PGD 1a118d067 PUD 1ce7e7067 PMD 0 Oops: 0002 [1] SMP last sysfs file: /class/infiniband_mad/umad0/port CPU 3 Modules linked in: ldiskfs(U) crc16(U) autofs4(U) hidp(U) l2cap(U) bluetooth(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U) mlx4_vnic(U) mlx4_vnic_helper(U) ib_sa(U) ib_mthca(U) mptctl(U) dm_mirror(U) video(U) backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) i2c_core(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sr_mod(U) cdrom(U) mlx4_ib(U) ib_mad(U) ib_core(U) joydev(U) mlx4_core(U) usb_storage(U) pcspkr(U) shpchp(U) serio_raw(U) i5000_edac(U) edac_mc(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) sunrpc(U) mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) mppVhba(U) megaraid_sas(U) mppUpper(U) sg(U) sd_mod(U) scsi_mod(U) bnx2(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U) Pid: 11360, comm: kjournald Tainted: G 2.6.18-194.3.1.el5_lustre.1.8.4 #1 RIP: 0010:[88033448] [88033448] :jbd:journal_commit_transaction+0xc5b/0x12db RSP: 0018:8101c6481d90 EFLAGS: 00010246 RAX: RBX: RCX: RDX: RSI: 8101e9dab0c0 RDI: 81022fa46000 RBP: 81022fa46000 R08: 81022fa46068 R09: R10: 810105925b20 R11: fffa R12: R13: R14: 8101e9dab0c0 R15: FS: () GS:810107b9a4c0() knlGS: CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b CR2: CR3: 0001eaffb000 CR4: 06e0 Process kjournald (pid: 11360, threadinfo 8101c648, task 81021c14c0c0) Stack: 8101a61b9000 2b8263c0 113b0001 0013 0111 01282dd7 20dd Call Trace: [8003da91] lock_timer_base+0x1b/0x3c [8004b347] 
try_to_del_timer_sync+0x7f/0x88 [88037386] :jbd:kjournald+0xc1/0x213 [800a0ab2] autoremove_wake_function+0x0/0x2e [800a089a] keventd_create_kthread+0x0/0xc4 [880372c5] :jbd:kjournald+0x0/0x213
Re: [Lustre-discuss] recovering formatted OST
Er no, mkfs.lustre --index=${the_right_index}. Cheers, Bernd On Friday, October 22, 2010, Wojciech Turek wrote: Ok, but this means that the new OST will come up with a new index (the next available one). Maybe this is a stupid question, but how will the MDS know that the missing files now reside on a new OST? On 22 October 2010 18:52, Bernd Schubert bernd.schub...@fastmail.fm wrote: Hmm, I would probably format a small fake device on a ramdisk and copy the files over, run tunefs --writeconf /mdt and then start everything (including all OSTs) again. Cheers,
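Bernd's point about reusing the old slot can be made concrete: the replacement (or fake) OST must be formatted with the original --index so the MDS keeps addressing it as the same target. The sketch below only assembles and prints the command; every value is a placeholder, and the flag names should be checked against your mkfs.lustre version before running anything:

```shell
# Assemble the reformat command for a replacement OST that reuses the
# original index (29 here, matching the hosed slot in this thread).
FSNAME=testfs                 # placeholder filesystem name
MGS_NID=192.168.0.1@tcp       # placeholder MGS NID
OST_INDEX=29                  # the index the formatted OST used to have
DEV=/dev/ram1                 # placeholder fake/ramdisk device
CMD="mkfs.lustre --reformat --ost --fsname=${FSNAME} --index=${OST_INDEX} --mgsnode=${MGS_NID} ${DEV}"
echo "$CMD"                   # printing only, not executing; review carefully
```

The real index can be read back from a surviving OST with `tunefs.lustre --print` before reformatting.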
On Friday, October 22, 2010, Wojciech Turek wrote: Ok, removing and recreating the journal fixed that problem and I am able to mount the device as an ldiskfs filesystem. Now I hit another wall when trying to run ll_recover_lost_found_objs. When I first run ll_recover_lost_found_objs -d /mnt/ost/lost+found it only creates the O dir and exits. When I repeat this command the kernel panics. Any idea what could be the problem here?
Re: [Lustre-discuss] recovering formatted OST
On Friday, October 22, 2010, Andreas Dilger wrote: On 2010-10-22, at 12:25, Wojciech Turek wrote: Actually I remember now, Andreas wrote some time ago that when one adds an OST into the same slot as the old one, the MDS will think that the OST has objects up to what the old OST had, and when the new OST starts it will recreate those objects, which may use a lot of inodes and space. So a loop device or ramdisk may not be enough for that? The ll_recover_lost_found_objs will at least recreate the O/0/LAST_ID file with the highest-available object ID, but given the corruption of the filesystem this may not cover all of the objects previously created. I would suggest reading the last_id for this OST from the MDS: mds# lctl get_param osc.*.prealloc_last_id and then using a binary editor to set the LAST_ID on the recovered OST, if it is significantly different. Hmm, if you remember, I have in my last_id patch a TODO: new tool?. What about simply manually creating an empty file on the OST with that ID (in the right obj-id % 32 directory) and then letting e2fsck do the job (I guess our DDN e2fsck is the only one which can do that so far). Cheers, Bernd
Re: [Lustre-discuss] recovering formatted OST
Hello Wojciech Turek, On Thursday, October 21, 2010, Wojciech Turek wrote: Hi Andreas, I restarted fsck after the segfault; it ran for several hours and then segfaulted again. Pass 3A: Optimizing directories Failed to optimize directory ??? (73031): EXT2 directory corrupted Failed to optimize directory ??? (73041): EXT2 directory corrupted Failed to optimize directory ??? (75203): EXT2 directory corrupted Failed to optimize directory ??? (75357): EXT2 directory corrupted Failed to optimize directory ??? (75744): EXT2 directory corrupted Failed to optimize directory ??? (75806): EXT2 directory corrupted Failed to optimize directory ??? (75825): EXT2 directory corrupted Failed to optimize directory ??? (75913): EXT2 directory corrupted Failed to optimize directory ??? (75926): EXT2 directory corrupted Failed to optimize directory ??? (76034): EXT2 directory corrupted Failed to optimize directory ??? (76083): EXT2 directory corrupted Failed to optimize directory ??? (76142): EXT2 directory corrupted Failed to optimize directory ??? (76266): EXT2 directory corrupted Failed to optimize directory ??? (76501): EXT2 directory corrupted Failed to optimize directory ??? (77133): EXT2 directory corrupted Failed to optimize directory ??? (77212): EXT2 directory corrupted Failed to optimize directory ??? (77817): EXT2 directory corrupted Failed to optimize directory ??? (77984): EXT2 directory corrupted Failed to optimize directory ??? (77985): EXT2 directory corrupted Segmentation fault Maybe try to disable dirindex? I noticed that the stack limit was quite low, so I have now changed it to unlimited; I also increased the limit on the number of open files (maybe it can help). Now I have another problem: after the last segfault I cannot restart fsck due to MMP.
e2fsck -fy /dev/scratch2_ost16vg/ost16lv e2fsck 1.41.10.sun2 (24-Feb-2010) e2fsck: MMP: fsck being run while trying to open /dev/scratch2_ost16vg/ost16lv The superblock could not be read or does not describe a correct ext2 filesystem. If the device is valid and it really contains an ext2 filesystem (and not swap or ufs or something else), then the superblock is corrupt, and you might try running e2fsck with an alternate superblock: e2fsck -b 32768 device Also when I try to access the filesystem via debugfs it fails: debugfs -c -R 'ls' /dev/scratch2_ost16vg/ost16lv debugfs 1.41.10.sun2 (24-Feb-2010) /dev/scratch2_ost16vg/ost16lv: MMP: fsck being run while opening filesystem ls: Filesystem not open Is there a way to clear the MMP flag so it allows fsck to run? You can try tune2fs -f -E clear-mmp. However, with a corrupted filesystem, that might not work. You can download a fixed e2fsprogs from my homepage that does allow running read-only operations (such as 'debugfs -c' or 'dumpe2fs -h'). Then you check which block is the MMP block and zero that. http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/e2fsprogs/ (which just reminds me, I need to upload it to our DDN download site) Also, do you really want to use data files that might have been zeroed in the middle? I think your recovery will, if at all, only be useful for small human-readable text files. Hope it helps, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] high CPU load limits bandwidth?
That is normal and probably comes from the page cache; it should be about the same for Lustre, ldiskfs, ext4, XFS, etc. It goes down if you specify -odirect, which is obviously not optimal on Lustre clients. Cheers, Bernd On Wednesday, October 20, 2010, Andreas Dilger wrote: Is this client CPU or server CPU? If you are using Ethernet it will definitely be CPU hungry and can easily saturate a single core. Cheers, Andreas On 2010-10-20, at 8:41, Michael Kluge michael.kl...@tu-dresden.de wrote: Hi list, is it normal that a 'dd' or an 'IOR' pushing 10MB blocks to a Lustre file system shows up with 100% CPU load in 'top'? The reason why I am asking is that I can write from one client to one OST with 500 MB/s. The CPU load will be at 100% in this case. If I stripe over two OSTs (which use different OSS servers and different RAID controllers) I will get 500 as well (seeing 2x250 MB/s on the OSTs). The CPU load will be at 100% again. A 'dd' on my desktop pushing 10M blocks to the local disk shows 7-10% CPU load. Are there ways to tune this behavior? Changing max_rpcs_in_flight and max_dirty_mb did not help. Regards, Michael
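The client-side CPU cost Bernd attributes to the page cache is visible with plain dd on any filesystem. A sketch; conv=fdatasync keeps the buffered (cache-copy) path but forces the data out before dd exits, while oflag=direct, where the filesystem supports it, is the cache-free path discussed above:

```shell
# Buffered write through the page cache; fdatasync makes dd wait for
# the data to reach storage so the timing is honest.
F=$(mktemp)
dd if=/dev/zero of="$F" bs=1M count=8 conv=fdatasync 2>/dev/null
SIZE=$(stat -c %s "$F")
echo "wrote $SIZE bytes"      # 8 MiB = 8388608 bytes
# Cache-free variant (needs filesystem support and aligned I/O):
#   dd if=/dev/zero of=somefile bs=1M count=8 oflag=direct
rm -f "$F"
```

Watching `time dd ...` for both variants shows the user/sys CPU difference the thread is about; on a Lustre client, O_DIRECT also disables client-side caching, which is usually a net loss.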
Re: [Lustre-discuss] mkfs options/tuning for RAID based OSTs
On Wednesday, October 20, 2010, Charland, Denis wrote: Brian J. Murrell wrote: On Tue, 2010-10-19 at 21:00 -0400, Edward Walter wrote: This is why the recommendations in this thread have continued to be using a number of data disks that divides evenly into 1MB (i.e. powers of 2: 2, 4, 8, etc.). So for RAID6: 4+2 or 8+2, etc. What about RAID5? Personally I don't like RAID5 too much, but with RAID5 it is obviously +1 instead of +2. Cheers, Bernd -- Bernd Schubert DataDirect Networks
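The power-of-two rule exists so a full RAID stripe equals the 1MB Lustre RPC; the corresponding mkfs alignment options fall out of the geometry. A sketch assuming a 128 KiB chunk per data disk and 4 KiB blocks (exactly the values that yield the stride=32,stripe-width=256 options quoted elsewhere in this digest):

```shell
# Derive ldiskfs alignment options from RAID geometry.
CHUNK_KB=128                              # per-disk chunk size (assumed)
DATA_DISKS=8                              # 8+2 RAID6 -> 8 data disks
BLOCK_KB=4                                # ldiskfs block size
STRIDE=$((CHUNK_KB / BLOCK_KB))           # fs blocks per disk chunk
STRIPE_WIDTH=$((STRIDE * DATA_DISKS))     # fs blocks per full stripe
FULL_STRIPE_KB=$((CHUNK_KB * DATA_DISKS)) # should be 1024 (= 1 MiB RPC)
echo "-E stride=${STRIDE},stripe-width=${STRIPE_WIDTH}"
```

With RAID5 the same arithmetic applies, only with one parity disk (e.g. 8+1), which is why the data-disk count is what must stay a power of two.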
Re: [Lustre-discuss] ldiskfs performance vs. XFS performance
For your final filesystem you still probably want to enable async journals (unless you are willing to enable the S2A unmirrored device cache). Most obdecho/obdfilter-survey bugs are gone in 1.8.4, except your ctrl+c problem, for which a patch exists: https://bugzilla.lustre.org/show_bug.cgi?id=21745 Cheers, Bernd On Wednesday, October 20, 2010, Michael Kluge wrote: Thanks a lot for all the replies. sgpdd shows 700+ MB/s for the device. We ran into one or two bugs with obdfilter-survey, as lctl has at least one bug in 1.8.3 when it uses multiple threads, and obdfilter-survey also causes an LBUG when you CTRL+C it. We see 600+ MB/s for obdfilter-survey over a reasonable parameter space after we changed to the ext4-based ldiskfs. So that seems to be the trick. Michael On Monday, 18.10.2010 at 14:04 -0600, Andreas Dilger wrote: On 2010-10-18, at 10:40, Johann Lombardi wrote: On Mon, Oct 18, 2010 at 01:58:40PM +0200, Michael Kluge wrote: dd if=/dev/zero of=$RAM_DEV bs=1M count=1000 mke2fs -O journal_dev -b 4096 $RAM_DEV mkfs.lustre --device-size=$((7*1024*1024*1024)) --ost --fsname=luram --mgsnode=$MDS_NID --mkfsoptions=-E stride=32,stripe-width=256 -b 4096 -j -J device=$RAM_DEV /dev/disk/by-path/... mount -t ldiskfs /dev/disk/by-path/... /mnt/ost_1 In fact, Lustre uses additional mount options (see Persistent mount opts in tunefs.lustre output). If your ldiskfs module is based on ext3, you should add the extents and mballoc options which are known to improve performance. Even then, the IO submission path of ext3 from userspace is not very good, and such a performance difference is not unexpected. When submitting IO from userspace to ext3/ldiskfs it is being done in 4kB blocks, and each block is allocated separately (regardless of mballoc, unfortunately). When Lustre is doing IO from the kernel, the client is aggregating the IO into 1MB chunks and the entire 1MB write is allocated in one operation.
That is why we developed the delalloc code for ext4 - so that userspace could also get better IO performance, and utilize the multi-block allocation (mballoc) routines that have been in ldiskfs for ages, but only accessible from the kernel. For Lustre performance testing, I would suggest looking at lustre-iokit, and in particular sgpdd to test the underlying block device, and then obdfilter-survey to test the local Lustre IO submission path. Cheers, Andreas -- Andreas Dilger Lustre Technical Lead Oracle Corporation Canada Inc. -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Maximum OST Size
On Wednesday, October 20, 2010, Andreas Dilger wrote: On 2010-10-19, at 08:27, Roger Spellman wrote: I don't understand this comment: For the MDT, yes, you could potentially use -i 1500 as about the minimum space per inode, but then you risk running out of space in the filesystem before running out of inodes. If we set -I to 512, then on an MDT, what else is there that would require 1500 bytes per inode? With -I 512 that means the actual inode will consume 512 bytes, so with -i 1536 there would be 1024 bytes per inode of block space still available. That extra space is needed for everything else in the filesystem, including the journal, directory blocks, Lustre metadata (last_rcvd, distributed transaction logs, etc), and any external xattr blocks for widely-striped files (beyond 12 stripes or so). I have to admit, I entirely fail to understand why we should need 2/3 of the filesystem reserved for real file data. - journal - 400MB - negligible with recent decent MDT sizes (1TiB+) - directory blocks - maybe, but I have not noticed any system where that takes more than 5% - Lustre metadata (last_rcvd, distributed transaction logs, etc) - negligible with recent decent MDT sizes - external xattrs for the Lustre lov and additional ACLs - maybe, depends on the customer With the default -i 4096, it looks like this for most customers I know of: df -h: 973G 57G 861G 7% /lustre/lustre/mdt df -ih: 278M 248M 31M 89% /lustre/lustre/mdt So doubling the inode count with -i 2048, or even quadrupling it with -i 1024, seems recommendable. Cheers, Bernd
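The trade-off is plain arithmetic over bytes-per-inode: -i N yields device_size/N inodes, and whatever the inodes themselves do not consume stays available for the journal, directories, and xattr blocks. A sketch for a 1 TiB MDT (roughly the 973G device quoted above), reproducing the ~268M inode count that -i 4096 implies:

```shell
# Inode counts for a 1 TiB MDT at different bytes-per-inode ratios.
MDT_BYTES=$((1024 * 1024 * 1024 * 1024))  # 1 TiB
for RATIO in 4096 2048 1024; do
    echo "-i ${RATIO} -> $((MDT_BYTES / RATIO)) inodes"
done
INODES_4096=$((MDT_BYTES / 4096))         # 268435456, i.e. ~268M
```

Halving the ratio doubles the inode count while still leaving (ratio - inode_size) bytes of block space per inode for everything else.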
Re: [Lustre-discuss] high CPU load limits bandwidth?
On Wednesday, October 20, 2010, Andreas Dilger wrote: On 2010-10-20, at 10:40, Michael Kluge michael.kl...@tu-dresden.de wrote: It is the CPU load on the client. The dd/IOR process is using one core completely. The clients and the servers are connected via DDR IB. LNET bandwidth is at 1.8 GB/s. Servers have 1.8.3, the client has 1.8.3 patchless. If you only have a single-threaded write, then saturating a CPU is somewhat unavoidable due to copy_from_user(). O_DIRECT will avoid this. Also, disabling data checksums and debugging can help considerably. There is a patch in bugzilla to add support for h/w crc32c on Nehalem CPUs to reduce this overhead, but it is still not as fast as no checksum at all. I think checksums are only visible in ptlrpc CPU time (and mostly only for reads), but not in the user-space benchmark process. Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] NFS export
Hello Alfonso, On Monday, October 18, 2010, Alfonso Pardo wrote: Hello, I need to export a Lustre directory from one Lustre client to another client, but always get the following message on the NFS server: Cannot export /data, possibly unsupported filesystem or fsid= required Any suggestions? Add fsid=$some_number to your NFS export line. Please also note that if you are using a RedHat system, you should use a patched Lustre server version on the NFS export node (the Lustre client), as the RedHat default 4K stack size is too small and Lustre-patched kernels have increased it to 8K (the default on all systems except RHEL). Cheers, Bernd PS: Btw, Ciemat has a DDN Lustre system, so you could also send requests to supp...@ddn.com (please add [Lustre] in the subject line). -- Bernd Schubert DataDirect Networks
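Bernd's fsid= fix as an exports line: Lustre has no stable block-device number from which NFS can derive an export identifier, hence the explicit fsid. Staged into a temporary file here; '/data' is the path from the question, and the client network spec is a placeholder:

```shell
# Stand-in for /etc/exports on the re-exporting Lustre client.
EXPORTS=$(mktemp)
echo '/data 192.168.1.0/24(rw,fsid=1,no_subtree_check)' > "$EXPORTS"
cat "$EXPORTS"
# afterwards: exportfs -ra   (not run here)
```

Any small positive integer works for fsid, as long as it is unique among the exports on that node.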
Re: [Lustre-discuss] NFS export
Do *NOT* use 1.8.0 please, that is really old. You can easily install an updated kernel and Lustre version on CentOS 5.2. So you may install upstream Oracle 1.8.4 (downloads.lustre.org). We also have patched 1.8.3-ddn3.3 with the latest RedHat security patches (and patches for Lustre): http://eu.ddn.com:8080/lustre/lustre/1.8.3/ddn3.3/ (1.8.4-ddnX is in testing). Cheers, Bernd On Monday, October 18, 2010, Alfonso Pardo wrote: Yes, I am using a RedHat system (CentOS 5.2). Please, could you tell me where I can find that patched Lustre server? OS: CentOS 5.2 Lustre version: Client lustre 1.8.0
Re: [Lustre-discuss] ldiskfs performance vs. XFS performance
Hello Michael, On Monday, October 18, 2010, Michael Kluge wrote: Hi list, we have Lustre 1.8.3 running on a DDN 9900. One LUN (10 discs) formatted with XFS shows 400 MB/s if driven with one 'dd' and large block sizes. One LUN formatted and mounted with ldiskfs (the ext3-based one that is the default in 1.8.3) shows 110 MB/s. Is this the expected behaviour? It looks a bit low compared to XFS. Yes, unfortunately not entirely unexpected with upstream Oracle versions. Firstly, please send a mail to supp...@ddn.com and ask for the udev tuning rpm (please add [Lustre] in the subject line). Then see this MMP issue here: https://bugzilla.lustre.org/show_bug.cgi?id=23129 which requires https://bugzilla.lustre.org/show_bug.cgi?id=22882 (as Lustre requires contributor agreements and self-signed agreements do not work anymore, that presently causes some headache, and as always with bureaucracy it takes ages to sort out - so landing our patches is delayed at present). In order to prevent data corruption in case of controller failures, you should also disable the S2A write-back cache and enable async journals instead on Lustre (enabled by default in DDN Lustre versions). We think with help from DDN we did everything we can from a hardware perspective. We formatted the LUN with the correct striping and stripe size, DDN adjusted some controller parameters and we even put the file system journal on a RAM disk. The LUN has 16 TB capacity. I formatted only 7 for the moment due to the 8 TB limit. You should use ext4-based ldiskfs to get more than 8TiB. Our releases use that as default. This is what I did: mds_nid...@somehwere RAM_DEV=/dev/ram1 dd if=/dev/zero of=$RAM_DEV bs=1M count=1000 mke2fs -O journal_dev -b 4096 $RAM_DEV mkfs.lustre --device-size=$((7*1024*1024*1024)) --ost --fsname=luram --mgsnode=$MDS_NID --mkfsoptions=-E stride=32,stripe-width=256 -b 4096 -j -J device=$RAM_DEV /dev/disk/by-path/... mount -t ldiskfs /dev/disk/by-path/...
/mnt/ost_1 Is there a way to push the bandwidth limit for a single data stream any further? While it could make support more difficult, you could use our DDN Lustre releases: http://eu.ddn.com:8080/lustre/lustre/1.8.3/ddn3.3/ Hope it helps, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Problem with LNET and openibd on Lustre 1.8.4 while rebooting
We then ran into the same problems with openibd hanging on shutdown. After a futile attempt to inject a lustre-unload-modules service between netfs and openib to run lustre_rmmod, I tried to hack modprobe.conf to eject the Lustre modules by inserting this: remove rdma_cm /usr/sbin/lustre_rmmod /sbin/modprobe -r --ignore-remove rdma_cm This didn't work either, because the openibd service script uses rmmod instead of modprobe -r (aargghh). All of that seems to be rather ugly workarounds. I think we need to figure out why rmmod of the InfiniBand modules does not simply fail when they are still in use by Lustre's o2ib module. Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Multi-Role/Tasking MDS/OSS Hosts
On Friday, September 17, 2010, Andreas Dilger wrote: On 2010-09-17, at 12:42, Jonathan B. Horen wrote: We're trying to architect a Lustre setup for our group, and want to leverage our available resources. In doing so, we've come to consider multi-purposing several hosts, so that they'll function simultaneously as MDS & OSS. You can't do this and expect recovery to work in a robust manner. The reason is that the MDS is a client of the OSS, and if they are both on the same node that crashes, the OSS will wait for the MDS client to reconnect and will time out recovery of the real clients. Well, that is some kind of design problem. Even on separate nodes it can easily happen that both MDS and OSS fail, for example on a power outage of the storage rack. In my experience situations like that happen frequently... I think some kind of pre-connection would be required, where a client can tell a server that it was rebooted and that the server shall not wait any longer for it. Actually, it shouldn't be that difficult, as different connection flags already exist. So if the client contacts a server and asks for an initial connection, the server could check for that NID and then immediately abort recovery for that client. Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Multi-Role/Tasking MDS/OSS Hosts
Hello Cory, On 09/17/2010 11:31 PM, Cory Spitz wrote: Hi, Bernd. On 09/17/2010 02:48 PM, Bernd Schubert wrote: On Friday, September 17, 2010, Andreas Dilger wrote: On 2010-09-17, at 12:42, Jonathan B. Horen wrote: We're trying to architect a Lustre setup for our group, and want to leverage our available resources. In doing so, we've come to consider multi-purposing several hosts, so that they'll function simultaneously as MDS & OSS. You can't do this and expect recovery to work in a robust manner. The reason is that the MDS is a client of the OSS, and if they are both on the same node that crashes, the OSS will wait for the MDS client to reconnect and will time out recovery of the real clients. Well, that is some kind of design problem. Even on separate nodes it can easily happen that both MDS and OSS fail, for example on a power outage of the storage rack. In my experience situations like that happen frequently... I think that just argues that the MDS should be on a separate UPS. Well, there is more than a single reason. Another hardware issue is that an IB switch may fail. And we have also seen cascading Lustre failures: it starts with an LBUG on the OSS, which triggers another problem on the MDS... Also, for us this will actually become a real problem which cannot be easily solved, so this issue will become a DDN priority. Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] mixing server versions
On Wednesday, September 15, 2010, Andreas Dilger wrote: On 2010-09-15, at 13:32, Brock Palen wrote: Thanks, that is great to know. Is there much risk to trying 1.8 and then backing off to 1.6 if there are issues? Risk of data loss? We do not test/support formatting at a higher Lustre version and then downgrading below the original version used for formatting. With 1.6-1.4 this definitely did not work, though I'm not sure if there are specific incompatibilities between 1.8-1.6. I go back and forth from 1.8 to 1.6 quite often. The only thing to take care about is to make sure the MDS does not get the extents flag if ext4-ldiskfs is used. I think only beginning with 1.8.4 is that ensured by Lustre itself. Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Cannot get an OST to activate
Assuming the disk really is empty then, and LAST_ID really is zero, shall I then leave it at zero, and follow the recommendation of page 23-14, i.e., just shut down again, delete the lov_objid file on the MDS, and restart the system? Certainly the value at the correct index (29) is definitely hosed: # od -Ax -td8 /mnt/mdt/lov_objid (snip) d0 292648 346413 e068225 -7137254917378053186 f0 5906459607 00010059227 59414 Yes, that is definitely hosed. Deleting the lov_objid file from the MDS and remounting the MDS should fix this value. You could also just binary edit the file and set this to 1. Andreas, Bob, please be very, very careful with lov_objid. As I already wrote last week, I reproducibly get a hard kernel panic when I delete the file and then mount the MDT again. You can try it, but DO CREATE A BACKUP of this file, so that you can copy it back if something goes wrong. Sorry, I don't have the time right now to work on the lov_objid-delete bug, not even time to write a suitable bug report :( Cheers, Bernd
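The binary edit Andreas describes can be rehearsed safely on a scratch file: lov_objid is an array of little-endian u64 values, one per OST index. This sketch plants a bogus value in slot 29 (the hosed index above) and patches it to 1 with dd; on a real MDT, heed Bernd's warning and work only with a backup in hand:

```shell
# Build a scratch 32-slot lov_objid-style file (all zeros).
F=$(mktemp)
dd if=/dev/zero of="$F" bs=8 count=32 2>/dev/null
# Plant a bogus value in slot 29, then patch it to 1 (little-endian u64).
printf '\xff\xff\xff\xff\xff\xff\xff\xff' | dd of="$F" bs=8 seek=29 conv=notrunc 2>/dev/null
printf '\x01\x00\x00\x00\x00\x00\x00\x00' | dd of="$F" bs=8 seek=29 conv=notrunc 2>/dev/null
# Read the slot back: skip 29*8 bytes, decode one unsigned 8-byte value.
VAL=$(od -An -t u8 -j $((29 * 8)) -N 8 "$F" | tr -d ' ')
echo "slot 29 = $VAL"
```

conv=notrunc is what keeps dd from truncating the file after the patched slot, so the remaining OST entries survive intact.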
Re: [Lustre-discuss] Large directory performance
On Saturday, September 11, 2010, Andreas Dilger wrote: On 2010-09-10, at 12:11, Michael Robbert wrote: Create performance is a flat line of ~150 files/sec across the board. Delete performance is all over the place, but no higher than 3,000 files/sec... Then yesterday I was browsing the Lustre Operations Manual and found section 33.8, which says Lustre is tested with directories as large as 10 million files in a single directory and still gets lookups at a rate of 5,000 files/sec. That leaves me wondering 2 things: how can we get 5,000 files/sec for anything, and why is our performance dropping off so suddenly after 20k files? Here is our setup: All IO servers are Dell PowerEdge 2950s, 2 sockets with X5355 @ 2.66GHz and 16GB of RAM. The data is on DDN S2A 9550s with 8+2 RAID configuration connected directly with 4Gb Fibre Channel. Are you using the DDN 9550s for the MDT? That would be a bad configuration, because they can only be configured with RAID-6, and would explain why you are seeing such bad performance. For the MDT you always Unfortunately, we failed to copy the scratch MDT in a reasonable time so far. Copying several hundreds of millions of files turned out to take ages ;) But I guess Mike did the benchmarks for the other filesystem with an EF3010. We have as many as 1.4 million files in a single directory and we now have half a billion files that we need to deal with in one way or another. Mike, is there a chance you can try which rate acp reports? http://oss.oracle.com/~mason/acp/ Also could you please send me your exact bonnie line or script? We could try to reproduce it on an idle test 9550 with a 6620 for metadata (the 6620 is slower for that than the EF3010). Thanks, Bernd -- Bernd Schubert DataDirect Networks
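The create rates being compared can be measured with a few shell lines; pointed at a Lustre-mounted directory this reproduces the files/sec figure discussed above (here it just runs in a local scratch directory). GNU date with nanosecond resolution is assumed:

```shell
# Create N empty files in one directory and report the rate.
DIR=$(mktemp -d)
N=500
START=$(date +%s%N)
i=1
while [ "$i" -le "$N" ]; do : > "$DIR/f$i"; i=$((i + 1)); done
END=$(date +%s%N)
COUNT=$(ls "$DIR" | wc -l)
ELAPSED_NS=$((END - START))
echo "created $COUNT files"
if [ "$ELAPSED_NS" -gt 0 ]; then
    echo "rate: $((COUNT * 1000000000 / ELAPSED_NS)) files/sec"
fi
rm -rf "$DIR"
```

Running it twice with N an order of magnitude apart is a quick way to see the kind of drop-off past ~20k files that Michael reports.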
Re: [Lustre-discuss] Cannot get an OST to activate
On Friday, September 03, 2010, Bob Ball wrote: We added a new OSS to our 1.8.4 Lustre installation. It has 6 OSTs of 8.9TB each. Within a day of having these on-line, one OST stopped accepting new files. I cannot get it to activate. The other 5 seem fine. On the MDS, lctl dl shows it IN, but not UP, and files can be read from it: 33 IN osc umt3-OST001d-osc umt3-mdtlov_UUID 5 However, I cannot get it to re-activate: lctl --device umt3-OST001d-osc activate [...] LustreError: 4697:0:(filter.c:3172:filter_handle_precreate()) umt3-OST001d: ignoring bogus orphan destroy request: obdid 11309489156331498430 last_id 0 Can anyone tell me what must be done to recover this disk volume? Check out section 23.3.9 in the Lustre manual (How to Fix a Bad LAST_ID on an OST). It is on my TODO list to write a tool to automatically correct the lov_objid, but as of now I don't have it yet. Somehow your lov_objid file has a completely wrong value for this OST. Now, when you say files can be read from it, are you sure there are already files on that OST? Because the error message says that the last_id is zero, so you should not have a single file on it. If that is also wrong, you will need to correct it as well. You can do that manually, or you can use a patched e2fsprogs version that will do that for you. Patches are here: https://bugzilla.lustre.org/show_bug.cgi?id=22734 Packages can be found on my home page: http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/e2fsprogs/ If you want to do it automatically, you will need to create an lfsck mdsdb file (the hdr file is sufficient, see the lfsck section in the manual) and then you will need to run e2fsck for that OST as if you want to create an OSTDB file. That will start pass6, and if you then run e2fsck *without* -n, so in correcting mode, it will correct the LAST_ID file to what it finds on disk. With -v it will also tell you the old and the new value, and then you will need to put that value, properly coded, into the MDS lov_objid file.
Be careful and create backups of the lov_objid and LAST_ID files. Hope it helps, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Cannot get an OST to activate
On Friday, September 03, 2010, Bernd Schubert wrote: [...] That will start pass6, and if you then run e2fsck *without* -n, so in correcting mode, it will correct the LAST_ID file to what it finds on disk. With -v it will also tell you the old and the new value, and then you will need to put that value, properly coded, into the MDS lov_objid file. Update for the lov_objid file: actually, if you rename or delete it (rename it, please, so that you have a backup), the MDS should be able to re-create it from the OSTs' LAST_ID data. So if the troublesome OST has no data yet, it will be very easy; if it already has data, you will need to correct the LAST_ID on that OST first. Cheers, Bernd -- Bernd Schubert DataDirect Networks
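The rename-instead-of-delete advice can be sketched like this, demonstrated on a scratch directory standing in for the MDT mounted as type ldiskfs (the paths are stand-ins; the real lov_objid sits in the root of the ldiskfs-mounted MDT). Keep the backup until the MDS has verifiably recreated the file from the OSTs' LAST_ID values.

```shell
mdt=/tmp/fake-mdt                  # stands in for the ldiskfs-mounted MDT
mkdir -p "$mdt"
printf 'dummy' > "$mdt/lov_objid"  # fake content for the demo
# backup FIRST, then rename out of the way; the MDS recreates it on mount
cp -p "$mdt/lov_objid" "$mdt/lov_objid.bak.$(date +%Y%m%d)"
mv "$mdt/lov_objid" "$mdt/lov_objid.renamed"
ls "$mdt"
```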
Re: [Lustre-discuss] MDT backup (using tar) taking very long
On Thursday, September 02, 2010, Frederik Ferner wrote: Hi list, we are currently reviewing our backup policy for our Lustre file system, as backups of the MDT are taking longer and longer. Yes, that is due to the size-on-mds feature, which was introduced in 1.6.7.2. See bug https://bugzilla.lustre.org/show_bug.cgi?id=21376 It has a patch that also got accepted in upstream tar last week. You may find updated RHEL5 tar packages on my home page: http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/ Cheers, Bernd -- Bernd Schubert DataDirect Networks
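For reference, the file-level MDT backup this thread is about usually follows the pattern below (as described in the Lustre manual: save the extended attributes, then tar with --sparse, since MDT inodes carry sizes but no file data). The sketch runs on a scratch tree so it is safe to try; on a real system you would first mount the MDT device, or an LVM snapshot of it, as type ldiskfs, and the EA step (commented out) needs the attr tools.

```shell
src=/tmp/fake-mdt-tree
mkdir -p "$src/ROOT"
# a 1MB sparse file stands in for an MDT inode with size-on-mds but no data
dd if=/dev/zero of="$src/ROOT/sparse-file" bs=1 count=1 seek=1048575 2>/dev/null
cd "$src"
# getfattr -R -d -m '.*' -e hex -P . > /tmp/ea.bak   # EA step on a real MDT
tar --sparse -czf /tmp/mdt-backup.tgz .
tar -tzf /tmp/mdt-backup.tgz
```

Restoring is the reverse: untar onto the new device, then setfattr from the saved EA file.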
Re: [Lustre-discuss] brief 'hangs' on file operations
On Thursday, September 02, 2010, Andreas Dilger wrote: On 2010-09-02, at 06:43, Tina Friedrich wrote: Causing the most grief at the moment is that we sometimes see delays writing files. From the writing client's end, it simply looks as if I/O stops for a while (we've seen 'pauses' of anything up to 10 seconds). This appears to be independent of which client does the writing, and of the software doing the writing. We investigated this a bit using strace and dd; the 'slow' calls appear to always be either open, write, or close calls. Usually, these take well below 0.001s; in around 0.5% or 1% of cases, they take up to multiple seconds. It does not seem to be associated with any specific OST, OSS, client or anything; there is nothing in any log files or any exceptional load on the MDS or OSSes or any of the clients. This is most likely associated with delays in committing the journal on the MDT or OST, which can happen if the journal fills completely. Having larger journals can help, if you have enough RAM to keep them all in memory and not overflow. Alternately, if you make the journals small it will limit the latency, at the cost of reducing overall performance. A third alternative might be to use SSDs for the journal devices. As Diamond uses DDN hardware, it should help in general with performance to update to 1.8 and to enable the async journal feature. I guess it also might help to reduce those delays, as writes are more optimized. A question, though. Tina, do you use our DDN udev rules, which tune the devices for optimized performance? If not, please send a mail to supp...@ddn.com and ask for a recent udev rpm (available for RHEL5 only so far; it also *might* work on SLES11, but udev syntax changes too often, IMHO). And please put [lustre] into the subject line, as the Lustre team maintains them. Cheers, Bernd
Re: [Lustre-discuss] MDT backup (using tar) taking very long
On Thursday, September 02, 2010, Frederik Ferner wrote: Bernd Schubert wrote: On Thursday, September 02, 2010, Frederik Ferner wrote: we are currently reviewing our backup policy for our Lustre file system as backups of the MDT are taking longer and longer. Yes, that is due to the size-on-mds feature, which was introduced in 1.6.7.2. See bug https://bugzilla.lustre.org/show_bug.cgi?id=21376 It has a patch that also got accepted in upstream tar last week. You may find updated RHEL5 tar packages on my home page: Thanks, I'll give that a go. (Any chance of adding the SRPM to your download page?) I don't like SRPMs too much, so I uploaded a tar.bz2 instead. It is a hg repository and mq-managed, so patches (including those Red Hat added) are in .hg/patches. You will find another bugfix there compared to the -sun packages. Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] blk_rq_check_limits errors
On Thursday, September 02, 2010, Frank Heckes wrote: Hi all, on some of our OSSes a massive amount of errors like: Sep 2 20:28:15 jf61o02 kernel: blk_rq_check_limits: over max size limit. is appearing in /var/log/messages (and dmesg). Does anyone have a clue how to get at the root cause? Many thanks in advance. From linux/block/blk-core.c:

int blk_rq_check_limits(struct request_queue *q, struct request *rq)
{
	if (rq->cmd_flags & REQ_DISCARD)
		return 0;

	if (blk_rq_sectors(rq) > queue_max_sectors(q) ||
	    blk_rq_bytes(rq) >> 9 > queue_max_hw_sectors(q)) {
		printk(KERN_ERR "%s: over max size limit.\n", __func__);
		return -EIO;
	}

I haven't seen that before, but if I should guess, I would guess that dm-* has a larger queue than your underlying block device. If that is with your DDN storage, can you verify whether all those devices have max_sectors_kb tuned to the maximum? Also, does that come up with 1.8.4 only? (I have SG_ALL in mind, which was increased from 255 to 256, which might not be supported by all SCSI host adapters.) Cheers, Bernd
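The max_sectors_kb check suggested above can be sketched as follows. The sysfs root is parameterized only so the logic can be demonstrated here against a fake tree; on a real OSS you would call it as root with /sys, and raising the soft limit to the hardware limit is the usual DDN tuning mentioned in the thread.

```shell
tune_max_sectors() {   # $1 = sysfs root (/sys on a live system)
    for q in "$1"/block/*/queue; do
        [ -e "$q/max_sectors_kb" ] || continue
        cur=$(cat "$q/max_sectors_kb")
        hw=$(cat "$q/max_hw_sectors_kb")
        if [ "$cur" -lt "$hw" ]; then
            echo "$hw" > "$q/max_sectors_kb"   # lift soft limit to hw limit
            echo "$(basename "$(dirname "$q")"): $cur -> $hw"
        fi
    done
}
# dry run against a fake sysfs tree; on an OSS: tune_max_sectors /sys
root=/tmp/fakesys
mkdir -p "$root/block/sdb/queue"
echo 512   > "$root/block/sdb/queue/max_sectors_kb"
echo 32767 > "$root/block/sdb/queue/max_hw_sectors_kb"
tune_max_sectors "$root"
# prints: sdb: 512 -> 32767
```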
Re: [Lustre-discuss] MDS memory usage
Hello Frederik, On Wednesday, August 25, 2010, Frederik Ferner wrote: Hi Bernd, thanks for your reply. Bernd Schubert wrote: On Tuesday, August 24, 2010, Frederik Ferner wrote: on our MDS we noticed that all memory seems to be used. (And it's not just normal buffers/cache as far as I can tell.) When we put load on the machine, for example by starting rsync on a few clients, generating file lists to copy data from Lustre to local disks, or just running an MDT backup locally using dd/gzip to copy an LVM snapshot to a remote server, kswapd starts using a lot of CPU time, sometimes up to 100% of one CPU core. This is on a Lustre 1.6.7.2.ddn3.5 based file system with about 200TB; the MDT is 800GB with 200M inodes, ACLs enabled. Did you recompile it, or did you use the binaries from my home page (or those you got from CV)? This is a recompiled Lustre version to include the patch from bug 22820. Possibly it is an LRU auto-resize problem, which has been disabled in DDN builds. As our 1.6 releases didn't include a patch for that, you would need to specify the correct configure options if you recompiled it. I guess it's likely that I have not specified the correct option. So the binaries on your home page are compiled with '--disable-lru-resize'? Any other options that you used? I always enable health-write, which will help pacemaker to detect IO errors (by monitoring /proc/fs/lustre/health_check): --enable-health-write Another reason might be bug 22771, although that should only come up on an MDS with more memory than you have. I had a look at that bug, and while we have a default stripe count of 1, so the striping should fit into the inode, on the other hand we use ACLs in quite a few places, so it seems we might hit this bug if we increase the memory from the current 16GB, correct? Yeah, and I think 16GB should be sufficient for the MDS.
-- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] MDS memory usage
On Tuesday, August 24, 2010, Frederik Ferner wrote: Hi list, on our MDS we noticed that all memory seems to be used. (And it's not just normal buffers/cache as far as I can tell.) When we put load on the machine, for example by starting rsync on a few clients, generating file lists to copy data from Lustre to local disks, or just running an MDT backup locally using dd/gzip to copy an LVM snapshot to a remote server, kswapd starts using a lot of CPU time, sometimes up to 100% of one CPU core. This is on a Lustre 1.6.7.2.ddn3.5 based file system with about 200TB; the MDT is 800GB with 200M inodes, ACLs enabled. Did you recompile it, or did you use the binaries from my home page (or those you got from CV)? Possibly it is an LRU auto-resize problem, which has been disabled in DDN builds. As our 1.6 releases didn't include a patch for that, you would need to specify the correct configure options if you recompiled it. Another reason might be bug 22771, although that should only come up on an MDS with more memory than you have. Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Fwd: Lustre and Large Pages
Last week there was an article on lwn.net about transparent hugepages, discussed during the fourth Linux storage and filesystem summit. According to that article, we might be lucky and those patches might go into RHEL6. If you do not have an lwn.net account, you might need to wait a few weeks: http://lwn.net/Articles/398846/ It links an older article about it, which should already be available to everyone: http://lwn.net/Articles/359158/ And another one: http://lwn.net/Articles/374424/ Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Question on setting up fail-over
On Tuesday, August 10, 2010, David Noriega wrote: So your script resets the server so there is no fail-over (i.e. the other server takes over resources from that server?), or there is failover but you then manually return resources back to the server that was reset? Our DDN ipmi stonith script (external/ipmi_ddn in heartbeat/pacemaker stonith terms) only makes absolutely sure the node was really reset. If something fails, an error code is reported to pacemaker, and then pacemaker (*) will not initiate resource fail-over, in order to prevent split-brain. As Lustre devices use MMP (multiple-mount protection), that is not strictly required in principle. But if something goes wrong, e.g. MMP was accidentally not enabled, a double mount could come up, and that would cause serious filesystem and data corruption... Cheers, Bernd PS: (*) heartbeat-v1 (and v2/v3 if not in xml/crm mode) also *should* accept stonith error codes, but I have seen it more than once that heartbeat-v1 ran into split-brain and started resources on both cluster nodes. That is something where pacemaker does a much better job. -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] How to achieve 20GB/s file system throughput?
On Saturday, July 24, 2010, henry...@dell.com wrote: Hello, one of my customers wants to set up an HPC system with thousands of compute nodes. The parallel file system should have 20GB/s throughput. I am not sure whether Lustre can make it. How many IO nodes are needed to achieve this target? My assumption is that 100 or more IO nodes (rack servers) are needed. I'm a bit prejudiced, of course, but with DDN storage that would be quite simple. With the older DDN S2A 9900 you can get 5GB/s per controller pair; with the newer SFA10K you can get 6.5 to 7GB/s (we are still tuning it) per controller pair. Each controller pair (couplet in DDN terms) usually has 4 servers connected and fits into a single rack in a 300-drive configuration. So you can get 20GB/s with 3 or 4 racks and 12 or 16 OSS servers, which is much below your 100 IO nodes ;) Cheers, Bernd -- Bernd Schubert DataDirect Networks
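The sizing above works out as a simple division; the sketch below makes the assumed inputs explicit (~6.5 GB/s per couplet and 4 OSS servers per couplet, both taken from the mail, not authoritative figures).

```shell
# Back-of-envelope OSS sizing for a 20 GB/s target.
target_gbs=20
per_couplet_tenths=65   # 6.5 GB/s per couplet, kept in tenths for integer math
couplets=$(( (target_gbs * 10 + per_couplet_tenths - 1) / per_couplet_tenths ))
echo "$couplets couplets -> $((couplets * 4)) OSS servers"
# prints: 4 couplets -> 16 OSS servers
```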
Re: [Lustre-discuss] NFS Export Issues
On Tuesday, July 20, 2010, William Olson wrote: As far as I remember, we had to explicitly align the mounting user uids and gids. So the mounting uid:gid must be known (/etc/passwd and groups, I think) on the MDS and allowed to mount stuff. Could it be a root-squash problem by accident? I've tried it explicitly with the no_root_squash option and it still behaves the same way. What I find really frustrating is that if I unmount Lustre, I can mount the same NFS export, no problems. As soon as Lustre is mounted to that directory, I can no longer mount that NFS export. I don't understand where it's failing. Thanks for all your help so far, any more ideas? Have you tried adding fsid=xxx to your exports line? I think with recent Lustre versions (I don't remember the implementation details) it should not be required any more, nor should it be with recent nfs-utils and util-linux (the filesystem uuid is automatically used with those, instead of the device major/minor as fsid), but maybe both types of workaround conflict on your system? You might also consider simply using unfs3, although performance will be limited to about 120MB/s, as unfs3 is only single-threaded. It also does not support NFS locks. If it still does not work out, you should enable Lustre debugging and NFS debugging, and you probably should use wireshark to see what is going on. Hope it helps, Bernd -- Bernd Schubert DataDirect Networks
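For illustration, a hypothetical /etc/exports line with an explicit fsid as suggested above; the path, client network, and the fsid value 17 are placeholders (fsid just has to be unique among the exports on that server):

```
# /etc/exports on the NFS gateway re-exporting the Lustre mount
/mnt/lustre  192.168.0.0/24(rw,no_root_squash,fsid=17,sync)
```

After editing, `exportfs -ra` reloads the table.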
Re: [Lustre-discuss] I/O error on clients
On Tuesday, July 20, 2010, Christopher J. Morrone wrote: On 07/07/2010 01:04 AM, Gabriele Paciucci wrote: Hi, the ptlrpc bug is a problem, but I don't find in Peter's logs any reference to an eviction caused by the ptlrpc, but instead to a timeout during the communication between an OST and the client. But Peter could downgrade to 1.8.1.1, which does not suffer from the problem. The bug that I describe does not have any messages about the ptlrpcd performing evictions. The server's "I think it's dead, and I am evicting it" and other messages about the server timing out on the client are the only messages that you will see with the bug that I described. But like I said, there are many possible reasons for timeouts, so it could easily be something else. For the record, while stress testing Lustre I can easily reproduce evictions with any Lustre version. However, it is terribly difficult to debug without additional tools. I have opened a bugzilla for that, but I don't think I will have time for those tools any time soon. https://bugzilla.lustre.org/show_bug.cgi?id=23190 -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] tunefs.lustre --print fails on mounted mdt/ost with mmp
On Wednesday, July 14, 2010, Andreas Dilger wrote: On 2010-07-14, at 13:29, Nate Pearlstein wrote: Just checking to be sure this isn't a known bug or problem. I couldn't find a bz for this, but it would appear that tunefs.lustre --print fails on a Lustre mdt or ost device if mounted with MMP. Is this expected behavior? Not really expected... It is reading the mountdata file via debugfs, so that should be safe even on a mounted filesystem, but it doesn't work with MMP: # debugfs -c -R stats /dev/vgbackup/lvtest debugfs 1.41.10.sun2 (24-Feb-2010) /dev/vgbackup/lvtest: MMP: device currently active while opening filesystem stats: Filesystem not open This is already fixed in our next release of e2fsprogs, however, thanks to a patch from Jim Garlick @ LLNL. It is the first hunk of the patch at: https://bugzilla.lustre.org/attachment.cgi?id=30441 Actually, that is not required, as debugfs opens the device read-only. I would recommend the first patch from https://bugzilla.lustre.org/show_bug.cgi?id=22421 which also allows running other e2fs tools, such as e2fsck -n, dumpe2fs -h and tunefs.lustre --print, on mounted devices. Cheers, Bernd
Re: [Lustre-discuss] files missing after writeconf
If the device really has been reformatted and the data is very important, there would also be the option to follow the recovery procedure we went through last year, after an MDT was accidentally reformatted. It took several months, also because I was busy with lots of parallel tasks; in the end it was mostly successful. Not perfect, but at least a big part of the directory structure could be recovered (also thanks to helpful discussions with Andreas). I probably should write a document on what needs to be done and publish the tools. But even with that it will be time consuming, although with the existing tools it will take much less time than last time... Cheers, Bernd On Friday, July 09, 2010, Andreas Dilger wrote: Unmount the MDS and mount it as type ldiskfs and list the ROOT directory. If there are no files there, then it seems that somehow you have deleted or reformatted the MDS filesystem. You could also check lost+found at that point, in case your files were moved by e2fsck for some reason. Check 'dumpe2fs -h' on the MDS device to see what the format time is. If there are no more files on the MDS, then the best you can do is to run lfsck and link all the orphan objects into the Lustre lost+found dir and look at the file contents to identify them. If you have a backup, it would be easier to just restore from that. Sorry. Cheers, Andreas On 2010-07-08, at 19:34, David Gucker dguc...@choopa.com wrote: When bringing up the cluster after a full powerdown, the MDS/MGS node was reporting the following for each of the OSTs: Jul 8 17:16:18 ID6317 kernel: LustreError: 13b-9: Test01-OST claims to have registered, but this MGS does not know about it, preventing registration. Jul 8 17:16:18 ID6317 kernel: LustreError: 26184:0:(mgs_handler.c:660:mgs_handle()) MGS handle cmd=253 rc=-2 I have two OSSes and checked back to my mkfs commands, and it looks like I forgot to enable failover in the options. So I found that I could update that flag using tunefs.lustre.
Looking into that a bit, I found that I should run it with the --writeconf flag as well. So, I unmounted the OSTs and ran: tunefs.lustre --param failover.mode=failout /dev/iscsi/ost-1.target0 on each of them. After doing this (and maybe remounting the mds/mgs), I was able to mount the OSTs, and then mounted the client, but all data was missing. The filesystem reports 11% full, which is about right for the data that was on there, but no files. After reading the docs a bit better, I found that I should have done things more properly (fully shut down and unloaded the filesystem, then done the writeconf beginning with the mgs). So I tried running through the procedure a little better, and the filesystem is in the same state (appears to be fine, just shows used space and no files). I was unable to recreate this in another test cluster (no data loss). So, I'm wondering if these files are recoverable at all? Can anyone point me in the right direction, if there is one? Dave
Re: [Lustre-discuss] short writes
On Thursday, July 08, 2010, Brian J. Murrell wrote: On Thu, 2010-07-08 at 07:53 -0600, Kevin Van Maren wrote: Hi David, Hey Kevin, http://www.opengroup.org/onlinepubs/95399/functions/write.html Heh. Funny enough, I was reading the exact same URL. I always thought libc should handle the retry for you by default, but I didn't write the spec. write(2) is a system call, not a libc function. fwrite(3) is a comparable libc function, so libc might be able to handle short write(2)s in fwrite(3), but really it should not (IMHO) be mucking with write(2) (or any other) system calls. You have to keep in mind that Gaussian is a Fortran application. Fortran has its own IO library, and it is quite possible that the library of some compilers can handle a short write while the library of other compilers cannot... In my university group we had to deal with quite some weird effects between Fortran IO implementations... David, did you use PGI or another compiler? Last time I had to deal with Gaussian, only PGI was supported, but I have not checked for recent Gaussian versions. Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] How to determine which lustre clients are loading filesystem.
On 07/08/2010 11:21 PM, Andreas Dilger wrote: On 2010-07-08, at 14:01, Guy Coates wrote: Try this script (it is from Bernd Schubert); it will parse the per-client proc stats on the mds/oss into something nice and humanly readable. It is very useful. I'm not sure I'd quite call it human readable, but it does show that there is a need for something to print out stats for all of the clients. Yeah, I agree, it is not perfect yet. Especially, it needs to be sorted by the clients doing the most IO. That shouldn't be too difficult with the existing script. [...] Bernd, would you (or anyone) be interested in enhancing those tools to be able to show stats data from multiple files at once (each prefixed by the device name and/or client NID)? I don't think it makes sense to create separate tools for this. I'm not sure the existing Lustre tools are really what we need. If you have a cluster with 200 or more clients and then want to figure out which clients are doing the most IO, several lines per client provide too much output. One line, sorted by IO, seems better, IMHO. I would be interested in enhancing the existing tools, but if I look at the number of open bugs I have, several of those have a higher priority (btw, this script is on my bug list (bug 22469)). Additionally, at least for the next couple of weeks, my time is very, very limited, as I have to finish my thesis. Cheers, Bernd -- Bernd Schubert DataDirect Networks
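The "one line per client, sorted by IO" idea can be sketched like this. The stats layout assumed here (one directory per client NID, with the byte total as the last field of the write_bytes line) mirrors the 1.8 per-export stats files, but treat it as an assumption; the sketch builds fake data so it can run anywhere, and on a real server you would point it at something like /proc/fs/lustre/obdfilter/*/exports instead.

```shell
exports=/tmp/fake-exports          # stand-in for .../obdfilter/*/exports
mkdir -p "$exports/10.0.0.1@tcp" "$exports/10.0.0.2@tcp"
echo 'write_bytes 10 samples [bytes] 4096 1048576 500000' \
    > "$exports/10.0.0.1@tcp/stats"
echo 'write_bytes 90 samples [bytes] 4096 1048576 9000000' \
    > "$exports/10.0.0.2@tcp/stats"
# one line per client: total write bytes, busiest client first
for d in "$exports"/*/; do
    nid=$(basename "$d")
    bytes=$(awk '/^write_bytes/ {print $NF}' "$d/stats")
    printf '%s %s\n' "${bytes:-0}" "$nid"
done | sort -rn
```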
Re: [Lustre-discuss] Max bandwidth through a single 4xQDR IB link?
Hello Ashley, hello Kevin, I really see no point in using disks to benchmark performance when lnet_selftest exists. The benchmark order should be:
- test how much the disks can provide
- test the network with lnet_selftest
=> make sure Lustre performance is not much below min(disks, lnet_selftest)
Cheers, Bernd On Tuesday, June 29, 2010, Kevin Van Maren wrote: DAPL is a high-performance interface that uses a small shim to provide a common DMA API on top of (in this case) the IB verbs layer. In general, there is a very small performance impact to be able to use the common API, so you will not get more large-message bandwidth using native IB verbs. I've never had enough disk bandwidth behind a node to saturate a QDR IB link, so I'm not sure how high LNET will go. If you have an IB test cluster, you should be able to measure the upper limits by creating an OST on a loopback device on tmpfs, although you have to ensure the client-side cache is not skewing your results (hint: boot clients with something like mem=1g to limit the RAM they can use for the cache). While the QDR IB link bandwidth is 4GB/s (or around 3.9GB/s with 2KB packets), the maximum HCA bandwidth is normally around 3.2GB/s (unidirectional), due to the PCIe overhead of breaking the transaction into (relatively) small packets and managing the packet flow control/credits. This is independent of the protocol, and limited by the PCIe Gen2 x8 interface. You will see somewhat higher bandwidth if your system supports and uses a 256-byte MaxPayload, rather than 128 bytes. Use lspci to see what your system is using, as in: lspci -vv -d 15b3: | grep MaxPayload Kevin Ashley Pittman wrote: Hi, could anyone confirm to me the maximum achievable bandwidth over a single 4xQDR IB link into an OSS node? I have many clients doing a write test over IB and want to know the maximum bandwidth we can expect to see for each OSS node.
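The lnet_selftest step recommended above looks roughly like the session below, following the pattern in the Lustre manual. The NIDs, group names, and test sizes are placeholders; adapt them to your fabric, and this obviously only runs on nodes with the lnet_selftest module loaded.

```
modprobe lnet_selftest
export LST_SESSION=$$
lst new_session bw_test
lst add_group clients 10.0.0.[2-5]@o2ib
lst add_group servers 10.0.0.1@o2ib
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from clients --to servers brw write size=1M
lst run bulk_rw
sleep 30; lst stat clients servers
lst end_session
```

The `lst stat` output gives the sustained LNET bandwidth between the groups, which is the number to compare against your disk and Lustre results.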
For MPI over these links we see between 3 and 3.5GB/s, but I suspect Lustre is capable of more than this because it's not using DAPL; is this correct? Ashley. -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Can't put file on specific device or see it it in lfs df -h
Hi Katya! On Tuesday, June 22, 2010, Katya Tutlyaeva wrote: Hi everybody! Of course, these devices are successfully mounted on the OSS. When I move them using hb_takeover to another OSS (even if I move all devices, including the MDT, to the second OSS, or move these non-working devices to the first OSS), the first two OSTs remain up and accessible, while the second two are still N/A in df -h and for file striping. Please tell me if I am missing something... Can you post the output of 'lfs check servers' on the client side? Looking forward to your advice! Difficult to say anything without log files. Bye, Bernd
Re: [Lustre-discuss] Using brw_stats to diagnose lustre performance
On Tuesday, 15 June 2010, Kevin Van Maren wrote: Life is much easier with a 1MB (or 512KB) native raid stripe size. It looks like most IOs are being broken into 2 pieces. See https://bugzilla.lustre.org/show_bug.cgi?id=22850 for a few tweaks that would help get IOs > 512KB to disk. See also Bug I played with a similar patch some time ago (blkdev defines), but didn't notice any performance improvements on the DDN S2A 9900. Before increasing those values I got up to 7M IOs; after doubling MAX_HW_SEGMENTS and MAX_PHYS_SEGMENTS, max IOs doubled to 14M. Unfortunately, more IOs in between the magic good IO sizes also came up (magic good here: 1, 2, 3, ..., 14), e.g. lots of 1008 or 2032, etc. Example numbers from a production system (counts in hex):

Length   Port 1            Port 2            Port 3            Port 4
Kbytes   Reads    Writes   Reads    Writes   Reads    Writes   Reads    Writes
960      1DCD     2EEB     1E44     3532     1431     1D7E     14FB     2284
976      1ACD     34AC     1A0F     48EB     12E2     24AE     11E1     257F
992      1D46     3787     1CA7     51EB     144C     2E9B     1354     3A62
1008     100A5    11B5C    10391    13765    A9B8     FBED     9E9A     D457
1024     BFD41D   111F3C4  BFBE47   11A110D  8C316B   C95178   8E5A9F   C83850
1040     583      625      538      6C3      3F3      513      413      337
...
2032     551      1260     50D      136B     3E4      1218     3C8      BA1
2048     41B85FDB213B8D1 10185731088B78E02C4A592F48
2064     FB       20       108      24       BE       19       C7       10
2080     E3       2F       E6       37       AA       44       C7       1B
...
7152     55       6C7      58       80C      60       70D      3F       3B4
7168     449F     E335     417C     E743     3332     AB34     3686     A568
7184     291      142      191      140

I don't think it matters for any storage system whether max IO is 7M or 14M, but those sizes in between are rather annoying. And from the output of brw_stats I *sometimes* have no idea how that can happen. On the particular system I took the numbers from, users mostly don't do streaming writes, so there the reason is clear. After tuning the FZJ system (Kevin should know that system), the SLES11 kernel with chained scatter-gathering (so the blkdev patch is mostly not required anymore) can do IO sizes up to 12MB. Unfortunately, there are also quite some 1008s again, out of the blue, without an obvious reason (during my streaming writes with obdecho).
Cheers, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Lustre 1.8.3 on kernel 2.6.22
Hello Jonas, On Monday 07 June 2010, Jonas Ambrus wrote: Hi guys, I tried to compile Lustre 1.8.3 on kernel 2.6.22 (vanilla config). The configure script of Lustre works fine, but when I try to build Lustre it fails for the following reason: ___ Applying ext3-big-endian-check-2.6.22-vanilla 1 out of 5 hunks FAILED -- saving rejects to file fs/ext3/super.c.rej Patch ext3-big-endian-check-2.6.22-vanilla does not apply (enforce with -f) ... ___ I already tried to force it, but after that it's just a big mess and doesn't help ;) Do you have any suggestion to solve this problem? Background: I want to compile it because I also need a KVM module in my kernel. I'm using CentOS 5.4 but I'm also open to any other solution. The CentOS 5.4 kernel (2.6.18-164) has a far more up-to-date KVM implementation than 2.6.22 has, so I really see no reason why you would want to try 2.6.22. Cheers, Bernd
Re: [Lustre-discuss] Selective e2fsprogs installation on Ubuntu
as a non-existent program dumpe2fs - useful but not essential? Please tell me if I am missing/misunderstanding something? Cheers, Andreas -- Andreas Dilger Lustre Technical Lead Oracle Corporation Canada Inc. -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Failed OST Cleanup
On Wednesday 02 June 2010, Andreas Dilger wrote: On 2010-06-02, at 11:54, Scott Barber wrote: I'm now trying to get a list of files that are now corrupt. On one of the lustre clients I'm running: lfs find --obd sanvol06-OST0013_UUID my lustre mount point It starts to list files and then a few minutes later it runs into an error and stops: cb_find_init: IOC_LOV_GETINFO on filename failed: Input/output error. In dmesg I see: LustreError: 13926:0:(file.c:1053:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO The file that gets that Input/output error cannot be deleted or removed from the file system. How can I get around this? There is a bug in lfs find in that it tries to get the file size unnecessarily. You can use lfs getstripe --obd ... instead, and it should work even if the OST is down. Hmm, yes and no. In principle I like the idea that lfs find tries to figure out the file size. A couple of years ago I had to deal with a 3-disk failure in a raid6, and although we tried to clone the 3rd failing disk, in the end we lost that OST. A stripe size of 4M and a stripe count of 4 were configured. When I then ran 'lfs find' to find files located on that OST, it reported lots of files that *would* have had data on that OST if the files had been sufficiently large. But lots of files were smaller than 1M, and so it would have been wrong to delete those files. It turned out that 'lfs find' was rather useless for us and I simply had to read each file - if the read succeeded all was fine, if it failed I moved it into a dedicated subdirectory. The missing OST was later recreated (that was easier back then with 1.4 than nowadays) and we only lost a small part of the files, definitely much fewer than what 'lfs find' suggested. So if 'lfs find' now used the file size to determine whether a file is really located on an OST, that would be an improvement.
Of course, if it fails at all with an IO error, it is also not useful ;) Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Lustre and Automount
On Thursday 27 May 2010, Fraser McCrossan wrote: David Simas wrote: On Thu, May 27, 2010 at 12:50:15PM -0600, Andreas Dilger wrote: There have been some reports of problems with automount and Lustre that have never been tracked down. If someone with automount experience and config, and time to track this down, could investigate, I'm sure we could work it out. The autofs that comes with RHEL5 won't mount Lustre. We got it working with autofs-5.0.3-36, from some version of Fedora. Later versions of autofs should also work. The fix to autofs is almost trivial. A function scans the mount command to make sure it contains just legal characters. To get Lustre to automount, it needs @ in its list of such. We did find that the automounter sometimes failed to remove the Lustre record from /etc/mtab on unmounting the file system. That would cause subsequent remounts to fail. Another easy fix: link /etc/mtab to /proc/mounts. Also, remember that you can't mount Lustre subdirectories. That is, you can mount your Lustre filesystem as, say, /home, but you can't mount /home/username. Without having checked the code, I think it should be fairly simple to add support for that in mount.lustre:
- Cut off the directory from the filesystem name
- Mount Lustre into a temporary directory
- Bind mount Lustre into the target directory
The most difficult part will be to write a single entry into /etc/mtab. An approach that we are testing (but haven't tried in production yet) was suggested by an earlier post from Andreas Dilger, and involves two automounts. The first mounts the base Lustre filesystem(s) somewhere (say, /lustre) as a direct mount; the second is /etc/auto.home and looks like this: *-bind:/lustre/ In our case we have an executable map that generates the mount line based on the contents of the description field in LDAP (which indicates which Lustre FS contains the home - we have two), but the principle is the same. That should work.
The disadvantage of bind mounts is that 'lfs' does not recognize it as type lustre and therefore all those nice lfs subcommands will not work. Cheers, Bernd -- Bernd Schubert DataDirect Networks
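The two-level automount scheme described above (a direct mount of the base Lustre filesystem plus per-user bind mounts) might be sketched with maps like the following. All names, NIDs, and paths here are hypothetical, since the exact map line in the original mail was mangled; the files are written into a scratch directory so the sketch can be inspected safely:

```shell
# Hypothetical reconstruction of the two-automount setup: a direct
# mount of the base Lustre fs, then per-user bind mounts out of it.
# Server NID (mds1@tcp0), fs name (scratch) and paths are made up.
dir=$(mktemp -d)

cat > "$dir/auto.master" <<'EOF'
/-      /etc/auto.lustre
/home   /etc/auto.home
EOF

cat > "$dir/auto.lustre" <<'EOF'
/lustre -fstype=lustre  mds1@tcp0:/scratch
EOF

# '&' expands to the matched key, i.e. the username being looked up
cat > "$dir/auto.home" <<'EOF'
*       -fstype=bind    :/lustre/&
EOF

grep -c bind "$dir/auto.home"   # -> 1
```

In production the maps would of course live in /etc (or come from LDAP/NIS, as in Fraser's executable-map variant).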
Re: [Lustre-discuss] Future of lustre 1.8.3+
On Wednesday 26 May 2010, Guy Coates wrote: On 26/05/10 17:25, Ramiro Alba Queipo wrote: On Wed, 2010-05-26 at 16:48 +0100, Guy Coates wrote: One thing to watch out for in your kernel configs is to make sure that: CONFIG_SECURITY_FILE_CAPABILITIES=N OK. But the question is if this issue still applies for lustre-1.8.3 and SLES kernel linux-2.6.27.39-0.3.1.tar.bz2. I mean, it is quite surprising that, if this problem persists, Oracle is offering lustre packages for SLES11 with CONFIG_SECURITY_FILE_CAPABILITIES=y ??? I am just about to start testing, so I'd like to clarify this. The binary SLES packages are fine; it is the source packages that may be problematic, depending on your config. There is a bug filed against this. Sorry Guy. Maybe there is something I am missing, but the SLES11 rpm kernel server packages for lustre-1.8.3 are created using a config with CONFIG_SECURITY_FILE_CAPABILITIES=y (see the attachment). You are entirely correct. To be clear here, this is a CLIENT side issue, so whatever you set on the server side is irrelevant. Oracle cannot set kernel options for clients, as those are upstream kernels and lustre is compiled patchless against them. Applying this Lustre patch from bugzilla#15587 should solve the issue without the need to recompile the kernel: https://bugzilla.lustre.org/attachment.cgi?id=29116 Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Lustre and Automount
On Thursday 27 May 2010, David Simas wrote: On Thu, May 27, 2010 at 12:50:15PM -0600, Andreas Dilger wrote: There have been some reports of problems with automount and Lustre that have never been tracked down. If someone with automount experience and config, and time to track this down, could investigate, I'm sure we could work it out. The autofs that comes with RHEL5 won't mount Lustre. We got it working with autofs-5.0.3-36, from some version of Fedora. Later versions of autofs should also work. The fix to autofs is almost trivial. A function scans the mount command to make sure it contains just legal characters. To get Lustre to automount, it needs @ in its list of such. I doubt that you cannot get it working. At my previous job we used a NIS-based automounter for almost everything, including Lustre. All based on Debian, so I'm not absolutely sure about RedHat. However, it already worked with the very old Debian Sarge. The simple trick we had to do was to escape the @, i.e. to use \@. We did find that the automounter sometimes failed to remove the Lustre record from /etc/mtab on unmounting the file system. That would cause subsequent remounts to fail. Another easy fix: link /etc/mtab to /proc/mounts. That also happens sometimes without the automounter. Cheers, Bernd -- Bernd Schubert DataDirect Networks
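A map entry with the escaping Bernd describes might look like the following (the NID and filesystem name are invented for illustration; only the \@ trick is from the thread):

```shell
# Hypothetical automounter map entry: the '@' in the Lustre NID is
# escaped as '\@' so autofs's legal-character check accepts it.
entry='scratch -fstype=lustre mds1\@tcp0:/scratch'
echo "$entry"
```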
Re: [Lustre-discuss] Future of lustre 1.8.3+
On Wednesday 19 May 2010, Heiko Schröter wrote: On Wednesday, 19 May 2010 at 10:33:04, you wrote: On 2010-05-19, at 01:40, Heiko Schröter wrote: we would like to know which way lustre is heading. From the s/w repository we see that only the Red Hat and SUSE distros seem to be supported. Is this the official policy of the lustre development to stick to (only) these two distros? On the client side, we will support the main distros that our customers are using, namely RHEL/OEL/CentOS 5.x (and 6.x after release), and SLES 10/11. We make a best-effort attempt to have the client work with all client kernels, but since our resources are limited we cannot test kernels other than the supported ones. I don't see any huge demand for e.g. an officially-supported Ubuntu client kernel, but there has long been an unofficial Debian lustre package. On the server side, we will continue to support RHEL 5.x and SLES 10/11 for the Lustre 1.8 release, and RHEL 5.x (6.x is being worked on) for the Lustre 2.x release. Since maintaining kernel patches for other kernels is a lot of work, we do not attempt to provide patches for other than official kernels. However, there have in the past been ports of the kernel patches to other kernels by external contributors (e.g. FC11, FC12, etc.) and this will hopefully continue in the future. The server side is the more critical part, as we are using gentoo+lustre running a vanilla kernel 2.6.22.19 with the lustre patches version 1.6.6. As far as we are concerned it would be nice to have the patches for the vanilla kernels in 1.8.3+. This would be just fine. On the other hand, if maintaining them is the key problem on your side, what would be a major argument against using a patched sles/rhel kernel on a lustre server not running the sles/rhel distro? That is what I would recommend, and what several groups do (usually with Debian, though).
I know a lot of things can happen, but do these rhel/sles patches break some key features of the kernel which would only work under that specific distro? I've positively tested a lustre client with a sles-patched kernel on a gentoo distro, but I am a bit nervous about testing it on our live lustre server system. The only thing that really might cause trouble is udev, since sysfs maintainers like to break old udev versions. I think upcoming Debian Squeeze requires 2.6.27 at a minimum. Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Compiling lustre 1.8.X on Ubuntu LTS 10.04
On Tuesday 04 May 2010, Ramiro Alba Queipo wrote: Hi everybody, I would like to know if anybody is trying to compile lustre 1.8.X on Ubuntu LTS 10.04/Debian testing (squeeze), and to hear your opinion/comments on what I've got: I've been using lustre 1.8.1.1 with the RedHat5 kernel 2.6.18-128.7.1 on Ubuntu LTS 8.04, both servers and clients, but now I would like to upgrade to the recent Ubuntu LTS 10.04. When I try to compile lustre 1.8.1.1 on Ubuntu 10.04 (gcc-4.4 and libc6 2.11.1 instead of gcc-4.2 and libc6 2.7.10 on Ubuntu 8.04), and once I suppressed all references to -Werror in the configure script (I tried --disable-werror, but it did not work), I finally got: That is bug 22729. A very simple patch (entirely untested) should be:

diff --git a/lnet/include/libcfs/linux/kp30.h b/lnet/include/libcfs/linux/kp30.h
--- a/lnet/include/libcfs/linux/kp30.h
+++ b/lnet/include/libcfs/linux/kp30.h
@@ -386,17 +386,8 @@ extern int lwt_snapshot (cycles_t *now,
 # define LPF64 l
 #endif
 
-#ifdef HAVE_SIZE_T_LONG
-# define LPSZ %lu
-#else
-# define LPSZ %u
-#endif
-
-#ifdef HAVE_SSIZE_T_LONG
-# define LPSSZ %ld
-#else
-# define LPSSZ %d
-#endif
+#define LPSZ %zd
+#define LPSSZ %zd
 
 #ifndef LPU64
 # error No word size defined

Please note that I did not test this patch at all yet. Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Compiling lustre 1.8.X on Ubuntu LTS 10.04
On Tuesday 04 May 2010, Ramiro Alba Queipo wrote: On Tue, 2010-05-04 at 14:16 +0200, Bernd Schubert wrote: That is bug 22729. A very simple patch (entirely untested) should be:

diff --git a/lnet/include/libcfs/linux/kp30.h b/lnet/include/libcfs/linux/kp30.h
--- a/lnet/include/libcfs/linux/kp30.h
+++ b/lnet/include/libcfs/linux/kp30.h
@@ -386,17 +386,8 @@ extern int lwt_snapshot (cycles_t *now,
 # define LPF64 l
 #endif
 
-#ifdef HAVE_SIZE_T_LONG
-# define LPSZ %lu
-#else
-# define LPSZ %u
-#endif
-
-#ifdef HAVE_SSIZE_T_LONG
-# define LPSSZ %ld
-#else
-# define LPSSZ %d
-#endif
+#define LPSZ %zd
+#define LPSSZ %zd
 
 #ifndef LPU64
 # error No word size defined

Thanks Bernd. Now it compiles. Please note that I did not test this patch at all yet. I'll follow the bug and test yours. Now a couple of questions: 1) I've compiled the RedHat5 2.6.18-164.11.1 kernel using the config from file config-2.6.18-164.11.1.el5_lustre.1.8.3, extracted from the package kernel-2.6.18-164.11.1.el5_lustre.1.8.3.x86_64-ext4.rpm from the Oracle server, which says: Lustre-patched kernel for ext4 (MDS/MGS/OSS only). Is it the right one? I don't think there is a right or a wrong. For RHEL 5.4 kernels, ldiskfs is based on either ext3 or ext4. The ext3-based version is better tested (with default options); the ext4-based version has more features (e.g. 16TiB OST size support). 2) By looking at the infiniband libraries in Ubuntu LTS 10.04/Debian testing I could see that they are mainly OFED 1.4.2, except the libibverbs and librdmacm packages, which seem to be from OFED 1.5.1 (libibverbs) and 1.5 (librdmacm). I suppose it has been done this way due to the 2.6.32 kernel (containing OFED 1.5.1, as you can see in docs/OFED_release_notes.txt) and openmpi 1.4.1 (coming with OFED 1.5.1), but I'm afraid of having problems when using Lustre 1.8.3 on it. I asked Debian maintainer Roland Dreier but he did not answer. Should I worry? You do not need to care about IB libraries in Lustre at all. Lustre only accesses the kernel interface.
Also, OFED is only a stable collection of different libraries and utils. Mixing versions is allowed, but is not as well tested as the combinations provided by OFED. Cheers, Bernd
Re: [Lustre-discuss] LBUG: ost_rw_hpreq_check() ASSERTION(nb != NULL) failed
Hello Erich, check out my bug report: https://bugzilla.lustre.org/show_bug.cgi?id=19992 It was closed as a duplicate of bug 16129, although that is probably not correct, as 16129 is the root cause, but not the solution. As we never observed it with 1.6.7.2, I didn't complain when bug 19992 was closed. As you now can confirm it also happens with 1.6.7.2, please re-open that bug. Thanks, Bernd On Monday 19 April 2010, Erich Focht wrote: Hi, we saw this LBUG 3 times within the past week, and are puzzled about what's going on, and how come there's no bugzilla entry for this... What happens is that on an OSS a request (must be read or write) expects (according to the content of the ioobj structure) to find an array of 22 struct niobuf_remote's (niocount), but only finds one. This is obviously corrupted. We enabled checksumming where we could, but unfortunately the request headers don't seem to be covered by any checksum check (well, the reply path possibly is). Anyway, we see no corruption/checksum failures for bulk data transfer, so it's improbable that this is a corruption on the wire that three times in a row says size 16 too small (required X) (with X being 352, 432, 4016 in our failures). Did anybody see this? Any ideas or hints? We're using Lustre 1.6.7.2 on server and client side.
The LBUG traceback is:
LustreError: 12946:0:(pack_generic.c:566:lustre_msg_buf_v2()) msg 8101d0c4aad0 buffer[3] size 16 too small (required 352)
LustreError: 12946:0:(ost_handler.c:1594:ost_rw_hpreq_check()) ASSERTION(nb != NULL) failed
LustreError: 12946:0:(ost_handler.c:1594:ost_rw_hpreq_check()) LBUG
Lustre: 12946:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 12946
ll_ost_io_135 R running task 0 12946 1 12947 12945 (L-TLB)
88574438 88abb2e0 063a 8101d0c4ac28 88abb2e0 88571c20 88574a35 88abc7e2 0016
Call Trace:
[88571c20] :libcfs:tracefile_init+0x0/0x110
[88aac641] :ost:ost_rw_hpreq_check+0x1b1/0x290
[88ab9ebf] :ost:ost_hpreq_handler+0x50f/0x7c0
[886d243b] :ptlrpc:ptlrpc_main+0xebb/0x13e0
[8008a4aa] default_wake_function+0x0/0xe
[800b4a6d] audit_syscall_exit+0x327/0x342
[8005dfb1] child_rip+0xa/0x11
[886d1580] :ptlrpc:ptlrpc_main+0x0/0x13e0
[8005dfa7] child_rip+0x0/0x11
Regards, Erich -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Lost Files - How to remove from MDT
On Sunday 18 April 2010, Charles Taylor wrote: On Apr 18, 2010, at 9:38 AM, Brian J. Murrell wrote: On Sun, 2010-04-18 at 09:30 -0400, Charles Taylor wrote: Is there some way to remove these files from the MDT - as though they never existed - without reformatting the entire file system? lfsck is the documented, supported method. Yes, but we attempted that at one time with a smaller file system (for a different reason). After letting it run for over a day, we estimated that it would have taken seven to ten days to finish. That just wasn't practical for us at the time and still isn't. This file system would probably take a couple of weeks to lfsck. I'm sorry to say we can't take the file system offline for that long. You don't need to take the filesystem offline for lfsck. Also, I have rewritten large parts of lfsck and also fixed the parallelization code. I need to review all patches again and probably also make a hg or git repository out of it. Unfortunately, I always have more tasks to do than I manage to do... But given the fact that I fixed several bugs and added safety checks, I think my version actually is better than upstream. Let me know if you are interested and I can put a tar ball of e2fsprogs-sun-ddn on my home page. Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Extremely high load and hanging processes on a Lustre client
On Friday 05 March 2010, Götz Waschk wrote: Hi everyone, I have a critical problem on one of my Lustre client machines running Scientific Linux 5.4 and the patchless Lustre 1.8.2 client. After a few days of usage, some processes like cp and kswapd0 start to use 100% CPU. Only 180k of swap space are in use though. Processes that try to access Lustre use a lot of CPU and seem to hang. There is some output in the kernel log I'll attach to this mail. Do you have any idea what to test before rebooting the machine? Don't reboot, but disable LRU resizing: for i in /proc/fs/lustre/ldlm/namespaces/*; do echo 800 > ${i}/lru_size; done At least that always helped before when we had that problem. I hoped it would be fixed in 1.8.2, but it seems it is not. Please open a bug report. Thanks, Bernd -- Bernd Schubert DataDirect Networks
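The loop above writes a fixed value into each lock namespace's lru_size file, which turns off automatic LRU resizing. Here is the same write pattern exercised against a mock directory tree, since the real files live under /proc/fs/lustre/ldlm/namespaces/*/lru_size and only exist on a Lustre client (the namespace names below are made up):

```shell
# Demonstrate the write-into-lru_size pattern against a mock tree;
# on a real client the glob would be /proc/fs/lustre/ldlm/namespaces/*.
ns=$(mktemp -d)
mkdir -p "$ns/mdc-demo" "$ns/osc-demo"

for i in "$ns"/*; do
    echo 800 > "${i}/lru_size"   # a fixed LRU size disables auto-resizing
done

cat "$ns/mdc-demo/lru_size"   # -> 800
```

On later Lustre versions the equivalent can also be done with lctl, e.g. `lctl set_param ldlm.namespaces.*.lru_size=800`.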
Re: [Lustre-discuss] Rx failures
On Thursday 11 February 2010, Ulrich Sibiller wrote: Ulrich Sibiller wrote:
Feb 10 13:33:24 hpc9master02 kernel: LustreError: 4475:0:(lib-move.c:2436:LNetPut()) Error sending PUT to 12345-192.168.60@o2ib: -113
Feb 2 16:08:19 hpc9oss1 kernel: Lustre: 7937:0:(o2iblnd_cb.c:2220:kiblnd_passive_connect()) Conn stale 192.168.60@o2ib [old ver: 12, new ver: 12]
Feb 2 15:59:27 hpc9mds1 kernel: Lustre: 5008:0:(o2iblnd_cb.c:2232:kiblnd_passive_connect()) Conn race 192.168.60@o2ib
For the record: I finally found the source of these problems: we had two IPoIB interfaces in the fabric using the same IP address (192.168.60.226)... I guess next time you should run lnet_selftest and lctl ping first. Greetings from Tübingen, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Filesystem monitoring in Heartbeat
On Thursday 21 January 2010, Adam Gandelman wrote: Jagga Soorma wrote: Hi Guys, My MDT is set up with LVM and I was able to test failover based on the Volume Group failing on my MDS (by unplugging both fibre cables). However, for my OSTs, I have created filesystems directly on the SAN LUNs, and when I unplug the fibre cables on my OSS, heartbeat does not detect failure for the filesystem since it shows as mounted. Is there somehow we can trigger a failure based on multipath failing on the OSS? Hi- It would depend on the version of heartbeat you are using. Heartbeat v1 did not do any resource-level monitoring, and if that is what you are using you are out of luck. If using v2 CRM and/or Pacemaker, you have two options: 1. Modify the Filesystem OCF script's monitor operation to check the actual health of the filesystem and/or multipath in addition to the status of the mount, and return accordingly. The Filesystem OCF agent is located at /usr/lib/ocf/resource.d/heartbeat/Filesystem 2. Create your own resource agent that interacts with dm/multipath to start/stop/monitor it. Then constrain the resource to start before/stop after and run with the Filesystem resource. Then the filesystem will be dependent on the health of the multipath resource. I guess you want to use the pacemaker agent I posted in this bugzilla: https://bugzilla.lustre.org/show_bug.cgi?id=20807 It does not interact with multipath, but knows about several lustre details. How would you monitor multipath? If one of your several paths fails, what do you want to do? If all paths fail, it is clear, but what to do for a partial path failure? I don't think OCF defines a return code for that. I also think multipath should be a separate agent, to reduce the complexity of the script. Cheers, Bernd -- Bernd Schubert DataDirect Networks
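A monitor action along the lines discussed above might be sketched like this. The return codes follow the OCF convention; the path-health check is a placeholder hook (e.g. a `multipath -ll` grep), not the agent from bug 20807:

```shell
# Minimal OCF-style monitor sketch: report "not running" if the
# target is not mounted, a generic error if the (placeholder)
# path-health check fails, success otherwise.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

fs_monitor() {
    mnt=$1
    path_check=$2   # hypothetical hook, e.g. 'multipath -ll $dev | grep -q active'
    mountpoint -q "$mnt" || return $OCF_NOT_RUNNING
    eval "$path_check" || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

fs_monitor /definitely-not-mounted true; echo $?   # -> 7
```

A real agent would also implement start, stop, and meta-data actions, and decide (per Bernd's question) how a partial path failure should map onto these return codes.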
Re: [Lustre-discuss] Lustre claims OST is mounted when it is not
On Friday 15 January 2010, Erik Froese wrote: We had an OSS lockup and it had to be reset. Heartbeat failed to mount one of the OSTs and unmounted all of its local OSTs. I'm trying to run mount on one of the OSTs (ost08) but it claims it is mounted when it is not. I have other OSTs mounted so I can't remove the driver right now. Any ideas? Redhat 5.3
[r...@oss-0-0 ~]# uname -a
Linux oss-0-0.local 2.6.18-128.7.1.el5_lustre.1.8.1.1 #1 SMP Tue Oct 6 05:48:57 MDT 2009 x86_64 x86_64 x86_64 GNU/Linux
[r...@oss-0-0 ~]# mount | grep ost
/dev/dsk/ost12 on /mnt/scratch/ost12 type lustre (rw)
/dev/dsk/ost16 on /mnt/scratch/ost16 type lustre (rw)
/dev/dsk/ost20 on /mnt/scratch/ost20 type lustre (rw)
/dev/dsk/ost00 on /mnt/scratch/ost00 type lustre (rw)
/dev/dsk/ost04 on /mnt/scratch/ost04 type lustre (rw)
/dev/dsk/ost110 on /mnt/scratch/ost110 type lustre (rw)
[r...@oss-0-0 ~]# umount -f /mnt/scratch/ost08
umount2: Invalid argument
umount: /mnt/scratch/ost08: not mounted
[r...@oss-0-0 ~]# e2fsck -n /dev/dsk/ost08 | tee /state/partition1/e2fsck-n.ost08_`date '+%m.%d.%y-%H:%M:%S'`.log
e2fsck 1.41.6.sun1 (30-May-2009)
device /dev/sdj mounted by lustre per /proc/fs/lustre/obdfilter/scratch-OST0018/mntdev
Warning! /dev/dsk/ost08 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
see here:
https://bugzilla.lustre.org/show_bug.cgi?id=19566
https://bugzilla.lustre.org/show_bug.cgi?id=21359
-- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Lustre claims OST is mounted when it is not
Hello Erik, unfortunately, there is no solution other than to reboot. For some unknown (yet to be debugged) reason, references could not be given up, so in order to prevent NULL pointer dereferences, Lustre did not umount. Cheers, Bernd On Friday 15 January 2010, Erik Froese wrote: Thanks Bernd. From the bug reports it looks like the OST is actually still mounted by lustre, unbeknownst to Linux and the VFS. Is there a mechanism to unmount it or do I need to reboot? Erik On Fri, Jan 15, 2010 at 3:28 PM, Bernd Schubert bs_li...@aakef.fastmail.fm wrote: On Friday 15 January 2010, Erik Froese wrote: We had an OSS lockup and it had to be reset. Heartbeat failed to mount one of the OSTs and unmounted all of its local OSTs. I'm trying to run mount on one of the OSTs (ost08) but it claims it is mounted when it is not. I have other OSTs mounted so I can't remove the driver right now. Any ideas? Redhat 5.3 [r...@oss-0-0 ~]# uname -a Linux oss-0-0.local 2.6.18-128.7.1.el5_lustre.1.8.1.1 #1 SMP Tue Oct 6 05:48:57 MDT 2009 x86_64 x86_64 x86_64 GNU/Linux [r...@oss-0-0 ~]# mount | grep ost /dev/dsk/ost12 on /mnt/scratch/ost12 type lustre (rw) /dev/dsk/ost16 on /mnt/scratch/ost16 type lustre (rw) /dev/dsk/ost20 on /mnt/scratch/ost20 type lustre (rw) /dev/dsk/ost00 on /mnt/scratch/ost00 type lustre (rw) /dev/dsk/ost04 on /mnt/scratch/ost04 type lustre (rw) /dev/dsk/ost110 on /mnt/scratch/ost110 type lustre (rw) [r...@oss-0-0 ~]# umount -f /mnt/scratch/ost08 umount2: Invalid argument umount: /mnt/scratch/ost08: not mounted [r...@oss-0-0 ~]# e2fsck -n /dev/dsk/ost08 | tee /state/partition1/e2fsck-n.ost08_`date '+%m.%d.%y-%H:%M:%S'`.log e2fsck 1.41.6.sun1 (30-May-2009) device /dev/sdj mounted by lustre per /proc/fs/lustre/obdfilter/scratch-OST0018/mntdev Warning! /dev/dsk/ost08 is mounted. Warning: skipping journal recovery because doing a read-only filesystem check.
see here: https://bugzilla.lustre.org/show_bug.cgi?id=19566 https://bugzilla.lustre.org/show_bug.cgi?id=21359 -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] No space left on device for just one file
320625972 1% /lustre/scratch[OST:24]
scratch-OST0019_UUID 329427010 5222114 324204896 1% /lustre/scratch[OST:25]
scratch-OST001a_UUID 317921820 5115591 312806229 1% /lustre/scratch[OST:26]
scratch-OST001b_UUID 366288896 5353229 360935667 1% /lustre/scratch[OST:27]
scratch-OST001c_UUID 366288896 5383473 360905423 1% /lustre/scratch[OST:28]
scratch-OST001d_UUID 366288896 5411890 360877006 1% /lustre/scratch[OST:29]
scratch-OST001e_UUID 216236615 617 210047728 2% /lustre/scratch[OST:30]
scratch-OST001f_UUID 366288896 6465049 359823847 1% /lustre/scratch[OST:31]
filesystem summary: 1453492963 174078773 1279414190 11% /lustre/scratch
Thanks, Mike Robbert
On Jan 11, 2010, at 7:24 PM, Andreas Dilger wrote: On 2010-01-11, at 15:59, Michael Robbert wrote: The filename is not very unique. I can create a file with the same name in another directory or on another Lustre filesystem. It is just this exact path on this filesystem. The full path is: /lustre/scratch/smoqbel/Cenval/CLM/Met.Forcing/18X11/NLDAS.APCP.007100.pfb.00164 The mount point for this filesystem is /lustre/scratch/ Robert, does the same problem happen on multiple client nodes, or is it only happening on a single client? Are there any messages on the MDS and/or the OSSes when this problem is happening? This problem is somewhat unusual, since I'm not aware of any places outside the disk filesystem code that would cause ENOSPC when creating a file. Can you please do a bit of debugging on the system:
{client}# cd /lustre/scratch/smoqbel/Cenval/CLM/Met.Forcing/18X11
{mds,client}# echo -1 > /proc/sys/lustre/debug # enable full debug
{mds,client}# lctl clear # clear debug logs
{client}# touch NLDAS.APCP.007100.pfb.00164
{mds,client}# lctl dk > /tmp/debug.{mds,client} # dump debug logs
For now, please just extract the ENOSPC error from the logs; that will be much shorter, may be enough to identify where the problem is located, and will be a lot friendlier to the list.
grep -- -28 /tmp/debug.{mds,client} > /tmp/debug-28.{mds,client}
along with the lfs df and lfs df -i output. If this is only on a single client, just dropping the locks on the client might be enough to resolve the problem:
for L in /proc/fs/lustre/ldlm/namespaces/*; do echo clear > $L/lru_size; done
If, on the other hand, this same problem is happening on all clients, then the problem is likely on the MDS. On Fri, Jan 8, 2010 at 1:36 PM, Michael Robbert mrobb...@mines.edu wrote: I have a user that reported a problem creating a file on our Lustre filesystem. When I investigated I found that the problem appears to be unique to just one filename in one directory. I have tried numerous ways of creating the file, including echo, touch, and lfs setstripe; all return No space left on device. I have checked the filesystem with df and lfs df; both show that the filesystem and all OSTs are far from being full, for both blocks and inodes. Slightly changed filenames are created fine. We had a kernel panic on the MDS yesterday and it is quite possible that the user had a compute job working in this directory at the time of that problem. I am guessing we have some kind of corruption in the directory. This directory has around 1 million files, so moving the data around may not be a quick operation, but we're willing to do it. I just want to know the best way, short of taking the filesystem offline, to fix this problem. Any ideas? Thanks in advance, Mike Robbert Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc.
-- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] No space left on device for just one file
Hello Robert, could you please send a mail into our ticket system? Kit or I would then start to investigate tomorrow. Thanks, Bernd On Monday 11 January 2010, Michael Robbert wrote: The filename is not very unique. I can create a file with the same name in another directory or on another Lustre filesystem. It is just this exact path on this filesystem. The full path is: /lustre/scratch/smoqbel/Cenval/CLM/Met.Forcing/18X11/NLDAS.APCP.007100.pfb .00164 The mount point for this filesystem is /lustre/scratch/ Thanks, Mike On Jan 11, 2010, at 5:52 AM, Mag Gam wrote: Can you paste us the file name? I want to see if we can touch something like this. On Fri, Jan 8, 2010 at 1:36 PM, Michael Robbert mrobb...@mines.edu wrote: I have a user that reported a problem creating a file on our Lustre filesystem. When I investigated I found that the problem appears to be unique to just one filename in one directory. I have tried numerous ways of creating the file including echo, touch, and lfs setstripe all return No space left on device. I have checked the filesystem with df and lfs df both show that the filesystem and all OSTs are far from being full for both blocks and inodes. Slight changes in the filename are created fine. We had a kernel panic on the MDS yesterday and it was quite possible that the user had a compute job working in this directory at the time of that problem. I am guessing we have some kind of corruption with the directory. This directory has around 1 million files so moving the data around may not be a quick operation, but we're willing to do it. I just want to know the best way, short of taking the filesystem offline, to fix this problem. Any ideas? 
Thanks in advance, Mike Robbert ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] failover problems using separated journal disk
Hello Antonio, On Wednesday 23 December 2009, Antonio Concas wrote: Hi, all Dec 23 11:20:29 mommoti12 kernel: LDISKFS-fs: external journal has bad superblock see here: https://bugzilla.lustre.org/show_bug.cgi?id=21389 Cheers, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] lustre 1.6.7.2 client kernel panic
Hello Nick, at least I'm not aware of any drawbacks. Cheers, Bernd On Tuesday 22 December 2009, Nick Jennings wrote: Thanks for this tip Bernd. I'll be unable to upgrade for a while, so this is a very useful workaround. Does it have any drawbacks I should be aware of? On 12/22/2009 12:35 AM, Bernd Schubert wrote: On Monday 21 December 2009, Andreas Dilger wrote: On 2009-12-21, at 11:15, Nick Jennings wrote: I had another instance of the client kernel panic which I first encountered a few months ago. This time I managed to get a shot of the console. Attached is the dmesg output from ssn1(OSS) dbn1(MDS) and the JPG is from the console of wsn1(client). I see bug 19841, which has at least part of this stack (ldlm_cli_pool_shrink) and that is marked a duplicate of 17614. The latter bug is marked landed for 1.8.0 and later releases. Nick, if you do not want to upgrade or patch your Lustre version, the workaround for this is to disable lockless truncates. # on all clients for i in /proc/fs/lustre/llite/*; do echo 0 > ${i}/lockless_truncate; done Cheers, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
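The one-liner quoted above writes 0 into each client's /proc tunable. A slightly more defensive version of the same workaround can be sketched as a function (the helper name and the existence/writability guard are mine, not from the thread):

```shell
#!/bin/sh
# disable_lockless_truncate DIR: write 0 to every */lockless_truncate
# tunable under DIR, skipping entries that are absent or not writable.
# On a real Lustre client, DIR is /proc/fs/lustre/llite.
disable_lockless_truncate() {
    for f in "$1"/*/lockless_truncate; do
        [ -w "$f" ] || continue
        echo 0 > "$f"
    done
}

# On every client one would run:
#   disable_lockless_truncate /proc/fs/lustre/llite
```

The guard makes the script a harmless no-op on machines that have no Lustre filesystem mounted.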
Re: [Lustre-discuss] lustre 1.6.7.2 client kernel panic
On Tuesday 22 December 2009, Nick Jennings wrote: On 12/21/2009 07:36 PM, Brian J. Murrell wrote: Photographs of 25 line console screens are not very often suitable substitutes for real console logging, unfortunately. Seriously, if you really want to pursue this issue, you are going to have to set up some form of console logging. I think netconsole is usually fairly successful at capturing kernel oops dumps. Maybe that's an option. ISTR mentioning netconsole the last time though. Maybe that was another thread. You're right, I just hadn't gotten around to getting netconsole set up like I planned. *blush* :) Most servers nowadays have IPMI and an IPMI SOL is much better. Cheers, Bernd ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] lustre 1.6.7.2 client kernel panic
On Tuesday 22 December 2009, David Dillow wrote: On Tue, 2009-12-22 at 18:09 +0100, Bernd Schubert wrote: On Tuesday 22 December 2009, Nick Jennings wrote: On 12/21/2009 07:36 PM, Brian J. Murrell wrote: Photographs of 25 line console screens are not very often suitable substitutes for real console logging, unfortunately. Seriously, if you really want to pursue this issue, you are going to have to set up some form of console logging. I think netconsole is usually fairly successful at capturing kernel oops dumps. Maybe that's an option. ISTR mentioning netconsole the last time though. Maybe that was another thread. You're right, I just hadn't gotten around to getting netconsole set up like I planned. *blush* :) Most servers nowadays have IPMI and an IPMI SOL is much better. Heh, I'd like to know what servers you are running. Our experience with IPMI SOL on a variety of systems has been anything but reliable. It has a notorious habit of dropping out under any sort of load, such as during an oops where you need it the most. It's still better than nothing, but it's a crapshoot. Yes, I know about IPMI issues, of course. In my experience, SuperMicro IPMI with an additional NIC port works perfectly. I don't know about their most recent mainboards and BMCs, though. The first week I started for DDN I learned that Dell-DRAC5 has a bug and does not send a break (sysrq). According to Dell, this is fixed in their recent firmware released 7 days ago (I opened the 'priority' call on March 6th), but I could not check yet. Also working rather well is HP iLO, although not with SOL, but with its built-in vsp. The problem with vsp is that cursor keys do not work and navigating through the grub menu is a pain, unless you know emacs shortcuts inside out (I'm a vi user...). Cheers, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Implementing MMP correctly
Michael, to answer your question on the pacemaker mailing list, if you use an agent that also checks for all umount bugs, it might work without MMP, but you still remove a very useful protection. And the situation didn't change since October when you asked a similar question last time ;) On Tuesday 22 December 2009, Jim Garlick wrote: On Tue, Dec 22, 2009 at 02:12:44PM +0100, Michael Schwartzkopff wrote: Hi, I am trying to understand how to implement MMP correctly in a lustre failover cluster. As far as I understood, MMP protects the same filesystem from being mounted by different nodes (OSS) of a failover cluster. So far so good. If a node was shut down uncleanly it will still occupy its filesystems via MMP and thus prevent a clean failover to another node. How did you get this idea at all? Hi, ldiskfs (or e2fsck) will poll the MMP block to see if the other side is still updating it before starting. If updates have ceased, the mount or fsck will start. So the workarounds below are unnecessary. Now I want to implement a clean failover into the Filesystem Resource Agent of pacemaker. Is there a good way to solve the problem with MMP? Possible solutions are: - Disable the MMP feature in a cluster at all, since the resource manager takes care that the same resource is only mounted once in the cluster - Do a tunefs -O ^mmp device and a tunefs -O mmp device before every mounting of a resource? tune2fs -E clear_mmp is a faster alternative. Doing that for each and every mount would basically remove the MMP protection. I think Michael wants to write an agent that does that automatically... And again, I submitted and updated a suitable agent in Lustre bugzilla 20807. It is almost ready to be submitted to heartbeat/pacemaker, I only need to clean up some comments and slightly simplify some umount checks. Should only be necessary if e2fsck is interrupted. 
(e2fsck does not regularly update the MMP block like the file system does) - Do a sleep 10 before mounting a resource? But the manual says the file system mount may require additional time if the file system was not cleanly unmounted. It will require more time for a journal replay, I guess. - Check if the file system is in use by another OSS through MMP and wait a little bit longer? How do I do this? Not necessary. All DDN Lustre installations in Europe are now based on pacemaker, without any ugly workarounds. But then as I told you before, our releases also fix bug 19566 already. -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] lustre 1.6.7.2 client kernel panic
On Monday 21 December 2009, Andreas Dilger wrote: On 2009-12-21, at 11:15, Nick Jennings wrote: I had another instance of the client kernel panic which I first encountered a few months ago. This time I managed to get a shot of the console. Attached is the dmesg output from ssn1(OSS) dbn1(MDS) and the JPG is from the console of wsn1(client). I see bug 19841, which has at least part of this stack (ldlm_cli_pool_shrink) and that is marked a duplicate of 17614. The latter bug is marked landed for 1.8.0 and later releases. Nick, if you do not want to upgrade or patch your Lustre version, the workaround for this is to disable lockless truncates. # on all clients for i in /proc/fs/lustre/llite/*; do echo 0 > ${i}/lockless_truncate; done Cheers, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
[Lustre-discuss] async journals
Hello, I'm presently a bit puzzled about asynchronous journal patches. While I was just reading the jbd-journal-chksum-rhel53.patch patch, I noticed it also adds a new option and feature journal_async_commit. But then ever since lustre-1.8.0 there is also a patch included for async journals from obdfilter. This patch is presently disabled, since it could cause data corruption on failover. I now wonder how these two patches/features are related, i.e. jbd/ldiskfs/ext4 (journal_async_commit) vs. obdfilter (obdfilter.*.sync_journal=0). When I did some tests with lctl set_param obdfilter.*.sync_journal=0, it even slightly reduced performance. So I wonder if one additionally needs to enable jbd async journals? Thanks, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] OST I/O problems
On Friday 04 December 2009, Heiko Schröter wrote: Hello, we do see those messages (see below) on our OSTs when under heavy _read_ load (or when 60+ jobs are trying to read data at approx the same time). The OSTs freeze and even console output is down to a few bytes per minute. After some time the OSTs do recover. ler.c:882:ost_brw_read()) @@@ timeout on bulk PUT after 100+0s r...@81007efa7e00 x7869690/t0 This error message means you have a flaky network. For example it comes up if you set a high MTU, but your switch does not support it. Cheers, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
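One quick sanity check for the MTU mismatch mentioned above is to compare the MTU of every interface on clients and servers against the switch configuration. A small sketch (the helper name is mine; on a live node one would pass /sys/class/net):

```shell
#!/bin/sh
# show_mtus DIR: print "<iface> <mtu>" for every interface directory
# under DIR that exposes an mtu file. On a live Linux node, DIR is
# /sys/class/net; run it on every node and compare with the switch.
show_mtus() {
    for dev in "$1"/*; do
        [ -r "$dev/mtu" ] || continue
        printf '%s %s\n' "$(basename "$dev")" "$(cat "$dev/mtu")"
    done
}

# Usage on a live node:
#   show_mtus /sys/class/net
```

Any node whose data interface reports a jumbo MTU that the switch does not actually support is a candidate for the bulk timeouts described above.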
Re: [Lustre-discuss] how to define 60 failnodes
On Monday 09 November 2009, Brian J. Murrell wrote: Theoretically. I had discussed this briefly with another engineer a while ago and IIRC, the result of the discussion was that there was nothing inherent in the configuration logic that would prevent one from having more than two (primary and failover) OSSes providing service to an OST. Two nodes per OST is how just about everyone that wants failover configures Lustre. Not everyone ;) And especially it doesn't make sense to have a 2 node failover scheme with pacemaker: https://bugzilla.lustre.org/show_bug.cgi?id=20964 -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Support for vanilla kernels in lustre servers
On Saturday 31 October 2009, Mag Gam wrote: if I were to deploy a system now and I want to do the kernel compile way, what kernel do you recommend? I prefer using 1.6.7.2 because of its stability... Sun is very helpful and provides distribution kernels as tar.bz2 on their download page: http://downloads.lustre.org/public/kernels/ So instead of going through the pain to get that yourself from the vendor's src.rpm, Sun already greatly helps (so far I have not found an easy way to do that myself, any hint from the guy providing the tar files would be highly appreciated). In a perfect world, this page would also state which kernel is suitable for which Lustre version, e.g. lustre-1.6.7.2 linux-2.6.18-92.1.10.el5.tar.bz2 lustre-1.8.1.1 linux-2.6.18-128.7.1.el5.tar.bz2 Also missing are the .config files. I usually extract these from the kernel binary packages - I need to download 150MB to get the 4KB config file *sigh*. And better don't try to change options in non-vanilla kernels, this very often fails, because the vendor doesn't support it and so also doesn't test it. Btw, of course these kernels also work for different distributions, so instead of going through the pain to port Lustre to Ubuntu or Debian kernels, I simply started to create Debian packages for the RHEL5 kernels http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/lustre/1.6/debs/lustre-clients/1.6.7.2-ddn2/ linux-image-2.6.18-128.7.1.el5_1_amd64.deb linux-headers-2.6.18-128.7.1.el5_1_amd64.deb Cheers, Bernd PS: Disclaimer: Whatever packages you may find on my home page, I won't provide support for these! -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] intrepid kernel on jaunty
On Thursday 29 October 2009, Ralf Utermann wrote: Papp Tamás schrieb: Papp Tamás wrote, On 2009. 10. 28. 22:01: Brian J. Murrell wrote, On 2009. 10. 28. 20:26: Additionally, b1_8 (aka 1.8.2) also has debian packaging support (on an unsupported basis at this point) in /debian. Here are the installed packages, built from the /debian in b1_8 from my client: ii lustre-client-modules-2.6.28-11-generic1.8.1.50-2 Lustre Linux kernel module (kernel 2.6.28-11 ii lustre-tests 1.8.1.50-2 Test suite for the Lustre filesystem ii lustre-utils 1.8.1.50-2 Userspace utilities for the Lustre filesyste Somehow I could make the debs, but not the modules, only these: liblustre_1.8.1.50-1_amd64.deb lustre-dev_1.8.1.50-1_amd64.deb lustre-tests_1.8.1.50-1_amd64.deb linux-patch-lustre_1.8.1.50-1_all.deb lustre-source_1.8.1.50-1_all.deb lustre-utils_1.8.1.50-1_amd64.deb then install lustre-source_1.8.1.50-1_all.deb, and build the modules using 'm-a build lustre ' . You might need to chmod +x /usr/src/modules/lustre/debian/rules . I might be the only one complaining, but I think the method I proposed with attachment 24529 (https://bugzilla.lustre.org/attachment.cgi?id=24529) was more convenient, at least when you just want to build packages, e.g. in a chroot. Just autotools support was missing to automatically do the sed steps. And no, when Brian was working on this, I really didn't have the time to step in, I had already been far above 16 hours per day at that time. -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] more on MGS and MDT separation
On Thursday 22 October 2009, Ms. Megan Larko wrote: Greetings, I am deactivating an older Lustre filesystem in favor of a newer one (already up and stable). The message in Lustre-discuss Digest Vol 45 Issue 22 stated (with two of my comments in-line): Message: 2 Date: Sun, 11 Oct 2009 19:05:07 +0200 From: Bernd Schubert bs_li...@aakef.fastmail.fm Subject: Re: [Lustre-discuss] Moving MGS to separate device To: lustre-discuss@lists.lustre.org Message-ID: 200910111905.08076.bs_li...@aakef.fastmail.fm Content-Type: Text/Plain; charset=iso-8859-15 Hello Wojciech, I already did this several times, here are the steps I so far used: 1) Remove MGS from MDT-device tunefs.lustre --nogms /dec/mdt_device Megan: I am assuming --nomgs here. Yes, sorry a typo. 2) Create new MGS mkfs.lustre --mgs /dev/mgs_device 3) Make sure OSTs and MDTs re-register with the MGS: tunefs.lustre --writeconf /dev/device Megan: Do I need to do this even if the MGS is being moved from a shared device with an MDT to its own device/hard drive on the same physical server (same MAC addr, IP, hostname etc.)? I did it to make sure MDT and OSTs re-register with the MGS, so to make sure the MGS really knows about them. In the end the MGS is only there to know about OSTs and MDTs and to provide those information to clients. I think on mounting of MDTs and OSTs they always contact the MGS, but if the MGS is not available they will give up, so it **might** happen that they wouldn't contact the MGS and so wouldn't be registered. --writeconf will ensure that. As I also wrote, it might work to simply copy the CONFIGS directory, but I didn't test that yet. I'm not sure if writeconf is really necessary, but so far I always did it to make sure everything goes smoothly (clients shouldn't have the filesystem mounted at this step). 5) Mount MGS, MDT, OSTs 4) Re-apply settings done with lctl. Megan: Why are the above ordered the way they are? Shouldn't I mount first and then apply the settings? 
(I didn't think I could lctl an unmounted OST/MDT etc.) And another typo, actually even two. lctl settings can be applied only with the filesystem being mounted. So 4) Mount MGS, MDT, OSTs 5) Re-apply settings done with lctl. Cheers, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
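With both typos corrected, the whole procedure can be put together as a dry-run shell function (the function, device paths, and mount points are illustrative, not from the thread; by default it only prints the commands):

```shell
#!/bin/sh
# move_mgs MDT_DEV MGS_DEV [RUN]: command sequence for moving the MGS
# off a combined MGS/MDT device. RUN defaults to "echo", so this is a
# dry run; pass an empty third argument on the real servers to execute.
# The --writeconf step must also be run on every OST device.
move_mgs() {
    mdt=$1; mgs=$2; run=${3-echo}
    $run tunefs.lustre --nomgs "$mdt"        # 1) remove MGS from the MDT
    $run mkfs.lustre --mgs "$mgs"            # 2) create the new MGS
    $run tunefs.lustre --writeconf "$mdt"    # 3) force re-registration
    $run mount -t lustre "$mgs" /mnt/mgs     # 4) mount MGS first,
    $run mount -t lustre "$mdt" /mnt/mdt     #    then MDT (and OSTs)
}                                            # 5) re-apply lctl settings

# Dry run, printing the five commands:
#   move_mgs /dev/mdt_device /dev/mgs_device
```

The dry-run default keeps the sketch safe to paste; only after reviewing the printed commands would one re-run it with an empty RUN argument on the servers.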
Re: [Lustre-discuss] 1.8.1 test setup achieved, what about maximum mdt size
On Tuesday 20 October 2009, Andreas Dilger wrote: On 18-Oct-09, at 16:04, Piotr Wadas wrote: Now, I did a simple count of MDT size as described in lustre 1.8.1 manual, and setup mdt as recommended. The question is, no matter I did right count or not, what actually will happen, if MDT partition runs out of space? Any chances to dump the whole MGS+MDT combined fs, supply a bigger block device, or extend partition size with some e2fsprogs/tune2fs trick ? This assumes, that no matter how big MDT is, it will be exhausted someday. It is true that the MDT device can become full at some point, but this happens fairly rarely given that most Lustre HPC users have very large files, and the size of the MDT is MUCH smaller than the space needed for the file data. The maximum size of MDT is 8TB, and if you format the Is that still true with recent kernels such as the one from SLES11? I thought ldiskfs is based on ext4 there? So we should have at least 16TiB and I'm not sure if all the e2fsprogs patches already have been landed to get 64-bit max sizes? Thanks, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Understanding of MMP
On Monday 19 October 2009, Andreas Dilger wrote: On 19-Oct-09, at 08:46, Michael Schwartzkopff wrote: perhaps I have a problem understanding multiple mount protection MMP. I have a cluster. When a failover happens sometimes I get the log entry: Oct 19 15:16:08 sososd7 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect: Device is already active on another node. Oct 19 15:16:08 sososd7 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect: MMP failure info: last update time: 1255958168, last update node: sososd3, last update device: dm-2 Does the second line mean that my node (sososd7) tried to mount /dev/ dm-2 but MMP prevented it from doing so because the last update from the old node (sososd3) was too recent? The update time stored in the MMP block is purely for informational purposes. It actually uses a sequence counter that has nothing to do with the system clock on either of the nodes (since they may not be in sync). What that message actually means is that sososd7 tried to mount the filesystem on dm-2 (which likely has another LVM name that the kernel doesn't know anything about) but the MMP block on the disk was modified by sososd3 AFTER sososd7 first looked at it. Probably, bug#19566. Michael, which Lustre version do you exactly use? Thanks, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Understanding of MMP
On Monday 19 October 2009, Michael Schwartzkopff wrote: Am Montag, 19. Oktober 2009 20:42:19 schrieben Sie: On Monday 19 October 2009, Andreas Dilger wrote: On 19-Oct-09, at 08:46, Michael Schwartzkopff wrote: perhaps I have a problem understanding multiple mount protection MMP. I have a cluster. When a failover happens sometimes I get the log entry: Oct 19 15:16:08 sososd7 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect: Device is already active on another node. Oct 19 15:16:08 sososd7 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect: MMP failure info: last update time: 1255958168, last update node: sososd3, last update device: dm-2 Does the second line mean that my node (sososd7) tried to mount /dev/ dm-2 but MMP prevented it from doing so because the last update from the old node (sososd3) was too recent? The update time stored in the MMP block is purely for informational purposes. It actually uses a sequence counter that has nothing to do with the system clock on either of the nodes (since they may not be in sync). What that message actually means is that sososd7 tried to mount the filesystem on dm-2 (which likely has another LVM name that the kernel doesn't know anything about) but the MMP block on the disk was modified by sososd3 AFTER sososd7 first looked at it. Probably, bug#19566. Michael, which Lustre version do you exactly use? Thanks, Bernd I got version 1.8.1.1 which was published last week. Is the fix included or only in 1.8.2? According to the bugzilla (https://bugzilla.lustre.org/show_bug.cgi?id=19566) not yet in 1.8.1.1. Our ddn internal releases of course do have it. And from my point of view this is a really important fix. Ever since 1.6.7 there is also no chance anymore to figure out the unsuccessful umount from the resource agent (up to 1.6.6 /proc/fs/lustre/.../mntdev would tell you the device is still mounted). To be sure this really your issue, do you see this in your kernel logs? 
CERROR("Mount %p is still busy (%d refs), giving up.\n", mnt, atomic_read(&mnt->mnt_count)); -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
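To check whether this symptom is what actually happened, one can grep the kernel log for that message; a trivial sketch (the helper name is mine):

```shell
#!/bin/sh
# busy_umount_msgs LOGFILE: print kernel-log lines showing the failed
# umount from bug 19566. Point it at /var/log/kern.log, or at the
# output of dmesg saved to a file.
busy_umount_msgs() {
    grep -F 'is still busy' "$1"
}
```

If the message appears, the device was never cleanly released and MMP is right to block the second mount.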
Re: [Lustre-discuss] Problem re-mounting Lustre on an other node
On Wednesday 14 October 2009, Michael Schwartzkopff wrote: Hi, we have a Lustre 1.8 Cluster with openais and pacemaker as the cluster manager. When I migrate one lustre resource from one node to an other node I get an error. Stopping lustre on one node is no problem, but the node where lustre should start says: Oct 14 09:54:28 sososd6 kernel: kjournald starting. Commit interval 5 seconds Oct 14 09:54:28 sososd6 kernel: LDISKFS FS on dm-4, internal journal Oct 14 09:54:28 sososd6 kernel: LDISKFS-fs: recovery complete. Oct 14 09:54:28 sososd6 kernel: LDISKFS-fs: mounted filesystem with ordered data mode. Oct 14 09:54:28 sososd6 multipathd: dm-4: umount map (uevent) Oct 14 09:54:39 sososd6 kernel: kjournald starting. Commit interval 5 seconds Oct 14 09:54:39 sososd6 kernel: LDISKFS FS on dm-4, internal journal Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: mounted filesystem with ordered data mode. Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: file extents enabled Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: mballoc enabled Oct 14 09:54:39 sososd6 kernel: Lustre: mgc134.171.16@tcp: Reactivating [...] These log continue until the cluster software times out and the resource tells me about the error. Any help understanding these logs? Thanks. What is your start timeout? Do you see mount in the process list? I guess you just need to increase the timeout, I usually set at least 10 minutes, sometimes even 20 minutes. Also see my bug report and if possible add further information yourself. https://bugzilla.lustre.org/show_bug.cgi?id=20402 Thanks, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
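On the pacemaker side, the start/stop timeouts recommended above are set per operation on the resource. A sketch in crm shell syntax (the resource name, device, and mount point are placeholders, and ocf:heartbeat:Filesystem stands in for whichever agent is used, e.g. the one from bug 20807):

```
primitive resOST1 ocf:heartbeat:Filesystem \
    params device="/dev/dm-4" directory="/mnt/ost1" fstype="lustre" \
    op start timeout="600s" \
    op stop timeout="600s" \
    op monitor interval="120s" timeout="120s"
```

The 600s values reflect the 10-minute minimum suggested in the message above; recovery after an unclean shutdown can take considerably longer.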
Re: [Lustre-discuss] Problem re-mounting Lustre on an other node
On Wednesday 14 October 2009, Michael Schwartzkopff wrote: We have timeouts of 60 seconds. But we will move to 300. Thanks for the hint. Check out my bug report, that might not be sufficient. -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Is there a way to set lru_size and have it stick?
On Tuesday 13 October 2009, Andreas Dilger wrote: On 12-Oct-09, at 12:11, Lundgren, Andrew wrote: I have tried using: # lctl conf_param content-MDT.osc.lru_size=800 Seen this in the log: Oct 12 18:35:36 abcd0202 kernel: Lustre: Modifying parameter content-MDT-mdc.osc.lru_size in log content-client Oct 12 18:35:36 abcd0202 kernel: Lustre: Skipped 1 previous similar message But then on the clients, the lru_size doesn't seem to change: OSS # cat ./fs/lustre/ldlm/namespaces/*/lru_size 33 0 0 0 1 0 1 200 I have also set it for the OST individually from the MDS. It doesn't seem to do anything for the other machines. Is this a permanently tunable parameter, or am I just specifying the wrong setting? My apologies. Any parameter settable in a /proc/fs/lustre/ file can usually be specified as obd|fsname.obdtype.proc_file_name=value, e.g.: Thanks, I think this should go into the man page of lctl. * tunefs.lustre --param mdt.group_upcall=NONE /dev/sda1 * lctl conf_param testfs-MDT.mdt.group_upcall=NONE * lctl conf_param testfs.llite.max_read_ahead_mb=16 * ... testfs-MDT.lov.stripesize=2M * ... testfs-OST.osc.max_dirty_mb=29.15 * ... testfs-OST.ost.client_cache_seconds=15 * ... testfs.sys.timeout=40 However, it isn't currently possible to specify a conf_param tunable for ldlm settings, since they do not have their own OBD device and the tunable code is (unfortunately) slightly different than other parts of the Lustre proc tunables. Ages and ages ago, this was done because the externally-contributed lprocfs code was very buggy and we wanted to make sure that the ldlm /proc tunables (which were, at the time, the only ones that were actually required for Lustre functionality) would continue working while lprocfs was disabled until fixed. Until now, there was no reason to change that code, but it makes sense to fix that now... Could you file a bug on this? 
Done, bug 21084 Cheers, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Is there a way to set lru_size and have it stick?
On Saturday 10 October 2009, Andreas Dilger wrote: On 8-Oct-09, at 22:28, Lundgren, Andrew wrote: Is there a way to set the lru_size to a fixed value and have it stay that way across mounts? I know it can be set using: $ lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100)) But that isn’t retained across a reboot. lctl set_param is only for temporary tunable settings. You can use lctl conf_param to set a permanent tunable. Would you mind providing an example line? I never understood the logic of lctl conf_param. This fails: lctl conf_param ldlm.namespaces.testfs-OST0002-osc.lru_size=800 error: conf_param: Invalid argument Thanks, Bernd ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Setup mail cluster
On Monday 12 October 2009, Michael Schwartzkopff wrote: On Monday, 12 October 2009 15:54:04, Vadym wrote: Hello, I'm designing a mail service, so I have only one question: Can Lustre provide me a full automatic failover solution? No. See the lustre manual for this. You need a cluster solution for this. The manual is *hopelessly* outdated at this point. Do NOT use heartbeat any more. Use pacemaker as the cluster manager. See www.clusterlabs.org. When I find some time I want to write a HOWTO about setting up a Lustre cluster with pacemaker and OpenAIS. Also see bug 20807 (https://bugzilla.lustre.org/show_bug.cgi?id=20807) for a pacemaker agent. Cheers, Bernd ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Is there a way to set lru_size and have it stick?
On Saturday 10 October 2009, Andreas Dilger wrote: On 8-Oct-09, at 22:28, Lundgren, Andrew wrote: Is there a way to set the lru_size to a fixed value and have it stay that way across mounts? I know it can be set using: $ lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100)) But that isn’t retained across a reboot. lctl set_param is only for temporary tunable settings. You can use lctl conf_param to set a permanent tunable. Would you mind providing an example line? I never understood the logic of lctl conf_param. This fails: lctl conf_param ldlm.namespaces.testfs-OST0002-osc.lru_size=800 error: conf_param: Invalid argument And this as well lctl conf_param testfs-MDT.ldlm.namespaces.testfs-OST0002-osc.lru_size=800 Thanks, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
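Since conf_param cannot (yet) reach the ldlm namespace tunables, the value has to be re-applied with set_param after each mount, e.g. from an init script. The arithmetic from the recipe above can be factored out as a helper (the function name is mine):

```shell
#!/bin/sh
# lru_size_for_cpus N: the lru_size recommended in this thread for a
# client with N CPUs (N * 100).
lru_size_for_cpus() {
    echo $(( $1 * 100 ))
}

# On a client one would then run, after each mount:
#   lctl set_param ldlm.namespaces.*osc*.lru_size=$(lru_size_for_cpus "$(getconf _NPROCESSORS_ONLN)")
```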
Re: [Lustre-discuss] Moving MGS to separate device
Hello Wojciech, I already did this several times, here are the steps I so far used: 1) Remove MGS from MDT-device tunefs.lustre --nogms /dec/mdt_device 2) Create new MGS mkfs.lustre --mgs /dev/mgs_device 3) Make sure OSTs and MDTs re-register with the MGS: tunefs.lustre --writeconf /dev/device I'm not sure if writeconf is really necessary, but so far I always did it to make sure everything goes smoothly (clients shouldn't have the filesystem mounted at this step). 5) Mount MGS, MDT, OSTs 4) Re-apply settings done with lctl. As you also wrote (private mail), it might be possible to just copy over the CONFIGS directory, but I never tried to do that. Hope it helps, Bernd On Saturday 10 October 2009, Wojciech Turek wrote: Hi, I am very interested in finding out how to move co-located MGS to separate disk. I will be moving my MDTs to new hardware soon and I would like to separate MGS from MDT. I will be grateful for some info on this subject please. Many thanks, Wojciech 2008/6/23 Andreas Dilger adil...@sun.com On Jun 17, 2008 12:40 -0700, Klaus Steden wrote: I have a question ... if the MGS is used so infrequently relative to the use of the MDS, why is it (is it?) problematic to locate it on the same volume as the MDT? If you have multiple MDTs on the same MDS node (i.e. multiple Lustre filesystems) then it is difficult to start up the MGS separately from the MDT if it is co-located with one of the MDTs. It isn't impossible (with some manual mounting of the underlying filesystems) to move a co-located MGS to a separate filesystem if needed. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Client complaining about duplicate inode entry after luster recovery
Hello Wojciech,

Bug 17485 has a patch that landed in 1.8 to prevent duplicate references to OST objects from coming up after MDS failover. But if you create duplicate entries yourself, it won't help, of course. Bug 20412 has such a valid use case for duplicate MDT files and also lots of patches for lfsck, since the default way to fix such issues wasn't suitable for us.

Hmm, bug 18748 came up when I tested lustre-1.6.7 + the patch from bug 17485 at CIEMAT, and somehow filesystem corruption came up. I'm still not sure what the main culprit for the corruption was, either the initial patch of bug 17485 or the MDS issue with 1.6.7. Unfortunately I still didn't get a test system with at least 100 clients to reproduce the test. So in principle you shouldn't run into it, at least not with corrupted objects. I guess it will be fixed once you fix the filesystem with e2fsck and lfsck. I'm only surprised that vanilla 1.6.6 works for you, it has so many bugs...

Cheers,
Bernd

On Sunday 11 October 2009, Wojciech Turek wrote:

Hi Bernd, many thanks for your reply. I found this bug last night, and as far as I can see there is no fix for it yet? I am preparing the dbs to run lfsck on the affected file systems. I also found bug 18748 and I must say we have exactly the same problems; it just looks like we ran into that problem a few months after CIEMAT did. As far as I know, if we can see this message it means that there are files with missing objects. The worst part is that we don't know when and why files lose their objects. It just happens spontaneously and there aren't any lustre messages that could give us a clue. Users run jobs, and some time after their files were written some of these files get corrupted/lose objects (?-); trying to access these files for the first time triggers the 'lvbo' message. We have a third lustre file system which runs on different hardware but the same lustre and RHEL versions as the affected ones. I cannot see any problems on the third file system.
Wojciech

2009/10/10 Bernd Schubert bs_li...@aakef.fastmail.fm

ASSERTION(old_inode->i_state & I_FREEING) is the infamous bug 17485. You will need to run lfsck to fix it.

On Saturday 10 October 2009, Wojciech Turek wrote:

Hi, did you get to the bottom of this? We are having exactly the same problem with our lustre-1.6.6 (rhel4) file systems. Recently it got worse and the MDS crashes quite frequently; when we run e2fsck there are errors that get fixed. However, after some time we still see the same errors in the logs about missing objects, and files get corrupted (?---). Also, clients LBUG quite frequently with this message: (osc_request.c:2904:osc_set_data_with_check()) LBUG. This looks like a serious lustre problem, but so far I didn't find any clues on it even after a long search through the lustre bugzilla. Our MDSs and OSSs are on UPSes, the RAID is behaving OK, and we don't see any errors in the syslog. I would be grateful for some hints on this one.

Wojciech

2009/8/24 rishi pathak mailmaverick...@gmail.com

Hi, our lustre fs comprises 15 OST/OSS and 1 MDS with no failover. Clients as well as servers run lustre-1.6 and kernel 2.6.9-18.
Doing an ls -ltr for a directory in the lustre fs throws the following errors (as taken from the lustre logs) on the client:

0008:0002:0:1251099455.304622:0:724:0:(osc_request.c:2898:osc_set_data_with_check()) ### inconsistent l_ast_data found ns: scratch-OST0005-osc-81201e8dd800 lock: 811f9af04000/0xec0d1c36da6992fd lrc: 3/1,0 mode: PR/PR res: 570622/0 rrc: 2 type: EXT [0-18446744073709551615] (req 0-18446744073709551615) flags: 10 remote: 0xb79b445e381bc9e6 expref: -99 pid: 22878
0008:0004:0:1251099455.337868:0:724:0:(osc_request.c:2904:osc_set_data_with_check()) ASSERTION(old_inode->i_state & I_FREEING) failed: Found existing inode 811f2cf693b8/197272544/1895600178 state 0 in lock: setting data to 8118ef8ed5f8/207519777/1771835328
:0004:0:1251099455.360090:0:724:0:(osc_request.c:2904:osc_set_data_with_check()) LBUG

On the scratch-OST0005 OST it shows:

Aug 24 10:22:53 yn266 kernel: LustreError: 3023:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for resource 569204: rc -2
Aug 24 10:22:53 yn266 kernel: LustreError: 3023:0:(ldlm_resource.c:851:ldlm_resource_add()) Skipped 19 previous similar messages
Aug 24 12:40:43 yn266 kernel: LustreError: 2737:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for resource 569195: rc -2
Aug 24 12:44:59 yn266 kernel: LustreError: 2835:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for resource 569198: rc -2

We are getting these kinds of errors for many clients.

## History ## Prior
Re: [Lustre-discuss] Client complaining about duplicate inode entry after lustre recovery
ASSERTION(old_inode->i_state & I_FREEING) is the infamous bug 17485. You will need to run lfsck to fix it.

On Saturday 10 October 2009, Wojciech Turek wrote:

Hi, did you get to the bottom of this? We are having exactly the same problem with our lustre-1.6.6 (rhel4) file systems. Recently it got worse and the MDS crashes quite frequently; when we run e2fsck there are errors that get fixed. However, after some time we still see the same errors in the logs about missing objects, and files get corrupted (?---). Also, clients LBUG quite frequently with this message: (osc_request.c:2904:osc_set_data_with_check()) LBUG. This looks like a serious lustre problem, but so far I didn't find any clues on it even after a long search through the lustre bugzilla. Our MDSs and OSSs are on UPSes, the RAID is behaving OK, and we don't see any errors in the syslog. I would be grateful for some hints on this one.

Wojciech

2009/8/24 rishi pathak mailmaverick...@gmail.com

Hi, our lustre fs comprises 15 OST/OSS and 1 MDS with no failover. Clients as well as servers run lustre-1.6 and kernel 2.6.9-18.
Doing an ls -ltr for a directory in the lustre fs throws the following errors (as taken from the lustre logs) on the client:

0008:0002:0:1251099455.304622:0:724:0:(osc_request.c:2898:osc_set_data_with_check()) ### inconsistent l_ast_data found ns: scratch-OST0005-osc-81201e8dd800 lock: 811f9af04000/0xec0d1c36da6992fd lrc: 3/1,0 mode: PR/PR res: 570622/0 rrc: 2 type: EXT [0-18446744073709551615] (req 0-18446744073709551615) flags: 10 remote: 0xb79b445e381bc9e6 expref: -99 pid: 22878
0008:0004:0:1251099455.337868:0:724:0:(osc_request.c:2904:osc_set_data_with_check()) ASSERTION(old_inode->i_state & I_FREEING) failed: Found existing inode 811f2cf693b8/197272544/1895600178 state 0 in lock: setting data to 8118ef8ed5f8/207519777/1771835328
:0004:0:1251099455.360090:0:724:0:(osc_request.c:2904:osc_set_data_with_check()) LBUG

On the scratch-OST0005 OST it shows:

Aug 24 10:22:53 yn266 kernel: LustreError: 3023:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for resource 569204: rc -2
Aug 24 10:22:53 yn266 kernel: LustreError: 3023:0:(ldlm_resource.c:851:ldlm_resource_add()) Skipped 19 previous similar messages
Aug 24 12:40:43 yn266 kernel: LustreError: 2737:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for resource 569195: rc -2
Aug 24 12:44:59 yn266 kernel: LustreError: 2835:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for resource 569198: rc -2

We are getting these kinds of errors for many clients.

## History ##

Prior to these occurrences, our MDS showed signs of failure in that the cpu load was shooting above 100 (on a quad-core, quad-socket system) and users were complaining about slow storage performance. We took it offline and ran fsck on the unmounted MDS and OSTs. fsck on the OSTs went fine, but it showed some errors which were fixed. For a data integrity check, the mdsdb and ostdb databases were built and lfsck was run on a client (the client was mounted with abort_recov).
lfsck was run in the following order:

1) lfsck with no fix - reported dangling inodes and orphaned objects
2) lfsck with -l (back up orphaned objects)
3) lfsck with -d and -c (delete orphaned objects and create missing OST objects referenced by the MDS)

After the above operations, on the clients we were seeing files in red and blinking; doing a stat on them came back with the error 'no such file or directory'. My question is whether the order in which lfsck was run (should lfsck be run multiple times?) and the errors we are getting are related or not.

--
Regards
Rishi Pathak
National PARAM Supercomputing Facility
Center for Development of Advanced Computing (C-DAC)
Pune University Campus, Ganesh Khind Road
Pune, Maharashtra

--
Bernd Schubert
DataDirect Networks
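The three-pass lfsck run described above can be sketched as follows. The database paths, mount point, and exact option spellings are assumptions (check the lfsck manual for your release); with DRY_RUN=1 (the default here) the commands are only printed, not executed:

```shell
#!/bin/sh
# Sketch of the lfsck pass order described above: read-only check, back up
# orphans, then delete orphans and recreate missing objects.

MDSDB=${MDSDB:-/tmp/mdsdb}                      # built with e2fsck --mdsdb
OSTDBS=${OSTDBS:-"/tmp/ostdb.0 /tmp/ostdb.1"}   # one db per OST (--ostdb)
MNT=${MNT:-/mnt/lustre}                         # Lustre client mount point

# Print each command instead of executing it when DRY_RUN=1 (the default).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

lfsck_passes() {
    # Pass 1: read-only, reports dangling inodes and orphaned objects
    # ($OSTDBS is intentionally unquoted so it splits into one arg per db)
    run lfsck -n --mdsdb "$MDSDB" --ostdb $OSTDBS "$MNT"
    # Pass 2: back up orphaned objects (-l) before anything is deleted
    run lfsck -l --mdsdb "$MDSDB" --ostdb $OSTDBS "$MNT"
    # Pass 3: delete orphans (-d) and create missing OST objects (-c)
    run lfsck -d -c --mdsdb "$MDSDB" --ostdb $OSTDBS "$MNT"
}

lfsck_passes
```

The dry-run pass makes it easy to confirm the order (check first, back up, then fix) before letting lfsck modify anything.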
Re: [Lustre-discuss] Is there a way to set lru_size and have it stick?
On Friday 09 October 2009, Lundgren, Andrew wrote:

Is there a way to set the lru_size to a fixed value and have it stay that way across mounts? I know it can be set using:

$ lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))

But that isn't retained across a reboot.

Even worse: if for some reason the connection to the OSTs gets lost, e.g. through evictions, it will also reset to the default. For now, we are compiling our packages with LRU resize disabled.

--
Bernd Schubert
DataDirect Networks
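Short of rebuilding packages, one common workaround (a sketch, not an official persistence mechanism) is to re-apply the setting from a boot or mount script on each client. With DRY_RUN=1 (the default here) the lctl call is only printed:

```shell
#!/bin/sh
# Sketch: recompute and re-apply lru_size after every client mount, since
# `lctl set_param` survives neither reboots nor OST reconnects.

NR_CPU=$(getconf _NPROCESSORS_ONLN)   # online CPUs on this client
LRU_SIZE=$((NR_CPU * 100))            # NR_CPU*100 formula quoted above

# Print the command instead of executing it when DRY_RUN=1 (the default).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Quoting keeps the shell from glob-expanding the parameter pattern;
# lctl does its own wildcard matching on the namespace names.
run lctl set_param "ldlm.namespaces.*osc*.lru_size=$LRU_SIZE"
```

Because the value also resets after evictions, such a script would need to run not just at boot but whenever the client remounts or reconnects.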
[Lustre-discuss] 1.8 download link outdated
Hello,

this link still points to the alpha version; I guess it should be updated the same way as the v1.6 one: http://downloads.lustre.org/public/lustre/v1.8/

Cheers,
Bernd
--
Bernd Schubert
DataDirect Networks
Re: [Lustre-discuss] mds server crashing
Hello Mag,

sorry for my late reply. I think there is a misunderstanding: the bug I'm talking about is triggered if you export Lustre via knfsd. It does not matter whether you use any other NFS services on your MDS/OSS systems. But if you do export Lustre over NFS using the in-kernel NFS daemon, try disabling that.

Cheers,
Bernd

On Sunday 15 March 2009, Mag Gam wrote:

This happened again :-( Basically, there is a process called ll_mdt30 which is taking up 100% of the CPU. I am not sure what it is doing, but I can't even reboot the system; I have to hard reboot. Also, I checked my other OSTs and MDS and I don't have anything special for NFS in /etc/modules.conf.

On Sat, Mar 14, 2009 at 8:35 AM, Mag Gam magaw...@gmail.com wrote:

Hey Bernd, thanks for the reply. Interesting, we are using it with NFS too. Is there something in particular we need to do, like enabling port 988 in /etc/modules.conf, which I think I am already doing?

Any chance you can send traces with line wrap disabled? With line wrapping it is quite hard to read.

Of course! I even posted a bug report with the /tmp/lustre.log: https://bugzilla.lustre.org/show_bug.cgi?id=18802 Let me know if you need anything else. TIA

On Sat, Mar 14, 2009 at 7:35 AM, Bernd Schubert bernd.schub...@fastmail.fm wrote:

On Saturday 14 March 2009, Mag Gam wrote:

We are having a problem with an MDS server (which also has 1 OST on the box). When the server boots up, we notice there is an ll_mdt process running at 100%, and we keep on waiting close to 10-15 mins. We only have 8 clients. (I assume this is the normal recovery process.) However, if I manually mount the mdt without any recovery, everything is fine.

Hmm, I have seen that with 1.6.4.3 and NFS exports, but that should be fixed in 1.6.5. Although I'm not sure, since we switched NFS exports to unfs3 when the problem came up.
Mar 12 10:11:02 protected_host_01 kernel: Pid: 10375, comm: ll_mdt_10 Tainted: G 2.6.18-92.1.17.el5_lustre.1.6.7smp #1
Mar 12 10:11:02 protected_host_01 kernel: RIP: 0010:[888ed8df] [888ed8df] :ldiskfs:do_split+0x3ef/0x560
Mar 12 10:11:02 protected_host_01 kernel: RSP: 0018:8103d2a5f460 EFLAGS: 0216
Mar 12 10:11:02 protected_host_01 kernel: RAX: RBX: 0080 RCX:
Mar 12 10:11:02 protected_host_01 kernel: RDX: 0080 RSI: 8103cd52177c RDI: 8103cd52176c

Any chance you can send traces with line wrap disabled? With line wrapping it is quite hard to read.

Cheers,
Bernd
Re: [Lustre-discuss] mds server crashing
On Saturday 14 March 2009, Mag Gam wrote:

We are having a problem with an MDS server (which also has 1 OST on the box). When the server boots up, we notice there is an ll_mdt process running at 100%, and we keep on waiting close to 10-15 mins. We only have 8 clients. (I assume this is the normal recovery process.) However, if I manually mount the mdt without any recovery, everything is fine.

Hmm, I have seen that with 1.6.4.3 and NFS exports, but that should be fixed in 1.6.5. Although I'm not sure, since we switched NFS exports to unfs3 when the problem came up.

Mar 12 10:11:02 protected_host_01 kernel: Pid: 10375, comm: ll_mdt_10 Tainted: G 2.6.18-92.1.17.el5_lustre.1.6.7smp #1
Mar 12 10:11:02 protected_host_01 kernel: RIP: 0010:[888ed8df] [888ed8df] :ldiskfs:do_split+0x3ef/0x560
Mar 12 10:11:02 protected_host_01 kernel: RSP: 0018:8103d2a5f460 EFLAGS: 0216
Mar 12 10:11:02 protected_host_01 kernel: RAX: RBX: 0080 RCX:
Mar 12 10:11:02 protected_host_01 kernel: RDX: 0080 RSI: 8103cd52177c RDI: 8103cd52176c

Any chance you can send traces with line wrap disabled? With line wrapping it is quite hard to read.

Cheers,
Bernd
[Lustre-discuss] source download doesn't work
Hello,

since the end of last week I have been trying to download the sources of 1.6.7, but I always get:

We are sorry ... General Error
We are sorry, but the download system cannot process your request at this time. Please try again later. If the problem persists, please report it to Customer Service.

I'm going to report it to Customer Service now and will also try to find the proper cvs branch.

Thanks,
Bernd