Re: [lustre-discuss] (LFSCK) LBUG: ASSERTION( get_current()->journal_info == ((void *)0) ) failed
Hi Cédric,

I'm by no means familiar with the Lustre code anymore, but based on the stack trace and function names, it seems to be a problem with the journal. Maybe try an 'e2fsck -f', which would replay the journal and possibly clean up the file it has a problem with.

Cheers,
Bernd

On Wednesday, September 14, 2016 9:28:38 AM CEST Cédric Dufour - Idiap Research Institute wrote:
> Hello,
>
> Last Friday, during normal operations, our MDS froze with the following
> LBUG, which happens again as soon as one mounts the MDT again:
>
> Sep 13 15:10:28 n00a kernel: [ 8414.600584] LustreError: 11696:0:(osd_handler.c:936:osd_trans_start()) ASSERTION( get_current()->journal_info == ((void *)0) ) failed:
> Sep 13 15:10:28 n00a kernel: [ 8414.612825] LustreError: 11696:0:(osd_handler.c:936:osd_trans_start()) LBUG
> Sep 13 15:10:28 n00a kernel: [ 8414.619833] Pid: 11696, comm: lfsck
> Sep 13 15:10:28 n00a kernel: [ 8414.619835] Call Trace:
> Sep 13 15:10:28 n00a kernel: [ 8414.619850] [] libcfs_debug_dumpstack+0x52/0x80 [libcfs]
> Sep 13 15:10:28 n00a kernel: [ 8414.619857] [] lbug_with_loc+0x42/0xa0 [libcfs]
> Sep 13 15:10:28 n00a kernel: [ 8414.619864] [] osd_trans_start+0x250/0x630 [osd_ldiskfs]
> Sep 13 15:10:28 n00a kernel: [ 8414.619870] [] ? osd_declare_xattr_set+0x58/0x230 [osd_ldiskfs]
> Sep 13 15:10:28 n00a kernel: [ 8414.619876] [] lod_trans_start+0x177/0x200 [lod]
> Sep 13 15:10:28 n00a kernel: [ 8414.619881] [] lfsck_namespace_double_scan+0x1122/0x1e50 [lfsck]
> Sep 13 15:10:28 n00a kernel: [ 8414.619888] [] ? thread_return+0x3e/0x10c
> Sep 13 15:10:28 n00a kernel: [ 8414.619894] [] ? enqueue_task_fair+0x58/0x5d
> Sep 13 15:10:28 n00a kernel: [ 8414.619899] [] lfsck_double_scan+0x5a/0x70 [lfsck]
> Sep 13 15:10:28 n00a kernel: [ 8414.619904] [] lfsck_master_engine+0x50d/0x650 [lfsck]
> Sep 13 15:10:28 n00a kernel: [ 8414.619909] [] ?
> lfsck_master_engine+0x0/0x650 [lfsck]
> Sep 13 15:10:28 n00a kernel: [ 8414.619915] [] kthread+0x7b/0x83
> Sep 13 15:10:28 n00a kernel: [ 8414.619918] [] ? finish_task_switch+0x48/0xb9
> Sep 13 15:10:28 n00a kernel: [ 8414.619924] [] child_rip+0xa/0x20
> Sep 13 15:10:28 n00a kernel: [ 8414.619928] [] ? kthread+0x0/0x83
> Sep 13 15:10:28 n00a kernel: [ 8414.619931] [] ? child_rip+0x0/0x20
>
> I originally had the LFSCK launched in "dry-run" mode:
>
> lctl lfsck_start --device lustre-1-MDT --dryrun on --type namespace
>
> The LFSCK was reported completed (I was 'watch[ing] -n 1' on a terminal)
> before the LBUG popped up; now, I can't even get any output:
>
> cat /proc/fs/lustre/mdd/lustre-1-MDT/lfsck_namespace # just hangs
> there indefinitely
>
> I remember seeing a lfsck_namespace file in the MDT's underlying LDISKFS;
> is there anything sensible I can do with it (e.g. would deleting it
> solve the situation)?
> What else could I do?
>
> Thanks for your answers and best regards,
>
> Cédric D.
>
> PS: I had this message originally posted on the HPDD-discuss mailing list
> and just realized it was the wrong place; sorry for any crossposting
>
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
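For readers landing on this thread: a minimal sketch of the forced check suggested above. The device path is purely hypothetical (the thread does not name the MDT device); never run this on a mounted MDT, and take a device-level backup first.

```shell
# Hypothetical MDT block device; adjust to your setup.
MDT_DEV=/dev/mapper/mdt_device

# Read-only trial pass first, to see what e2fsck would change:
e2fsck -fn "$MDT_DEV"

# If that output looks sane, the real forced check, which replays the
# journal and repairs inconsistencies:
e2fsck -f "$MDT_DEV"
```

Only after a clean check would one try mounting the MDT again.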
Re: [Lustre-discuss] Patchless kernel support?
Hi all,

I think Ashley means patchless server support. That is already tracked in bug #21524.

Ashley, while patchless server support certainly is a good idea, it might not always be as helpful as you believe. Updating the presently existing patches is usually rather straightforward. Far more difficult is when the VFS changes and new methods and configure checks have to be implemented in Lustre. That is what made it so difficult to update Lustre to 2.6.24, and now again the limit has been 2.6.32 (maybe the VFS changes had already gone in before, but I didn't track linux-git that closely recently). And those changes in the VFS are often also completely unrelated to kernel patches...

I also planned to work on the sd_iostats. But I think that patch simply should be dropped in favour of blktrace. The current blkiomon does almost the same as sd_iostats, but IMHO neither approach is really helpful. So I have a modified blkiomon version (not ready for patch submission yet) that produces stats similar to those of the DDN S2A controllers, and IMHO only such detailed stats are really helpful for analyzing IO patterns. If you ask me, neither sd_iostats, nor DDN SFA, nor upstream blkiomon has sufficiently detailed information to see where the problem is. I understand that blktrace has some overhead compared to sd_iostats. However, if sd_iostats is ever supposed to land upstream, it needs to be rewritten from procfs to debugfs; I think even sysfs is not suitable for it.

Cheers,
Bernd

On Thursday, November 25, 2010, Alexey Lyashkov wrote:
> Ashley,
> I don't clearly understand what you want. If you mean patchless support on the client: the typical size of adding support for one new kernel to the patchless client is ~40kB of patch for Lustre. Sometimes it is more work, sometimes less. As the last kernel supported by Lustre is 2.6.32, you should plan on a ~150kB patch for 2.6.37 kernel support.
> If you mean patchless kernel support: yes, that is possible, but it needs more work and submitting lots of patches to the upstream kernel.
>
> On Nov 25, 2010, at 15:18, Ashley Pittman wrote:
>> Picking up from something that was said at SC last week, I believe it was Andreas who mentioned the possibility of patch-less kernel support. This is something that would be immensely useful to us for a variety of reasons. Has there been any recent work into investigating how much work would be involved in implementing this, and what's the feeling on whether it could be done through changes to Lustre only, or is it a case of submitting a number of patches upstream?
>>
>> Ashley.

-- 
Bernd Schubert
DataDirect Networks

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] NFS problem after upgrade to 1.8.3
Hello Tina,

On Friday, November 12, 2010, Tina Friedrich wrote:
> Hello List,
> we re-export the file system via NFS for a couple of things. All the
> re-exporters are Red Hat 5.5 servers running kernel 2.6.18-194.17.1.el5
> (patchless clients).

That is your problem. You MUST use a patched version, or at least a kernel with an 8kB stack size. RHEL5 has 4kB by default, which is not sufficient, and therefore in early 1.8 versions a patch landed that disallowed NFS exports.

Cheers,
Bernd

-- 
Bernd Schubert
DataDirect Networks

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
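For what it's worth, here is a quick sketch (mine, not from the thread) of how to check a client kernel for the stack-size problem. It assumes a standard /boot layout; note that CONFIG_4KSTACKS exists only on 32-bit x86 kernels, and x86_64 kernels already use 8kB stacks.

```shell
# If this prints CONFIG_4KSTACKS=y, the running kernel uses 4kB stacks
# and is unsafe for NFS re-export of Lustre. If the option is absent
# (e.g. on x86_64), the kernel already has 8kB stacks.
grep CONFIG_4KSTACKS "/boot/config-$(uname -r)" 2>/dev/null \
    || echo "CONFIG_4KSTACKS not set"
```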
Re: [Lustre-discuss] NFS problem after upgrade to 1.8.3
Hello Tina,

On 11/12/2010 03:44 PM, Tina Friedrich wrote:
> Hello again,
> nope, running with / exporting from a server with the patched kernel
> running does not change this behaviour at all. mountvers=3 works, 1 and
> 2 don't.

I can reproduce it, so NFSv2 support got broken. Which issue has higher priority, tar or NFSv2?

Cheers,
Bernd

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Serious error: objid already exists; is this filesystem corrupt?
Hello Christopher, hello Alex,

the alternative is to let e2fsck correct LAST_ID. Patches are here:
https://bugzilla.lustre.org/show_bug.cgi?id=22734
and included in our e2fsprogs releases:
http://eu.ddn.com:8080/lustre/lustre/RHEL5/tools/e2fsprogs/
Unfortunately, the patches are not yet in the Oracle e2fsprogs version.

In order to let e2fsck correct it, you will need to create an mdsdb file (the hdr part is sufficient) and then run:

e2fsck --mdsdb mdsdb.hdr --ostdb some_irrelevant_file /dev/device

The procedure is similar to the lfsck preparations, although one usually runs that with -n. To let e2fsck (pass6, the db part) correct the LAST_ID, it must *not* run in read-only mode, though.

Cheers,
Bernd

On Thursday, November 04, 2010, Alexey Lyashkov wrote:
> Hi Christopher,
> you need to kill the lov_objid file on the MDS and set LAST_ID on the OST to 870397. In that case the MDS will reread the last_id from the OSTs and refill the lov_objid file, to avoid possible file corruption.
>
> On Nov 4, 2010, at 04:22, Christopher Walker wrote:
>> We recently had a hardware failure on one of our OSTs, which has caused some major problems for our 1.6.6-based array. We're now getting the error:
>>
>> Serious error: objid 517386 already exists; is this filesystem corrupt?
>>
>> on one of our OSTs. If I mount this OST as ldiskfs and look in O/0/d*, the highest objid I see is 870397, considerably higher than 517386. We've taken this OST through a round of e2fsck and ll_recover_lost_found_objs, during which it restored a lot of lost files, and e2fsck on this OST and on the MDT don't currently show any problems.
>>
>> Can I simply edit O/0/LAST_ID, set it to 870397, and expect files with objids between 517386 and 870397 to come back? Also, I could be wrong, but it looks like ll_recover_lost_found_objs.c only looks for lost files up to LAST_ID; if I reset LAST_ID to 870397, should I rerun ll_recover_lost_found_objs?
>> Many thanks in advance,
>> Chris

-- 
Bernd Schubert
DataDirect Networks

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
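As a sketch of the two-step procedure Bernd describes (device paths below are placeholders, and this assumes the patched DDN e2fsprogs linked above):

```shell
# Step 1: build the MDS database in a read-only pass over the MDT; only
# the header part of the resulting file is needed for the LAST_ID fix.
e2fsck -n -v --mdsdb /tmp/mdsdb /dev/mdt_device

# Step 2: check the OST against that database. Note: no -n here,
# otherwise pass6 (the db part) cannot correct LAST_ID.
e2fsck -v --mdsdb /tmp/mdsdb.hdr --ostdb /tmp/ostdb /dev/ost_device
```

Both targets must be unmounted while the checks run.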
Re: [Lustre-discuss] recovering formatted OST
Hello Wojciech,

I think both would work, but why not just create a small OST with mkfs.lustre on a loopback device, and then copy those files over to your recovered filesystem? Hmm, well, e2fsck might not have fixed all issues, and then a reformat indeed might be helpful. Also note: EAs on OST objects are nice to have, but not absolutely required.

Cheers,
Bernd

On Tuesday, October 26, 2010, Wojciech Turek wrote:
> Bernd,
> I would like to clarify if I understood your suggestion correctly:
> 1) create a new OST but using the old index and old label
> 2) mount it as ldiskfs and copy the recovered objects (using tar or rsync with xattrs support) from the old OST to the new OST
> 3) run --writeconf on the MDT and OSTs of that filesystem
> 4) mount the MDT and all OSTs
>
> I guess I could also do it this way:
> 1) back up the restored objects using tar or rsync with xattrs support
> 2) format the old OST with the old index and old label
> 3) restore the objects from the backup
>
> Do you think that would work?
> Best regards,
> Wojciech
>
> On 22 October 2010 18:52, Bernd Schubert bernd.schub...@fastmail.fm wrote:
>> Hmm, I would probably format a small fake device on a ramdisk and copy the files over, run tunefs --writeconf /mdt and then start everything (including all OSTs) again.
>> Cheers,
>>
>> On Friday, October 22, 2010, Wojciech Turek wrote:
>>> I have tried Bernd's suggestion and it seems to have worked; after running e2fsck -D, ll_recover_lost_found_objs didn't cause a kernel panic but moved a number of objects to the O directory. The problem is that I do not have the last_rcvd file, so the OST has no index at the moment. What would be the next step to enable access to those files in the filesystem?
>>> Best regards,
>>> Wojciech
>>>
>>> On 22 October 2010 17:15, Andreas Dilger andreas.dil...@oracle.com wrote:
>>>> On 2010-10-22, at 5:42, Bernd Schubert bernd.schub...@fastmail.fm wrote:
>>>>> Hmm, e2fsck didn't catch that? rec_len is the length of a directory entry, i.e. after how many bytes the next entry follows.
>>>> I agree that e2fsck should have caught that.
You can try to force e2fsck to do something about that: e2fsck -D No, I would recommend against using -D at this point. That will cause it to re-write the directory contents, and given that the filesystem was previously corrupted I would prefer making as few changes as possible before the data is estranged. Wojciech, note that if you are able to mount the filesystem you could just copy all of the objects (with xattrs!) from lost+found on the bad filesystem, along with the last_rcvd file (if you can find it) into a new ldiskfs filesystem and then run ll_recover_lost_found_objs on that. On Friday, October 22, 2010, Wojciech Turek wrote: Ok, removing and recreating the journal fixed that problem and I am able to mount device as ldiskfs filesystem. Now I hit another wall when trying to run ll_recover_lost_found_objs When I first time run ll_recover_lost_found_objs -d /mnt/ost/lost+found it only creates the O dir and exits. When I repeat this command again kernel panics. Any idea what could be the problem here? LDISKFS-fs error (device dm-4): ldiskfs_readdir: bad entry in directory #6831: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0 Aborting journal on device dm-4. 
Unable to handle kernel NULL pointer dereference at RIP: [88033448] :jbd:journal_commit_transaction+0xc5b/0x12db PGD 1a118d067 PUD 1ce7e7067 PMD 0 Oops: 0002 [1] SMP last sysfs file: /class/infiniband_mad/umad0/port CPU 3 Modules linked in: ldiskfs(U) crc16(U) autofs4(U) hidp(U) l2cap(U) bluetooth(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U) mlx4_vnic(U) mlx4_vnic_helper(U) ib_sa(U) ib_mthca(U) mptctl(U) dm_mirror(U) video(U) backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) i2c_core(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sr_mod(U) cdrom(U) mlx4_ib(U) ib_mad(U) ib_core(U) joydev(U) mlx4_core(U) usb_storage(U) pcspkr(U) shpchp(U) serio_raw(U) i5000_edac(U) edac_mc(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) sunrpc(U) mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) mppVhba(U) megaraid_sas(U) mppUpper(U) sg(U) sd_mod(U) scsi_mod(U) bnx2(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U) Pid: 11360, comm
Re: [Lustre-discuss] recovering formatted OST
On Tuesday, October 26, 2010, Wojciech Turek wrote:
> Hi,
> There is a LAST_ID file on the OST and indeed it equals the highest object number:
>
> [r...@oss09 ~]# od -Ax -td8 /tmp/LAST_ID
> 00 2490599
> 08
> [r...@oss09 ~]# ls -1s /mnt/ost/O/0/d* | grep -v [a-z] | sort -k2 -n | tail -1
> 8 2490599
>
> However the MDS seems to think differently:
>
> r...@mds03 ~]# lctl get_param osc.*.prealloc_last_id | grep OST0010
> osc.scratch2-OST0010-osc.prealloc_last_id=1

Yeah. Is this caused by deactivating the OST on the MDS?

> I have deactivated the OST on the MDS using this command:
> lctl --device 19 conf_param scratch2-OST0010.osc.active=0
>
> I looked into the lov_objid reported by the MDS but I am not sure how to interpret the output correctly:
>
> [r...@mds03 ~]# od -Ax -td8 /tmp/lov_objid
> 00 2073842 2100049
> 10 2115247 2038471
> 20 2119821 2190996
> 30 2029234 2354424
> 40 2160856 2167105
> 50 1970351 2059045
> 60 2706486 2571655
> 70 2662262 2628346
> 80 2490688 2668926
> 90 2631587 2643791
> a0
>
> So my question is how I can find out if my LAST_ID is fine?

Above you deactivated OST0010 (hex), so OST-16 in decimal (counting starts with zero). That should be 2490688 then.

I still wonder if we could convince e2fsck to set that last_id value on the OST itself. It already can correct a wrong last_id value, but it sets that to the last_id it finds on disk (https://bugzilla.lustre.org/show_bug.cgi?id=22734). Setting it to the MDS value should also work, but firstly, for sanity reasons it falls back to the on-disk value if the values differ too much (1), and secondly, I figured out with those patches there that using the MDS value is broken (and it did not get broken by the patches, but my patches revealed it...).

Cheers,
Bernd

-- 
Bernd Schubert
DataDirect Networks

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
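To make the od interpretation above concrete, here is a small self-contained sketch with synthetic values: lov_objid is an array of little-endian 64-bit integers, one per OST in index order, so the entry for OST index N starts at byte offset N*8 (OST0010 hex = index 16 decimal = offset 0x80, which is why the 2490688 entry shows up on the "80" row of the od output).

```shell
# Byte offset of the entry for OST index 16 in a lov_objid-style file:
printf 'offset of index 16: 0x%x\n' $((16 * 8))   # prints: offset of index 16: 0x80

# Build a two-entry sample file (values 1 and 2490688, little-endian;
# 2490688 = 0x260140) and decode it the same way as on a real MDS.
# od -td8 uses the host byte order, assumed little-endian here.
printf '\001\000\000\000\000\000\000\000\100\001\046\000\000\000\000\000' \
    > /tmp/lov_objid.sample
od -Ax -td8 /tmp/lov_objid.sample
```

The file name and values are made up for illustration; a real lov_objid lives on the MDT and should only ever be read this way, never written.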
Re: [Lustre-discuss] recovering formatted OST
Hello Lisa,

the OST index and the fsname identify the OST to the MGS, MDS and clients. If you reformat an OST and do not re-use the old index, it will leave a hole, as the new OST gets another index. And OST holes are an uncommon scenario that often triggers some bugs...

Cheers,
Bernd

On Tuesday, October 26, 2010, Lisa Giacchetti wrote:
> Wojciech,
> since you have successfully done step #4, can you tell me what you used in the reformat for the old index id? I tried to do this a few weeks ago and was not successful at reformatting an OST with the old index, because I am not clear on what the index is. I asked on this list at that time for input and did not get much. If you could provide the exact command you used, that would be good too.
> lisa
>
> On 10/26/10 10:31 AM, Wojciech Turek wrote:
>> Since some of our users started to recover their data from backups or by other means (rerunning jobs etc.) into the original locations, I don't think it would be a good idea to put the recovered OST back in service as it is, as that may cause some of the users' new files to be overwritten by the recovered files. To avoid that scenario I decided to reformat the old OST and put it back into the filesystem as empty.
>> 1) First I created a backup of the recovered object files
>> 2) then, using lfs find and lfs getstripe on the client, I created a list of files and object ids from the formatted OST
>> 3) using the backup from point 1 and the information from point 2, I copied the objects to a new location on the filesystem and renamed them to their original names. Now users can interrogate those files and choose which they want to keep.
>> 4) I reformatted the old OST with the old index id and old label
>>
>> Before I mount that OST into the filesystem I want to make sure that the MDS detects it as an empty OST and does not try to recreate missing objects. Would it be enough to remove lov_objid from the MDT and let it create a new lov_objid based on information from the OSTs, or do I need to first unlink all missing files from the client?
Best regards, Wojciech On 26 October 2010 05:36, Wojciech Turek wj...@cam.ac.uk mailto:wj...@cam.ac.uk wrote: Bernd, I would like to clarify if I understood you suggestion correctly: 1) create a new OST but using old index and old label 2) mount it as ldiskfs and copy recovered objects (using tar or rsync with xattrs support) from the old OST to the new OST 3) run --writeconf on MDT and OST of that filesystem 4) mount MDT and all OSTs I guess I could do it also that way: 1) backup restored object using tar or rsync with xattrs support 2) format old OST with old index and old label 3) restore Objects from the backup Do you think that would work? Best regards, Wojciech On 22 October 2010 18:52, Bernd Schubert bernd.schub...@fastmail.fm mailto:bernd.schub...@fastmail.fm wrote: Hmm, I would probably format a small fake device on a ramdisk and copy files over, run tunefs --writeconf /mdt and then start everything (inlcuding all OSTs) again. Cheers, On Friday, October 22, 2010, Wojciech Turek wrote: I have tried Bernd's suggestion and it seem to have worked, after running e2fsck -D ll_recover_lost_found_objs didn't cause kernel panic but moved a number of objects to O directory. Problem is that I do not have last_rcvd file so the OST has no index at the moment. What would be the next step to enable access to those files in the filesystem? Best regards, Wojciech On 22 October 2010 17:15, Andreas Dilger andreas.dil...@oracle.com mailto:andreas.dil...@oracle.com wrote: On 2010-10-22, at 5:42, Bernd Schubert bernd.schub...@fastmail.fm mailto:bernd.schub...@fastmail.fm wrote: Hmm, e2fsck didn't catch that? rec_len is the length of a directory entry, so after how many bytes the next entry follows. I agree that e2fsck should have caught that. You can try to force e2fsck to do something about that: e2fsck -D No, I would recommend against using -D at this point. 
That will cause it to re-write the directory contents, and given that the filesystem was previously corrupted I would prefer making as few changes
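A hedged sketch of how step 4 (reformatting with the old index) plus the subsequent writeconf might look; every name, NID and device path below is a placeholder, not taken from the thread:

```shell
# Recreate the OST with its old index (e.g. OST0010 hex = --index=16)
# and the old fsname, so no index hole is left behind:
mkfs.lustre --ost --fsname=scratch2 --index=16 \
    --mgsnode=mgs_host@o2ib /dev/ost_device

# With all targets unmounted, regenerate the configuration logs so
# everything re-registers cleanly:
tunefs.lustre --writeconf /dev/mdt_device   # on the MDS
tunefs.lustre --writeconf /dev/ost_device   # on each OSS, for each OST

# Then mount the MDT first, followed by all OSTs.
```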
Re: [Lustre-discuss] 1.8 quotas
Hello Jason,

please note that it is also possible to enable quotas using lctl, and that would not be visible via tunefs.lustre. I think the only real option to check whether quotas are enabled is to check if the quota files exist. For an online filesystem, 'debugfs -c /dev/device' is probably the safest way (there is also a 'secret' way to bind-mount the underlying ldiskfs to another directory, but I only use that for test filesystems and never in production, as I have not verified the kernel code path yet). Either way, you should check for lquota files, such as:

r...@rhel5-nfs@phys-oss0:~# mount -t ldiskfs /dev/mapper/ost_demofs_2 /mnt
r...@rhel5-nfs@phys-oss0:~# ll /mnt
[...]
-rw-r--r-- 1 root root  7168 Oct 23 09:48 lquota_v2.group
-rw-r--r-- 1 root root 71680 Oct 23 09:48 lquota_v2.user

(Of course, you should check that on those OSTs which have reported the slow quota messages.)

I just poked around a bit in the code, and above the fsfilt_check_slow() check there is also a loop that calls filter_range_is_mapped(). Now this function calls fs_bmap(), and when that eventually goes down to ext3, it might get a bit slow if another thread should modify that file (check out linux/fs/inode.c):

/*
 * bmap() is special. It gets used by applications such as lilo and by
 * the swapper to find the on-disk block of a specific piece of data.
 *
 * Naturally, this is dangerous if the block concerned is still in the
 * journal. If somebody makes a swapfile on an ext3 data-journaling
 * filesystem and enables swap, then they may get a nasty shock when the
 * data getting swapped to that swapfile suddenly gets overwritten by
 * the original zero's written out previously to the journal and
 * awaiting writeback in the kernel's buffer cache.
 *
 * So, if we see any bmap calls here on a modified, data-journaled file,
 * take extra steps to flush any blocks which might be in the cache.
 */

I don't know, though, if it can happen that several threads write to the same file.
But if it happens, it gets slow. I wonder if a possible swap file is worth the effort here... In fact, the reason to call filter_range_is_mapped() certainly does not require a journal flush in that loop. I will check myself next week whether journal flushes are ever made due to that, and open a Lustre bugzilla then. Avoiding all of that should not be difficult.

Cheers,
Bernd

On Saturday, October 23, 2010, Jason Hill wrote:
> Kevin/Dave/(and Dave from DDN):
> Thanks for your replies. From tunefs.lustre --dryrun it is very apparent that we are not running quotas. Thanks for your assistance.
>
>> That message, from lustre/obdfilter/filter_io_26.c, is the result of the thread taking 35 seconds from when it entered filter_commitrw_write() until after it called lquota_chkquota() to check the quota. However, it is certainly plausible that the thread was delayed because of something other than quotas, such as an allocation (e.g., it could have been stuck in filter_iobuf_get).
>> Kevin

-- 
Bernd Schubert
DataDirect Networks

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
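A sketch of the debugfs check Bernd describes above; the device path is a placeholder. The -c flag opens the device in catastrophic (read-only) mode, which is what makes it reasonably safe to use while the OST is mounted.

```shell
# Hypothetical OST device; -c opens it read-only without loading bitmaps.
debugfs -c -R 'ls -l /' /dev/mapper/ost_device 2>/dev/null | grep lquota \
    || echo "no lquota files found, so quotas are apparently not enabled"
```

If quotas are enabled, the listing should show lquota_v2.user and lquota_v2.group as in Bernd's example.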
Re: [Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K
Hello Michael,

On Saturday, October 23, 2010, Michael Kluge wrote:
> Hi Bernd,
> I get the same message with your kernel RPMs:
>
> In file included from include/linux/list.h:6,
>   from include/linux/mutex.h:13,
>   from /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/core/addr.c:36:
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_addons/backport/2.6.18_FC6/include/linux/stddef.h:9: error: redeclaration of enumerator 'false'
> include/linux/stddef.h:16: error: previous definition of 'false' was here
> /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_addons/backport/2.6.18_FC6/include/linux/stddef.h:11: error: redeclaration of enumerator 'true'
> include/linux/stddef.h:18: error: previous definition of 'true' was here
>
> Could it be that this '2.6.18 being almost a 2.6.28/29' confuses the OFED backports and the 2.6.18 backport does not work anymore? Is that solvable? I found nothing in the OFED bugzilla.

Somewhere there is a support matrix showing which OFED version supports which RHEL version, but I would also need to search for it. Anyway, OFED-1.4 is already included in 2.6.18-164, so there is no need for any additional compilation. 2.6.18-194 (RHEL5.5) also still mostly has OFED-1.4, but with an important Mellanox driver backport (you will still additionally need a beta version to get reliable QDR with recent chips). So if you have Mellanox QDR HCAs and your connection is flaky between SDR and QDR, just compile OFED-1.5; it works fine with Lustre (fortunately there have been no interface changes recently). But still make sure you compile Lustre against that stack...

I also just updated our download page a bit and uploaded sources for the kernel, Lustre, tar and e2fsprogs.

Cheers,
Bernd

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K
On Friday, October 22, 2010, Michael Kluge wrote:
> Hi list,
>
>> DID_BUS_BUSY means that the controller is unable to handle the SCSI command and is basically asking the host to send it again later.
>
> I had, I think, just one concurrent region and 32 threads running. What would be the appropriate action in this case? Reducing the queue depth on the HBA? We have Qlogic here; there is an option for the kernel module for this.

I think you ran into a known issue with the QLogic driver and the SFA10K. You will need at least qla2xxx version 8.03.01.06.05.06-k. And the optimal number of commands is likely to be 16 (with 4 OSSes connected).

Hope it helps,
Bernd

-- 
Bernd Schubert
DataDirect Networks

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] recovering formatted OST
Hmm, e2fsck didn't catch that? rec_len is the length of a directory entry, so after how many bytes the next entry follows. You can try to force e2fsck to do something about that: e2fsck -D Cheers, Bernd On Friday, October 22, 2010, Wojciech Turek wrote: Ok, removing and recreating the journal fixed that problem and I am able to mount device as ldiskfs filesystem. Now I hit another wall when trying to run ll_recover_lost_found_objs When I first time run ll_recover_lost_found_objs -d /mnt/ost/lost+found it only creates the O dir and exits. When I repeat this command again kernel panics. Any idea what could be the problem here? LDISKFS-fs error (device dm-4): ldiskfs_readdir: bad entry in directory #6831: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0 Aborting journal on device dm-4. Unable to handle kernel NULL pointer dereference at RIP: [88033448] :jbd:journal_commit_transaction+0xc5b/0x12db PGD 1a118d067 PUD 1ce7e7067 PMD 0 Oops: 0002 [1] SMP last sysfs file: /class/infiniband_mad/umad0/port CPU 3 Modules linked in: ldiskfs(U) crc16(U) autofs4(U) hidp(U) l2cap(U) bluetooth(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U) mlx4_vnic(U) mlx4_vnic_helper(U) ib_sa(U) ib_mthca(U) mptctl(U) dm_mirror(U) video(U) backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) i2c_core(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sr_mod(U) cdrom(U) mlx4_ib(U) ib_mad(U) ib_core(U) joydev(U) mlx4_core(U) usb_storage(U) pcspkr(U) shpchp(U) serio_raw(U) i5000_edac(U) edac_mc(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) sunrpc(U) mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) mppVhba(U) megaraid_sas(U) mppUpper(U) sg(U) sd_mod(U) scsi_mod(U) bnx2(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U) Pid: 11360, comm: 
kjournald Tainted: G 2.6.18-194.3.1.el5_lustre.1.8.4 #1 RIP: 0010:[88033448] [88033448] :jbd:journal_commit_transaction+0xc5b/0x12db RSP: 0018:8101c6481d90 EFLAGS: 00010246 RAX: RBX: RCX: RDX: RSI: 8101e9dab0c0 RDI: 81022fa46000 RBP: 81022fa46000 R08: 81022fa46068 R09: R10: 810105925b20 R11: fffa R12: R13: R14: 8101e9dab0c0 R15: FS: () GS:810107b9a4c0() knlGS: CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b CR2: CR3: 0001eaffb000 CR4: 06e0 Process kjournald (pid: 11360, threadinfo 8101c648, task 81021c14c0c0) Stack: 8101a61b9000 2b8263c0 113b0001 0013 0111 01282dd7 20dd Call Trace: [8003da91] lock_timer_base+0x1b/0x3c [8004b347] try_to_del_timer_sync+0x7f/0x88 [88037386] :jbd:kjournald+0xc1/0x213 [800a0ab2] autoremove_wake_function+0x0/0x2e [800a089a] keventd_create_kthread+0x0/0xc4 [880372c5] :jbd:kjournald+0x0/0x213 [800a089a] keventd_create_kthread+0x0/0xc4 [80032890] kthread+0xfe/0x132 [8005dfb1] child_rip+0xa/0x11 [800a089a] keventd_create_kthread+0x0/0xc4 [8014bcf4] deadline_queue_empty+0x0/0x23 [80032792] kthread+0x0/0x132 [8005dfa7] child_rip+0x0/0x11 Code: f0 0f ba 33 01 e8 42 fc 02 f8 8b 03 a8 04 75 07 8b 43 58 85 RIP [88033448] :jbd:journal_commit_transaction+0xc5b/0x12db RSP 8101c6481d90 CR2: 0Kernel panic - not syncing: Fatal exception On 22 October 2010 03:09, Andreas Dilger andreas.dil...@oracle.com wrote: On 2010-10-21, at 18:44, Wojciech Turek wj...@cam.ac.uk wrote: fsck has finished and does not find any more errors to correct. However when I try to mount the device as ldiskfs kernel panics with following message: Assertion failure in cleanup_journal_tail() at fs/jbd/checkpoint.c:459: blocknr != 0 Hmm, not sure, maybe your journal is broken? You can delete it with tune2fs -O ^has_journal (maybe after running e2fsck again to clear the journal), then re-create it with tune2fs -j. 
--- [cut here ] - [please bite here ] - Kernel BUG at fs/jbd/checkpoint.c:459 invalid opcode: [1] SMP last sysfs file: /class/infiniband_mad/umad0/ port CPU 2 Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) autofs4(U) hidp(U) l2cap(U)
Re: [Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K
Hello Michael,

I'm sorry to hear that. Unfortunately, I really do not have the time to port this version to your kernel version. I remember that you use Debian. But I guess you are still using a SLES kernel then? You could ask SuSE about it, although I guess they only care about SP1 with the 2.6.32 SLES kernel now. If you use Debian Lenny, the RHEL5 kernel should work (despite its name, it is internally more or less a 2.6.29 to 2.6.32 kernel). Later Debian and Ubuntu releases have a more recent udev, which requires at least 2.6.27. You could also ask our support department if they have any news for 2.6.27. I'm in Lustre engineering, and as we only support RHEL5 right now, I so far did not care about other kernel versions too much.

If nothing else helps, you will need to set the queue depth to 1, but that will also impose a big performance hit :(

Cheers,
Bernd

On Friday, October 22, 2010, Michael Kluge wrote:
> Hi Bernd,
> I have found a RHEL-only release for this version. It does not compile on a 2.6.27 kernel :( I actually don't want to go back to 2.6.18 just to get a new driver.
>
> Michael
>
> On Friday, 22.10.2010, 13:34 +0200, Bernd Schubert wrote:
>> On Friday, October 22, 2010, Michael Kluge wrote:
>>> Hi list,
>>>> DID_BUS_BUSY means that the controller is unable to handle the SCSI command and is basically asking the host to send it again later.
>>> I had, I think, just one concurrent region and 32 threads running. What would be the appropriate action in this case? Reducing the queue depth on the HBA? We have Qlogic here; there is an option for the kernel module for this.
>> I think you ran into a known issue with the QLogic driver and the SFA10K. You will need at least qla2xxx version 8.03.01.06.05.06-k. And the optimal number of commands is likely to be 16 (with 4 OSSes connected).
>> Hope it helps,
>> Bernd

-- 
Bernd Schubert
DataDirect Networks

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] recovering formatted OST
Hmm, I would probably format a small fake device on a ramdisk and copy the files over, run tunefs --writeconf /mdt and then start everything (including all OSTs) again. Cheers, On Friday, October 22, 2010, Wojciech Turek wrote: I have tried Bernd's suggestion and it seems to have worked; after running e2fsck -D, ll_recover_lost_found_objs didn't cause a kernel panic but moved a number of objects to the O directory. The problem is that I do not have the last_rcvd file, so the OST has no index at the moment. What would be the next step to enable access to those files in the filesystem? Best regards, Wojciech On 22 October 2010 17:15, Andreas Dilger andreas.dil...@oracle.com wrote: On 2010-10-22, at 5:42, Bernd Schubert bernd.schub...@fastmail.fm wrote: Hmm, e2fsck didn't catch that? rec_len is the length of a directory entry, i.e. after how many bytes the next entry follows. I agree that e2fsck should have caught that. You can try to force e2fsck to do something about it: e2fsck -D No, I would recommend against using -D at this point. That will cause it to re-write the directory contents, and given that the filesystem was previously corrupted I would prefer making as few changes as possible before the data is extracted. Wojciech, note that if you are able to mount the filesystem you could just copy all of the objects (with xattrs!) from lost+found on the bad filesystem, along with the last_rcvd file (if you can find it), into a new ldiskfs filesystem and then run ll_recover_lost_found_objs on that. On Friday, October 22, 2010, Wojciech Turek wrote: Ok, removing and recreating the journal fixed that problem and I am able to mount the device as an ldiskfs filesystem. Now I hit another wall when trying to run ll_recover_lost_found_objs. When I first run ll_recover_lost_found_objs -d /mnt/ost/lost+found it only creates the O dir and exits. When I repeat this command the kernel panics. Any idea what could be the problem here?
LDISKFS-fs error (device dm-4): ldiskfs_readdir: bad entry in directory #6831: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0 Aborting journal on device dm-4. Unable to handle kernel NULL pointer dereference at RIP: [88033448] :jbd:journal_commit_transaction+0xc5b/0x12db PGD 1a118d067 PUD 1ce7e7067 PMD 0 Oops: 0002 [1] SMP last sysfs file: /class/infiniband_mad/umad0/port CPU 3 Modules linked in: ldiskfs(U) crc16(U) autofs4(U) hidp(U) l2cap(U) bluetooth(U) rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U) mlx4_vnic(U) mlx4_vnic_helper(U) ib_sa(U) ib_mthca(U) mptctl(U) dm_mirror(U) video(U) backlight(U) sbs(U) power_meter(U) hwmon(U) i2c_ec(U) i2c_core(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) sr_mod(U) cdrom(U) mlx4_ib(U) ib_mad(U) ib_core(U) joydev(U) mlx4_core(U) usb_storage(U) pcspkr(U) shpchp(U) serio_raw(U) i5000_edac(U) edac_mc(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) sunrpc(U) mptsas(U) mptscsih(U) mptbase(U) scsi_transport_sas(U) mppVhba(U) megaraid_sas(U) mppUpper(U) sg(U) sd_mod(U) scsi_mod(U) bnx2(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U) Pid: 11360, comm: kjournald Tainted: G 2.6.18-194.3.1.el5_lustre.1.8.4 #1 RIP: 0010:[88033448] [88033448] :jbd:journal_commit_transaction+0xc5b/0x12db RSP: 0018:8101c6481d90 EFLAGS: 00010246 RAX: RBX: RCX: RDX: RSI: 8101e9dab0c0 RDI: 81022fa46000 RBP: 81022fa46000 R08: 81022fa46068 R09: R10: 810105925b20 R11: fffa R12: R13: R14: 8101e9dab0c0 R15: FS: () GS:810107b9a4c0() knlGS: CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b CR2: CR3: 0001eaffb000 CR4: 06e0 Process kjournald (pid: 11360, threadinfo 8101c648, task 81021c14c0c0) Stack: 8101a61b9000 2b8263c0 113b0001 0013 0111 01282dd7 20dd Call Trace: [8003da91] lock_timer_base+0x1b/0x3c [8004b347] 
try_to_del_timer_sync+0x7f/0x88 [88037386] :jbd:kjournald+0xc1/0x213 [800a0ab2] autoremove_wake_function+0x0/0x2e [800a089a] keventd_create_kthread+0x0/0xc4 [880372c5] :jbd:kjournald+0x0/0x213
Re: [Lustre-discuss] recovering formatted OST
Er no, mkfs.lustre --index=${the_right_index}. Cheers, Bernd On Friday, October 22, 2010, Wojciech Turek wrote: Ok, but this means that the new OST will come up with a new index (the next available one). Maybe this is a stupid question, but how will the MDS know that the missing files now reside on a new OST? On 22 October 2010 18:52, Bernd Schubert bernd.schub...@fastmail.fm wrote: Hmm, I would probably format a small fake device on a ramdisk and copy the files over, run tunefs --writeconf /mdt and then start everything (including all OSTs) again. Cheers,
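Bernd's point about reusing the old slot can be made concrete: the replacement (or fake) OST must be formatted with the original --index so the MDS keeps addressing it as the same target. The sketch below only assembles and prints the command; every value is a placeholder, and the flag names should be checked against your mkfs.lustre version before running anything:

```shell
# Assemble the reformat command for a replacement OST that reuses the
# original index (29 here, matching the hosed slot in this thread).
FSNAME=testfs                 # placeholder filesystem name
MGS_NID=192.168.0.1@tcp       # placeholder MGS NID
OST_INDEX=29                  # the index the formatted OST used to have
DEV=/dev/ram1                 # placeholder fake/ramdisk device
CMD="mkfs.lustre --reformat --ost --fsname=${FSNAME} --index=${OST_INDEX} --mgsnode=${MGS_NID} ${DEV}"
echo "$CMD"                   # printing only, not executing; review carefully
```

The real index can be read back from a surviving OST with `tunefs.lustre --print` before reformatting.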
On Friday, October 22, 2010, Wojciech Turek wrote: Ok, removing and recreating the journal fixed that problem and I am able to mount the device as an ldiskfs filesystem. Now I hit another wall when trying to run ll_recover_lost_found_objs. When I first run ll_recover_lost_found_objs -d /mnt/ost/lost+found it only creates the O dir and exits. When I repeat this command the kernel panics. Any idea what could be the problem here?
Re: [Lustre-discuss] recovering formatted OST
On Friday, October 22, 2010, Andreas Dilger wrote: On 2010-10-22, at 12:25, Wojciech Turek wrote: Actually I remember now, Andreas wrote some time ago that when one adds an OST into the same slot as the old one, the MDS will think that the OST has objects up to what the old OST had, and when the new OST starts it will recreate those objects, which may use a lot of inodes and space. So a loop device or ramdisk may not be enough for that? The ll_recover_lost_found_objs will at least recreate the O/0/LAST_ID file with the highest-available object ID, but given the corruption of the filesystem this may not cover all of the objects previously created. I would suggest reading the last_id for this OST from the MDS: mds# lctl get_param osc.*.prealloc_last_id and then using a binary editor to set the LAST_ID on the recovered OST, if it is significantly different. Hmm, if you remember, I have in my last_id patch a TODO: new tool?. What about simply manually creating an empty file on the OST with that ID (in the right obj-id % 32 directory) and then letting e2fsck do the job (I guess our DDN e2fsck is the only one which can do that so far). Cheers, Bernd
Re: [Lustre-discuss] recovering formatted OST
Hello Wojciech Turek, On Thursday, October 21, 2010, Wojciech Turek wrote: Hi Andreas, I restarted fsck after the segfault; it ran for several hours and then segfaulted again. Pass 3A: Optimizing directories Failed to optimize directory ??? (73031): EXT2 directory corrupted Failed to optimize directory ??? (73041): EXT2 directory corrupted Failed to optimize directory ??? (75203): EXT2 directory corrupted Failed to optimize directory ??? (75357): EXT2 directory corrupted Failed to optimize directory ??? (75744): EXT2 directory corrupted Failed to optimize directory ??? (75806): EXT2 directory corrupted Failed to optimize directory ??? (75825): EXT2 directory corrupted Failed to optimize directory ??? (75913): EXT2 directory corrupted Failed to optimize directory ??? (75926): EXT2 directory corrupted Failed to optimize directory ??? (76034): EXT2 directory corrupted Failed to optimize directory ??? (76083): EXT2 directory corrupted Failed to optimize directory ??? (76142): EXT2 directory corrupted Failed to optimize directory ??? (76266): EXT2 directory corrupted Failed to optimize directory ??? (76501): EXT2 directory corrupted Failed to optimize directory ??? (77133): EXT2 directory corrupted Failed to optimize directory ??? (77212): EXT2 directory corrupted Failed to optimize directory ??? (77817): EXT2 directory corrupted Failed to optimize directory ??? (77984): EXT2 directory corrupted Failed to optimize directory ??? (77985): EXT2 directory corrupted Segmentation fault Maybe try to disable dirindex? I noticed that the stack limit was quite low, so I have now changed it to unlimited; I also increased the limit on the number of open files (maybe it can help). Now I have another problem: after the last segfault I cannot restart fsck due to MMP.
e2fsck -fy /dev/scratch2_ost16vg/ost16lv e2fsck 1.41.10.sun2 (24-Feb-2010) e2fsck: MMP: fsck being run while trying to open /dev/scratch2_ost16vg/ost16lv The superblock could not be read or does not describe a correct ext2 filesystem. If the device is valid and it really contains an ext2 filesystem (and not swap or ufs or something else), then the superblock is corrupt, and you might try running e2fsck with an alternate superblock: e2fsck -b 32768 device Also when I try to access the filesystem via debugfs it fails: debugfs -c -R 'ls' /dev/scratch2_ost16vg/ost16lv debugfs 1.41.10.sun2 (24-Feb-2010) /dev/scratch2_ost16vg/ost16lv: MMP: fsck being run while opening filesystem ls: Filesystem not open Is there a way to clear the MMP flag so it allows fsck to run? You can try tune2fs -f -E clear-mmp. However, with a corrupted filesystem, that might not work. You can download a fixed e2fsprogs from my homepage that does allow running read-only operations (such as 'debugfs -c' or 'dumpe2fs -h'). Then you check which block is the MMP block and zero that. http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/e2fsprogs/ (which just reminds me, I need to upload it to our DDN download site) Also, do you really want to use data files that might have been zeroed in the middle? I think your recovery will, if at all, only be useful for small human-readable text files. Hope it helps, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] high CPU load limits bandwidth?
That is normal and probably comes from the page cache; it should be about the same for Lustre, ldiskfs, ext4, XFS, etc. It goes down if you specify -odirect, which is obviously not optimal on Lustre clients. Cheers, Bernd On Wednesday, October 20, 2010, Andreas Dilger wrote: Is this client CPU or server CPU? If you are using Ethernet it will definitely be CPU hungry and can easily saturate a single core. Cheers, Andreas On 2010-10-20, at 8:41, Michael Kluge michael.kl...@tu-dresden.de wrote: Hi list, is it normal that a 'dd' or an 'IOR' pushing 10MB blocks to a Lustre file system shows up with 100% CPU load in 'top'? The reason why I am asking is that I can write from one client to one OST with 500 MB/s. The CPU load will be at 100% in this case. If I stripe over two OSTs (which use different OSS servers and different RAID controllers) I will get 500 as well (seeing 2x250 MB/s on the OSTs). The CPU load will be at 100% again. A 'dd' on my desktop pushing 10M blocks to the local disk shows 7-10% CPU load. Are there ways to tune this behavior? Changing max_rpcs_in_flight and max_dirty_mb did not help. Regards, Michael
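The client-side CPU cost Bernd attributes to the page cache is visible with plain dd on any filesystem. A sketch; conv=fdatasync keeps the buffered (cache-copy) path but forces the data out before dd exits, while oflag=direct, where the filesystem supports it, is the cache-free path discussed above:

```shell
# Buffered write through the page cache; fdatasync makes dd wait for
# the data to reach storage so the timing is honest.
F=$(mktemp)
dd if=/dev/zero of="$F" bs=1M count=8 conv=fdatasync 2>/dev/null
SIZE=$(stat -c %s "$F")
echo "wrote $SIZE bytes"      # 8 MiB = 8388608 bytes
# Cache-free variant (needs filesystem support and aligned I/O):
#   dd if=/dev/zero of=somefile bs=1M count=8 oflag=direct
rm -f "$F"
```

Watching `time dd ...` for both variants shows the user/sys CPU difference the thread is about; on a Lustre client, O_DIRECT also disables client-side caching, which is usually a net loss.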
Re: [Lustre-discuss] mkfs options/tuning for RAID based OSTs
On Wednesday, October 20, 2010, Charland, Denis wrote: Brian J. Murrell wrote: On Tue, 2010-10-19 at 21:00 -0400, Edward Walter wrote: This is why the recommendations in this thread have continued to be using a number of data disks that divides evenly into 1MB (i.e. powers of 2: 2, 4, 8, etc.). So for RAID6: 4+2 or 8+2, etc. What about RAID5? Personally I don't like RAID5 too much, but with RAID5 it is obviously +1 instead of +2. Cheers, Bernd -- Bernd Schubert DataDirect Networks
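The power-of-two rule exists so a full RAID stripe equals the 1MB Lustre RPC; the corresponding mkfs alignment options fall out of the geometry. A sketch assuming a 128 KiB chunk per data disk and 4 KiB blocks (exactly the values that yield the stride=32,stripe-width=256 options quoted elsewhere in this digest):

```shell
# Derive ldiskfs alignment options from RAID geometry.
CHUNK_KB=128                              # per-disk chunk size (assumed)
DATA_DISKS=8                              # 8+2 RAID6 -> 8 data disks
BLOCK_KB=4                                # ldiskfs block size
STRIDE=$((CHUNK_KB / BLOCK_KB))           # fs blocks per disk chunk
STRIPE_WIDTH=$((STRIDE * DATA_DISKS))     # fs blocks per full stripe
FULL_STRIPE_KB=$((CHUNK_KB * DATA_DISKS)) # should be 1024 (= 1 MiB RPC)
echo "-E stride=${STRIDE},stripe-width=${STRIPE_WIDTH}"
```

With RAID5 the same arithmetic applies, only with one parity disk (e.g. 8+1), which is why the data-disk count is what must stay a power of two.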
Re: [Lustre-discuss] ldiskfs performance vs. XFS performance
For your final filesystem you still probably want to enable async journals (unless you are willing to enable the S2A unmirrored device cache). Most obdecho/obdfilter-survey bugs are gone in 1.8.4, except your ctrl+c problem, for which a patch exists: https://bugzilla.lustre.org/show_bug.cgi?id=21745 Cheers, Bernd On Wednesday, October 20, 2010, Michael Kluge wrote: Thanks a lot for all the replies. sgpdd shows 700+ MB/s for the device. We ran into one or two bugs with obdfilter-survey, as lctl has at least one bug in 1.8.3 when it uses multiple threads, and obdfilter-survey also causes an LBUG when you CTRL+C it. We see 600+ MB/s for obdfilter-survey over a reasonable parameter space after we changed to the ext4-based ldiskfs. So that seems to be the trick. Michael On Monday, 18.10.2010 at 14:04 -0600, Andreas Dilger wrote: On 2010-10-18, at 10:40, Johann Lombardi wrote: On Mon, Oct 18, 2010 at 01:58:40PM +0200, Michael Kluge wrote: dd if=/dev/zero of=$RAM_DEV bs=1M count=1000 mke2fs -O journal_dev -b 4096 $RAM_DEV mkfs.lustre --device-size=$((7*1024*1024*1024)) --ost --fsname=luram --mgsnode=$MDS_NID --mkfsoptions=-E stride=32,stripe-width=256 -b 4096 -j -J device=$RAM_DEV /dev/disk/by-path/... mount -t ldiskfs /dev/disk/by-path/... /mnt/ost_1 In fact, Lustre uses additional mount options (see Persistent mount opts in tunefs.lustre output). If your ldiskfs module is based on ext3, you should add the extents and mballoc options which are known to improve performance. Even then, the IO submission path of ext3 from userspace is not very good, and such a performance difference is not unexpected. When submitting IO from userspace to ext3/ldiskfs it is being done in 4kB blocks, and each block is allocated separately (regardless of mballoc, unfortunately). When Lustre is doing IO from the kernel, the client is aggregating the IO into 1MB chunks and the entire 1MB write is allocated in one operation.
That is why we developed the delalloc code for ext4 - so that userspace could also get better IO performance, and utilize the multi-block allocation (mballoc) routines that have been in ldiskfs for ages, but only accessible from the kernel. For Lustre performance testing, I would suggest looking at lustre-iokit, and in particular sgpdd to test the underlying block device, and then obdfilter-survey to test the local Lustre IO submission path. Cheers, Andreas -- Andreas Dilger Lustre Technical Lead Oracle Corporation Canada Inc. -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Maximum OST Size
On Wednesday, October 20, 2010, Andreas Dilger wrote: On 2010-10-19, at 08:27, Roger Spellman wrote: I don't understand this comment: For the MDT, yes, you could potentially use -i 1500 as about the minimum space per inode, but then you risk running out of space in the filesystem before running out of inodes. If we set -I to 512, then on an MDT, what else is there that would require 1500 bytes per inode? With -I 512 that means the actual inode will consume 512 bytes, so with -i 1536 there would be 1024 bytes per inode of block space still available. That extra space is needed for everything else in the filesystem, including the journal, directory blocks, Lustre metadata (last_rcvd, distributed transaction logs, etc), and any external xattr blocks for widely-striped files (beyond 12 stripes or so). I have to admit, I entirely fail to understand why we should need 2/3 of the filesystem reserved for real file data. - journal - 400MB - negligible with recent decent MDT sizes (1TiB+) - directory blocks - maybe, but I have not noticed any system where that takes more than 5% - Lustre metadata (last_rcvd, distributed transaction logs, etc) - negligible with recent decent MDT sizes - external xattrs for the Lustre lov and additional ACLs - maybe, depends on the customer With the default -i 4096, it looks like this for most customers I know of: df -h: 973G 57G 861G 7% /lustre/lustre/mdt df -ih: 278M 248M 31M 89% /lustre/lustre/mdt So doubling the inode count with -i 2048, or even quadrupling it with -i 1024, seems recommendable. Cheers, Bernd
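The trade-off is plain arithmetic over bytes-per-inode: -i N yields device_size/N inodes, and whatever the inodes themselves do not consume stays available for the journal, directories, and xattr blocks. A sketch for a 1 TiB MDT (roughly the 973G device quoted above), reproducing the ~268M inode count that -i 4096 implies:

```shell
# Inode counts for a 1 TiB MDT at different bytes-per-inode ratios.
MDT_BYTES=$((1024 * 1024 * 1024 * 1024))  # 1 TiB
for RATIO in 4096 2048 1024; do
    echo "-i ${RATIO} -> $((MDT_BYTES / RATIO)) inodes"
done
INODES_4096=$((MDT_BYTES / 4096))         # 268435456, i.e. ~268M
```

Halving the ratio doubles the inode count while still leaving (ratio - inode_size) bytes of block space per inode for everything else.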
Re: [Lustre-discuss] high CPU load limits bandwidth?
On Wednesday, October 20, 2010, Andreas Dilger wrote: On 2010-10-20, at 10:40, Michael Kluge michael.kl...@tu-dresden.de wrote: It is the CPU load on the client. The dd/IOR process is using one core completely. The clients and the servers are connected via DDR IB. LNET bandwidth is at 1.8 GB/s. Servers have 1.8.3, the client has 1.8.3 patchless. If you only have a single-threaded write, then saturating a CPU is somewhat unavoidable due to copy_from_user(). O_DIRECT will avoid this. Also, disabling data checksums and debugging can help considerably. There is a patch in bugzilla to add support for h/w crc32c on Nehalem CPUs to reduce this overhead, but it is still not as fast as no checksum at all. I think checksums are only visible in ptlrpc CPU time (and mostly only for reads), but not in the user-space benchmark process. Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] NFS export
Hello Alfonso, On Monday, October 18, 2010, Alfonso Pardo wrote: Hello, I need to export a Lustre directory from one Lustre client to another client, but always get the following message on the NFS server: Cannot export /data, possibly unsupported filesystem or fsid= required Any suggestions? Add fsid=$some_number to your NFS export line. Please also note that if you are using a RedHat system, you should use a patched Lustre server version on the NFS export node (the Lustre client), as the RedHat default 4K stack size is too small and Lustre-patched kernels have increased it to 8K (the default on all systems except RHEL). Cheers, Bernd PS: Btw, Ciemat has a DDN Lustre system, so you could also send requests to supp...@ddn.com (please add [Lustre] in the subject line). -- Bernd Schubert DataDirect Networks
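Bernd's fsid= fix as an exports line: Lustre has no stable block-device number from which NFS can derive an export identifier, hence the explicit fsid. Staged into a temporary file here; '/data' is the path from the question, and the client network spec is a placeholder:

```shell
# Stand-in for /etc/exports on the re-exporting Lustre client.
EXPORTS=$(mktemp)
echo '/data 192.168.1.0/24(rw,fsid=1,no_subtree_check)' > "$EXPORTS"
cat "$EXPORTS"
# afterwards: exportfs -ra   (not run here)
```

Any small positive integer works for fsid, as long as it is unique among the exports on that node.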
Re: [Lustre-discuss] NFS export
Do *NOT* use 1.8.0 please, that is really old. You can easily install an updated kernel and Lustre version on CentOS 5.2. So you may install upstream Oracle 1.8.4 (downloads.lustre.org). We also have patched 1.8.3-ddn3.3 with the latest RedHat security patches (and patches for Lustre): http://eu.ddn.com:8080/lustre/lustre/1.8.3/ddn3.3/ (1.8.4-ddnX is in testing). Cheers, Bernd On Monday, October 18, 2010, Alfonso Pardo wrote: Yes, I am using a RedHat system (CentOS 5.2). Please, could you tell me where I can find that patched Lustre server? OS: CentOS 5.2 Lustre version: Client lustre 1.8.0
Re: [Lustre-discuss] ldiskfs performance vs. XFS performance
Hello Michael, On Monday, October 18, 2010, Michael Kluge wrote: Hi list, we have Lustre 1.8.3 running on a DDN 9900. One LUN (10 discs) formatted with XFS shows 400 MB/s if driven with one 'dd' and large block sizes. One LUN formatted and mounted with ldiskfs (the ext3-based one that is the default in 1.8.3) shows 110 MB/s. Is this the expected behaviour? It looks a bit low compared to XFS. Yes, unfortunately not entirely unexpected with upstream Oracle versions. Firstly, please send a mail to supp...@ddn.com and ask for the udev tuning rpm (please add [Lustre] in the subject line). Then see this MMP issue here: https://bugzilla.lustre.org/show_bug.cgi?id=23129 which requires https://bugzilla.lustre.org/show_bug.cgi?id=22882 (as Lustre requires contributor agreements and self-signed agreements do not work anymore, that presently causes some headache, and as always with bureaucracy it takes ages to sort out - so landing our patches is delayed at present). In order to prevent data corruption in case of controller failures, you should also disable the S2A write-back cache and enable async journals instead on Lustre (enabled by default in DDN Lustre versions). We think with help from DDN we did everything we can from a hardware perspective. We formatted the LUN with the correct striping and stripe size, DDN adjusted some controller parameters and we even put the file system journal on a RAM disk. The LUN has 16 TB capacity. I formatted only 7 for the moment due to the 8 TB limit. You should use ext4-based ldiskfs to get more than 8TiB. Our releases use that as default. This is what I did: mds_nid...@somehwere RAM_DEV=/dev/ram1 dd if=/dev/zero of=$RAM_DEV bs=1M count=1000 mke2fs -O journal_dev -b 4096 $RAM_DEV mkfs.lustre --device-size=$((7*1024*1024*1024)) --ost --fsname=luram --mgsnode=$MDS_NID --mkfsoptions=-E stride=32,stripe-width=256 -b 4096 -j -J device=$RAM_DEV /dev/disk/by-path/... mount -t ldiskfs /dev/disk/by-path/...
/mnt/ost_1 Is there a way to push the bandwidth limit for a single data stream any further? While it could make support more difficult, you could use our DDN Lustre releases: http://eu.ddn.com:8080/lustre/lustre/1.8.3/ddn3.3/ Hope it helps, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Problem with LNET and openibd on Lustre 1.8.4 while rebooting
We then ran into the same problems with openibd hanging on shutdown. After a futile attempt to inject a lustre-unload-modules service between netfs and openib to run lustre_rmmod, I tried to hack modprobe.conf to eject the Lustre modules by inserting this: remove rdma_cm /usr/sbin/lustre_rmmod /sbin/modprobe -r --ignore-remove rdma_cm This didn't work either, because the openibd service script uses rmmod instead of modprobe -r (aargghh). All of that seems to be rather ugly workarounds. I think we need to figure out why rmmod of the InfiniBand modules does not simply fail when they are still in use by Lustre's o2ib module. Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Multi-Role/Tasking MDS/OSS Hosts
On Friday, September 17, 2010, Andreas Dilger wrote: On 2010-09-17, at 12:42, Jonathan B. Horen wrote: We're trying to architect a Lustre setup for our group, and want to leverage our available resources. In doing so, we've come to consider multi-purposing several hosts, so that they'll function simultaneously as MDS & OSS. You can't do this and expect recovery to work in a robust manner. The reason is that the MDS is a client of the OSS, and if they are both on the same node that crashes, the OSS will wait for the MDS client to reconnect and will time out recovery of the real clients. Well, that is some kind of design problem. Even on separate nodes it can easily happen that both MDS and OSS fail, for example on a power outage of the storage rack. In my experience situations like that happen frequently... I think some kind of pre-connection would be required, where a client can tell a server that it was rebooted and that the server shall not wait any longer for it. Actually, it shouldn't be that difficult, as different connection flags already exist. So if the client contacts a server and asks for an initial connection, the server could check for that NID and then immediately abort recovery for that client. Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Multi-Role/Tasking MDS/OSS Hosts
Hello Cory, On 09/17/2010 11:31 PM, Cory Spitz wrote: Hi, Bernd. On 09/17/2010 02:48 PM, Bernd Schubert wrote: On Friday, September 17, 2010, Andreas Dilger wrote: On 2010-09-17, at 12:42, Jonathan B. Horen wrote: We're trying to architect a Lustre setup for our group, and want to leverage our available resources. In doing so, we've come to consider multi-purposing several hosts, so that they'll function simultaneously as MDS & OSS. You can't do this and expect recovery to work in a robust manner. The reason is that the MDS is a client of the OSS, and if they are both on the same node that crashes, the OSS will wait for the MDS client to reconnect and will time out recovery of the real clients. Well, that is some kind of design problem. Even on separate nodes it can easily happen that both MDS and OSS fail, for example on a power outage of the storage rack. In my experience situations like that happen frequently... I think that just argues that the MDS should be on a separate UPS. Well, there is more than a single reason. Another hardware issue is that an IB switch may fail. And we have also seen cascading Lustre failures: it starts with an LBUG on the OSS, which triggers another problem on the MDS... Also, for us this will actually become a real problem which cannot be easily solved, so this issue will become a DDN priority. Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] mixing server versions
On Wednesday, September 15, 2010, Andreas Dilger wrote: On 2010-09-15, at 13:32, Brock Palen wrote: Thanks, that is great to know. Is there much risk to trying 1.8 and then backing off to 1.6 if there are issues? Risk of data loss? We do not test/support formatting at a higher Lustre version and then downgrading below the original version used for formatting. With 1.6-1.4 this definitely did not work, though I'm not sure if there are specific incompatibilities between 1.8-1.6. I go back and forth from 1.8 to 1.6 quite often. The only thing to take care about is to make sure the MDS does not get the extents flag if ext4-ldiskfs is used. I think only beginning with 1.8.4 is that ensured by Lustre itself. Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Cannot get an OST to activate
Assuming the disk really is empty then, and LAST_ID really is zero, shall I then leave it at zero, and follow the recommendation of page 23-14, i.e., just shut down again, delete the lov_objid file on the MDS, and restart the system? Certainly the value at the correct index (29) is definitely hosed: # od -Ax -td8 /mnt/mdt/lov_objid (snip) d0 292648 346413 e068225 -7137254917378053186 f0 5906459607 00010059227 59414 Yes, that is definitely hosed. Deleting the lov_objid file from the MDS and remounting the MDS should fix this value. You could also just binary edit the file and set this to 1. Andreas, Bob, please be very, very careful with lov_objid. As I already wrote last week, I reproducibly get a hard kernel panic when I delete the file and then mount the MDT again. You can try it, but DO CREATE A BACKUP of this file, so that you can copy it back if something goes wrong. Sorry, I don't have the time right now to work on the lov_objid-delete bug, not even time to write a suitable bug report :( Cheers, Bernd
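The binary edit Andreas describes can be rehearsed safely on a scratch file: lov_objid is an array of little-endian u64 values, one per OST index. This sketch plants a bogus value in slot 29 (the hosed index above) and patches it to 1 with dd; on a real MDT, heed Bernd's warning and work only with a backup in hand:

```shell
# Build a scratch 32-slot lov_objid-style file (all zeros).
F=$(mktemp)
dd if=/dev/zero of="$F" bs=8 count=32 2>/dev/null
# Plant a bogus value in slot 29, then patch it to 1 (little-endian u64).
printf '\xff\xff\xff\xff\xff\xff\xff\xff' | dd of="$F" bs=8 seek=29 conv=notrunc 2>/dev/null
printf '\x01\x00\x00\x00\x00\x00\x00\x00' | dd of="$F" bs=8 seek=29 conv=notrunc 2>/dev/null
# Read the slot back: skip 29*8 bytes, decode one unsigned 8-byte value.
VAL=$(od -An -t u8 -j $((29 * 8)) -N 8 "$F" | tr -d ' ')
echo "slot 29 = $VAL"
```

conv=notrunc is what keeps dd from truncating the file after the patched slot, so the remaining OST entries survive intact.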
Re: [Lustre-discuss] Large directory performance
On Saturday, September 11, 2010, Andreas Dilger wrote: On 2010-09-10, at 12:11, Michael Robbert wrote: Create performance is a flat line of ~150 files/sec across the board. Delete performance is all over the place, but no higher than 3,000 files/sec... Then yesterday I was browsing the Lustre Operations Manual and found section 33.8, which says Lustre is tested with directories as large as 10 million files in a single directory and still gets lookups at a rate of 5,000 files/sec. That leaves me wondering 2 things: how can we get 5,000 files/sec for anything, and why is our performance dropping off so suddenly after 20k files? Here is our setup: All IO servers are Dell PowerEdge 2950s, 2 sockets with X5355 @ 2.66GHz and 16GB of RAM. The data is on DDN S2A 9550s with 8+2 RAID configuration connected directly with 4Gb Fibre Channel. Are you using the DDN 9550s for the MDT? That would be a bad configuration, because they can only be configured with RAID-6, and would explain why you are seeing such bad performance. For the MDT you always Unfortunately, we failed to copy the scratch MDT in a reasonable time so far. Copying several hundreds of millions of files turned out to take ages ;) But I guess Mike did the benchmarks for the other filesystem with an EF3010. We have as many as 1.4 million files in a single directory and we now have half a billion files that we need to deal with in one way or another. Mike, is there a chance you can try which rate acp reports? http://oss.oracle.com/~mason/acp/ Also could you please send me your exact bonnie line or script? We could try to reproduce it on an idle test 9550 with a 6620 for metadata (the 6620 is slower for that than the EF3010). Thanks, Bernd -- Bernd Schubert DataDirect Networks
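The create rates being compared can be measured with a few shell lines; pointed at a Lustre-mounted directory this reproduces the files/sec figure discussed above (here it just runs in a local scratch directory). GNU date with nanosecond resolution is assumed:

```shell
# Create N empty files in one directory and report the rate.
DIR=$(mktemp -d)
N=500
START=$(date +%s%N)
i=1
while [ "$i" -le "$N" ]; do : > "$DIR/f$i"; i=$((i + 1)); done
END=$(date +%s%N)
COUNT=$(ls "$DIR" | wc -l)
ELAPSED_NS=$((END - START))
echo "created $COUNT files"
if [ "$ELAPSED_NS" -gt 0 ]; then
    echo "rate: $((COUNT * 1000000000 / ELAPSED_NS)) files/sec"
fi
rm -rf "$DIR"
```

Running it twice with N an order of magnitude apart is a quick way to see the kind of drop-off past ~20k files that Michael reports.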
Re: [Lustre-discuss] Cannot get an OST to activate
On Friday, September 03, 2010, Bob Ball wrote: We added a new OSS to our 1.8.4 Lustre installation. It has 6 OSTs of 8.9TB each. Within a day of having these on-line, one OST stopped accepting new files. I cannot get it to activate. The other 5 seem fine. On the MDS, lctl dl shows it IN, but not UP, and files can be read from it: 33 IN osc umt3-OST001d-osc umt3-mdtlov_UUID 5 However, I cannot get it to re-activate: lctl --device umt3-OST001d-osc activate [...] LustreError: 4697:0:(filter.c:3172:filter_handle_precreate()) umt3-OST001d: ignoring bogus orphan destroy request: obdid 11309489156331498430 last_id 0 Can anyone tell me what must be done to recover this disk volume? Check out section 23.3.9 in the Lustre manual (How to Fix a Bad LAST_ID on an OST). It is on my TODO list to write a tool to automatically correct the lov_objid, but as of now I don't have it yet. Somehow your lov_objid file has a completely wrong value for this OST. Now, when you say files can be read from it, are you sure there are already files on that OST? Because the error message says that the last_id is zero, so you should not have a single file on it. If that is also wrong, you will need to correct it as well. You can do that manually, or you can use a patched e2fsprogs version that will do that for you. Patches are here: https://bugzilla.lustre.org/show_bug.cgi?id=22734 Packages can be found on my home page: http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/e2fsprogs/ If you want to do it automatically, you will need to create an lfsck mdsdb file (the hdr file is sufficient, see the lfsck section in the manual) and then you will need to run e2fsck for that OST as if you want to create an OSTDB file. That will start pass6, and if you then run e2fsck *without* -n, so in correcting mode, it will correct the LAST_ID file to what it finds on disk. With -v it will also tell you the old and the new value, and then you will need to put that value, properly coded, into the MDS lov_objid file.
Be careful and create backups of the lov_objid and LAST_ID files. Hope it helps, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Cannot get an OST to activate
On Friday, September 03, 2010, Bernd Schubert wrote: [...] That will start pass6, and if you then run e2fsck *without* -n, so in correcting mode, it will correct the LAST_ID file to what it finds on disk. With -v it will also tell you the old and the new value, and then you will need to put that value, properly coded, into the MDS lov_objid file. Update for the lov_objid file: actually, if you rename or delete it (rename it, please, so that you have a backup), the MDS should be able to re-create it from the OSTs' LAST_ID data. So if the troublesome OST has no data yet, it will be very easy; if it already has data, you will need to correct the LAST_ID on that OST first. Cheers, Bernd -- Bernd Schubert DataDirect Networks
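The rename-instead-of-delete advice can be sketched like this, demonstrated on a scratch directory standing in for the MDT mounted as type ldiskfs (the paths are stand-ins; the real lov_objid sits in the root of the ldiskfs-mounted MDT). Keep the backup until the MDS has verifiably recreated the file from the OSTs' LAST_ID values.

```shell
mdt=/tmp/fake-mdt                  # stands in for the ldiskfs-mounted MDT
mkdir -p "$mdt"
printf 'dummy' > "$mdt/lov_objid"  # fake content for the demo
# backup FIRST, then rename out of the way; the MDS recreates it on mount
cp -p "$mdt/lov_objid" "$mdt/lov_objid.bak.$(date +%Y%m%d)"
mv "$mdt/lov_objid" "$mdt/lov_objid.renamed"
ls "$mdt"
```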
Re: [Lustre-discuss] MDT backup (using tar) taking very long
On Thursday, September 02, 2010, Frederik Ferner wrote: Hi list, we are currently reviewing our backup policy for our Lustre file system, as backups of the MDT are taking longer and longer. Yes, that is due to the size-on-mds feature, which was introduced in 1.6.7.2. See bug https://bugzilla.lustre.org/show_bug.cgi?id=21376 It has a patch that also got accepted in upstream tar last week. You may find updated RHEL5 tar packages on my home page: http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/ Cheers, Bernd -- Bernd Schubert DataDirect Networks
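For reference, the file-level MDT backup this thread is about usually follows the pattern below (as described in the Lustre manual: save the extended attributes, then tar with --sparse, since MDT inodes carry sizes but no file data). The sketch runs on a scratch tree so it is safe to try; on a real system you would first mount the MDT device, or an LVM snapshot of it, as type ldiskfs, and the EA step (commented out) needs the attr tools.

```shell
src=/tmp/fake-mdt-tree
mkdir -p "$src/ROOT"
# a 1MB sparse file stands in for an MDT inode with size-on-mds but no data
dd if=/dev/zero of="$src/ROOT/sparse-file" bs=1 count=1 seek=1048575 2>/dev/null
cd "$src"
# getfattr -R -d -m '.*' -e hex -P . > /tmp/ea.bak   # EA step on a real MDT
tar --sparse -czf /tmp/mdt-backup.tgz .
tar -tzf /tmp/mdt-backup.tgz
```

Restoring is the reverse: untar onto the new device, then setfattr from the saved EA file.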
Re: [Lustre-discuss] brief 'hangs' on file operations
On Thursday, September 02, 2010, Andreas Dilger wrote: On 2010-09-02, at 06:43, Tina Friedrich wrote: Causing the most grief at the moment is that we sometimes see delays writing files. From the writing client's end, it simply looks as if I/O stops for a while (we've seen 'pauses' of anything up to 10 seconds). This appears to be independent of which client does the writing, and of the software doing the writing. We investigated this a bit using strace and dd; the 'slow' calls appear to always be either open, write, or close calls. Usually, these take well below 0.001s; in around 0.5% or 1% of cases, they take up to multiple seconds. It does not seem to be associated with any specific OST, OSS, client or anything; there is nothing in any log files or any exceptional load on the MDS or OSSes or any of the clients. This is most likely associated with delays in committing the journal on the MDT or OST, which can happen if the journal fills completely. Having larger journals can help, if you have enough RAM to keep them all in memory and not overflow. Alternately, if you make the journals small it will limit the latency, at the cost of reducing overall performance. A third alternative might be to use SSDs for the journal devices. As Diamond uses DDN hardware, it should help in general with performance to update to 1.8 and to enable the async journal feature. I guess it also might help to reduce those delays, as writes are more optimized. A question, though. Tina, do you use our DDN udev rules, which tune the devices for optimized performance? If not, please send a mail to supp...@ddn.com and ask for a recent udev rpm (available for RHEL5 only so far; it also *might* work on SLES11, but udev syntax changes too often, IMHO). And please put [lustre] into the subject line, as the Lustre team maintains them. Cheers, Bernd
Re: [Lustre-discuss] MDT backup (using tar) taking very long
On Thursday, September 02, 2010, Frederik Ferner wrote: Bernd Schubert wrote: On Thursday, September 02, 2010, Frederik Ferner wrote: we are currently reviewing our backup policy for our Lustre file system as backups of the MDT are taking longer and longer. Yes, that is due to the size-on-mds feature, which was introduced in 1.6.7.2. See bug https://bugzilla.lustre.org/show_bug.cgi?id=21376 It has a patch that also got accepted in upstream tar last week. You may find updated RHEL5 tar packages on my home page: Thanks, I'll give that a go. (Any chance of adding the SRPM to your download page?) I don't like SRPMs too much, so I uploaded a tar.bz2 instead. It is a hg repository and mq-managed, so patches (including those Red Hat added) are in .hg/patches. You will find another bugfix there compared to the -sun packages. Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] blk_rq_check_limits errors
On Thursday, September 02, 2010, Frank Heckes wrote: Hi all, on some of our OSSes a massive amount of errors like: Sep 2 20:28:15 jf61o02 kernel: blk_rq_check_limits: over max size limit. is appearing in /var/log/messages (and dmesg). Does anyone have a clue how to get at the root cause? Many thanks in advance. From linux/block/blk-core.c:

int blk_rq_check_limits(struct request_queue *q, struct request *rq)
{
	if (rq->cmd_flags & REQ_DISCARD)
		return 0;

	if (blk_rq_sectors(rq) > queue_max_sectors(q) ||
	    blk_rq_bytes(rq) >> 9 > queue_max_hw_sectors(q)) {
		printk(KERN_ERR "%s: over max size limit.\n", __func__);
		return -EIO;
	}

I haven't seen that before, but if I should guess, I would guess that dm-* has a larger queue than your underlying block device. If that is with your DDN storage, can you verify whether all those devices have max_sectors_kb tuned to the maximum? Also, does that come up with 1.8.4 only? (I have SG_ALL in mind, which was increased from 255 to 256, which might not be supported by all SCSI host adapters.) Cheers, Bernd
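The max_sectors_kb check suggested above can be sketched as follows. The sysfs root is parameterized only so the logic can be demonstrated here against a fake tree; on a real OSS you would call it as root with /sys, and raising the soft limit to the hardware limit is the usual DDN tuning mentioned in the thread.

```shell
tune_max_sectors() {   # $1 = sysfs root (/sys on a live system)
    for q in "$1"/block/*/queue; do
        [ -e "$q/max_sectors_kb" ] || continue
        cur=$(cat "$q/max_sectors_kb")
        hw=$(cat "$q/max_hw_sectors_kb")
        if [ "$cur" -lt "$hw" ]; then
            echo "$hw" > "$q/max_sectors_kb"   # lift soft limit to hw limit
            echo "$(basename "$(dirname "$q")"): $cur -> $hw"
        fi
    done
}
# dry run against a fake sysfs tree; on an OSS: tune_max_sectors /sys
root=/tmp/fakesys
mkdir -p "$root/block/sdb/queue"
echo 512   > "$root/block/sdb/queue/max_sectors_kb"
echo 32767 > "$root/block/sdb/queue/max_hw_sectors_kb"
tune_max_sectors "$root"
# prints: sdb: 512 -> 32767
```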
Re: [Lustre-discuss] MDS memory usage
Hello Frederik, On Wednesday, August 25, 2010, Frederik Ferner wrote: Hi Bernd, thanks for your reply. Bernd Schubert wrote: On Tuesday, August 24, 2010, Frederik Ferner wrote: on our MDS we noticed that all memory seems to be used. (And it's not just normal buffers/cache as far as I can tell.) When we put load on the machine, for example by starting rsync on a few clients, generating file lists to copy data from Lustre to local disks, or just running an MDT backup locally using dd/gzip to copy an LVM snapshot to a remote server, kswapd starts using a lot of CPU time, sometimes up to 100% of one CPU core. This is on a Lustre 1.6.7.2.ddn3.5 based file system with about 200TB; the MDT is 800GB with 200M inodes, ACLs enabled. Did you recompile it, or did you use the binaries from my home page (or those you got from CV)? This is a recompiled Lustre version to include the patch from bug 22820. Possibly it is an LRU auto-resize problem, which has been disabled in DDN builds. As our 1.6 releases didn't include a patch for that, you would need to specify the correct configure options if you recompiled it. I guess it's likely that I have not specified the correct option. So the binaries on your home page are compiled with '--disable-lru-resize'? Any other options that you used? I always enable health-write, which will help pacemaker to detect IO errors (by monitoring /proc/fs/lustre/health_check): --enable-health-write Another reason might be bug 22771, although that should only come up on an MDS with more memory than you have. I had a look at that bug, and while we have a default stripe count of 1, so the striping should fit into the inode, on the other hand we use ACLs in quite a few places, so it seems we might hit this bug if we increase the memory from the current 16GB, correct? Yeah, and I think 16GB should be sufficient for the MDS.
-- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] MDS memory usage
On Tuesday, August 24, 2010, Frederik Ferner wrote: Hi list, on our MDS we noticed that all memory seems to be used. (And it's not just normal buffers/cache as far as I can tell.) When we put load on the machine, for example by starting rsync on a few clients, generating file lists to copy data from Lustre to local disks, or just running an MDT backup locally using dd/gzip to copy an LVM snapshot to a remote server, kswapd starts using a lot of CPU time, sometimes up to 100% of one CPU core. This is on a Lustre 1.6.7.2.ddn3.5 based file system with about 200TB; the MDT is 800GB with 200M inodes, ACLs enabled. Did you recompile it, or did you use the binaries from my home page (or those you got from CV)? Possibly it is an LRU auto-resize problem, which has been disabled in DDN builds. As our 1.6 releases didn't include a patch for that, you would need to specify the correct configure options if you recompiled it. Another reason might be bug 22771, although that should only come up on an MDS with more memory than you have. Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Fwd: Lustre and Large Pages
Last week there was an article on lwn.net about transparent hugepages, discussed during the fourth Linux storage and filesystem summit. According to that article, we might be lucky and those patches might go into RHEL6. If you do not have an lwn.net account, you might need to wait a few weeks: http://lwn.net/Articles/398846/ It links an older article about it, which should already be available to everyone: http://lwn.net/Articles/359158/ And another one: http://lwn.net/Articles/374424/ Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Question on setting up fail-over
On Tuesday, August 10, 2010, David Noriega wrote: So your script resets the server so there is no fail-over (i.e. the other server takes over resources from that server?), or there is failover but you then manually return resources back to the server that was reset? Our DDN ipmi stonith script (external/ipmi_ddn in heartbeat/pacemaker stonith terms) only makes absolutely sure the node was really reset. If something fails, an error code is reported to pacemaker, and then pacemaker (*) will not initiate resource fail-over, in order to prevent split-brain. As Lustre devices use MMP (multiple-mount protection), that is not strictly required in principle. But if something goes wrong, e.g. MMP was accidentally not enabled, a double mount could come up, and that would cause serious filesystem and data corruption... Cheers, Bernd PS: (*) heartbeat-v1 (and v2/v3 if not in xml/crm mode) also *should* accept stonith error codes, but I have seen it more than once that heartbeat-v1 ran into split-brain and started resources on both cluster nodes. That is something where pacemaker does a much better job. -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] How to achieve 20GB/s file system throughput?
On Saturday, July 24, 2010, henry...@dell.com wrote: Hello, one of my customers wants to set up an HPC system with thousands of compute nodes. The parallel file system should have 20GB/s throughput. I am not sure whether Lustre can make it. How many IO nodes are needed to achieve this target? My assumption is that 100 or more IO nodes (rack servers) are needed. I'm a bit prejudiced, of course, but with DDN storage that would be quite simple. With the older DDN S2A 9900 you can get 5GB/s per controller pair; with the newer SFA10K you can get 6.5 to 7GB/s (we are still tuning it) per controller pair. Each controller pair (couplet in DDN terms) usually has 4 servers connected and fits into a single rack in a 300-drive configuration. So you can get 20GB/s with 3 or 4 racks and 12 or 16 OSS servers, which is much below your 100 IO nodes ;) Cheers, Bernd -- Bernd Schubert DataDirect Networks
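The sizing above works out as a simple division; the sketch below makes the assumed inputs explicit (~6.5 GB/s per couplet and 4 OSS servers per couplet, both taken from the mail, not authoritative figures).

```shell
# Back-of-envelope OSS sizing for a 20 GB/s target.
target_gbs=20
per_couplet_tenths=65   # 6.5 GB/s per couplet, kept in tenths for integer math
couplets=$(( (target_gbs * 10 + per_couplet_tenths - 1) / per_couplet_tenths ))
echo "$couplets couplets -> $((couplets * 4)) OSS servers"
# prints: 4 couplets -> 16 OSS servers
```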
Re: [Lustre-discuss] NFS Export Issues
On Tuesday, July 20, 2010, William Olson wrote: As far as I remember, we had to explicitly align the mounting user uids and gids. So the mounting uid:gid must be known (/etc/passwd and groups, I think) on the MDS and allowed to mount stuff. Could it be a root-squash problem by accident? I've tried it explicitly with the no_root_squash option and it still behaves the same way. What I find really frustrating is that if I unmount Lustre, I can mount the same NFS export, no problems. As soon as Lustre is mounted to that directory, I can no longer mount that NFS export. I don't understand where it's failing. Thanks for all your help so far, any more ideas? Have you tried adding fsid=xxx to your exports line? I think with recent Lustre versions (I don't remember the implementation details) it should not be required any more, nor should it be with recent nfs-utils and util-linux (the filesystem uuid is automatically used with those, instead of the device major/minor as fsid), but maybe both types of workaround conflict on your system? You might also consider simply using unfs3, although performance will be limited to about 120MB/s, as unfs3 is only single-threaded. It also does not support NFS locks. If it still does not work out, you should enable Lustre debugging and NFS debugging, and you probably should use wireshark to see what is going on. Hope it helps, Bernd -- Bernd Schubert DataDirect Networks
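For illustration, a hypothetical /etc/exports line with an explicit fsid as suggested above; the path, client network, and the fsid value 17 are placeholders (fsid just has to be unique among the exports on that server):

```
# /etc/exports on the NFS gateway re-exporting the Lustre mount
/mnt/lustre  192.168.0.0/24(rw,no_root_squash,fsid=17,sync)
```

After editing, `exportfs -ra` reloads the table.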
Re: [Lustre-discuss] I/O error on clients
On Tuesday, July 20, 2010, Christopher J. Morrone wrote: On 07/07/2010 01:04 AM, Gabriele Paciucci wrote: Hi, the ptlrpc bug is a problem, but I don't find in Peter's logs any reference to an eviction caused by the ptlrpc, but instead to a timeout during the communication between an OST and the client. But Peter could downgrade to 1.8.1.1, which does not suffer from the problem. The bug that I describe does not have any messages about the ptlrpcd performing evictions. The server's "I think it's dead, and I am evicting it" and other messages about the server timing out on the client are the only messages that you will see with the bug that I described. But like I said, there are many possible reasons for timeouts, so it could easily be something else. For the record, while stress testing Lustre I can easily reproduce evictions with any Lustre version. However, it is terribly difficult to debug without additional tools. I have opened a bugzilla for that, but I don't think I will have time for those tools any time soon. https://bugzilla.lustre.org/show_bug.cgi?id=23190 -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] tunefs.lustre --print fails on mounted mdt/ost with mmp
On Wednesday, July 14, 2010, Andreas Dilger wrote: On 2010-07-14, at 13:29, Nate Pearlstein wrote: Just checking to be sure this isn't a known bug or problem. I couldn't find a bz for this, but it would appear that tunefs.lustre --print fails on a Lustre mdt or ost device if mounted with MMP. Is this expected behavior? Not really expected... It is reading the mountdata file via debugfs, so that should be safe even on a mounted filesystem, but it doesn't work with MMP: # debugfs -c -R stats /dev/vgbackup/lvtest debugfs 1.41.10.sun2 (24-Feb-2010) /dev/vgbackup/lvtest: MMP: device currently active while opening filesystem stats: Filesystem not open This is already fixed in our next release of e2fsprogs, however, thanks to a patch from Jim Garlick @ LLNL. It is the first hunk of the patch at: https://bugzilla.lustre.org/attachment.cgi?id=30441 Actually, that is not required, as debugfs opens the device read-only. I would recommend the first patch from https://bugzilla.lustre.org/show_bug.cgi?id=22421 which also allows running other e2fs tools, such as e2fsck -n, dumpe2fs -h and tunefs.lustre --print, on mounted devices. Cheers, Bernd
Re: [Lustre-discuss] files missing after writeconf
If the device really has been reformatted and the data is very important, there would also be the option to follow the recovery procedure we went through last year, after an MDT was accidentally reformatted. It took several months, also because I was busy with lots of parallel tasks; in the end it was mostly successful. Not perfect, but at least a big part of the directory structure could be recovered (also thanks to helpful discussions with Andreas). I probably should write a document on what needs to be done and publish the tools. But even with that it will be time consuming, although with the existing tools it will take much less time than last time... Cheers, Bernd On Friday, July 09, 2010, Andreas Dilger wrote: Unmount the MDS and mount it as type ldiskfs and list the ROOT directory. If there are no files there, then it seems that somehow you have deleted or reformatted the MDS filesystem. You could also check lost+found at that point, in case your files were moved by e2fsck for some reason. Check 'dumpe2fs -h' on the MDS device to see what the format time is. If there are no more files on the MDS, then the best you can do is to run lfsck and link all the orphan objects into the Lustre lost+found dir and look at the file contents to identify them. If you have a backup, it would be easier to just restore from that. Sorry. Cheers, Andreas On 2010-07-08, at 19:34, David Gucker dguc...@choopa.com wrote: When bringing up the cluster after a full powerdown, the MDS/MGS node was reporting the following for each of the OSTs: Jul 8 17:16:18 ID6317 kernel: LustreError: 13b-9: Test01-OST claims to have registered, but this MGS does not know about it, preventing registration. Jul 8 17:16:18 ID6317 kernel: LustreError: 26184:0:(mgs_handler.c:660:mgs_handle()) MGS handle cmd=253 rc=-2 I have two OSSes and checked back to my mkfs commands, and it looks like I forgot to enable failover in the options. So I found that I could update that flag using tunefs.lustre.
Looking into that a bit, I found that I should run it with the --writeconf flag as well. So, I unmounted the OSTs and ran: tunefs.lustre --param failover.mode=failout /dev/iscsi/ost-1.target0 on each of them. After doing this (and maybe remounting the mds/mgs), I was able to mount the OSTs, and then mounted the client, but all data was missing. The filesystem reports 11% full, which is about right for the data that was on there, but no files. After reading the docs a bit better, I found that I should have done things more properly (fully shut down and unloaded the filesystem, then done the writeconf beginning with the mgs). So I tried running through the procedure a little better, and the filesystem is in the same state (appears to be fine, just shows used space and no files). I was unable to recreate this in another test cluster (no data loss). So, I'm wondering if these files are recoverable at all? Can anyone point me in the right direction, if there is one? Dave
Re: [Lustre-discuss] short writes
On Thursday, July 08, 2010, Brian J. Murrell wrote: On Thu, 2010-07-08 at 07:53 -0600, Kevin Van Maren wrote: Hi David, Hey Kevin, http://www.opengroup.org/onlinepubs/95399/functions/write.html Heh. Funny enough, I was reading the exact same URL. I always thought libc should handle the retry for you by default, but I didn't write the spec. write(2) is a system call, not a libc function. fwrite(3) is a comparable libc function, so libc might be able to handle short write(2)s in fwrite(3), but really it should not (IMHO) be mucking with write(2) (or any other) system calls. You have to keep in mind that Gaussian is a Fortran application. Fortran has its own IO library, and it is quite possible that the library of some compilers can handle a short write while the library of other compilers cannot... In my university group we had to deal with quite some weird effects between Fortran IO implementations... David, did you use PGI or another compiler? Last time I had to deal with Gaussian, only PGI was supported, but I have not checked for recent Gaussian versions. Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] How to determine which lustre clients are loading filesystem.
On 07/08/2010 11:21 PM, Andreas Dilger wrote: On 2010-07-08, at 14:01, Guy Coates wrote: Try this script (it is from Bernd Schubert); it will parse the per-client proc stats on the mds/oss into something nice and humanly readable. It is very useful. I'm not sure I'd quite call it human readable, but it does show that there is a need for something to print out stats for all of the clients. Yeah, I agree, it is not perfect yet. Especially, it needs to be sorted by the clients doing the most IO. That shouldn't be too difficult with the existing script. [...] Bernd, would you (or anyone) be interested in enhancing those tools to be able to show stats data from multiple files at once (each prefixed by the device name and/or client NID)? I don't think it makes sense to create separate tools for this. I'm not sure the existing Lustre tools are really what we need. If you have a cluster with 200 or more clients and then want to figure out which clients are doing the most IO, several lines per client provide too much output. One line, sorted by IO, seems better, IMHO. I would be interested in enhancing the existing tools, but if I look at the number of open bugs I have, several of those have a higher priority (btw, this script is on my bug list (bug 22469)). Additionally, at least for the next couple of weeks, my time is very, very limited, as I have to finish my thesis. Cheers, Bernd -- Bernd Schubert DataDirect Networks
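The "one line per client, sorted by IO" idea can be sketched like this. The stats layout assumed here (one directory per client NID, with the byte total as the last field of the write_bytes line) mirrors the 1.8 per-export stats files, but treat it as an assumption; the sketch builds fake data so it can run anywhere, and on a real server you would point it at something like /proc/fs/lustre/obdfilter/*/exports instead.

```shell
exports=/tmp/fake-exports          # stand-in for .../obdfilter/*/exports
mkdir -p "$exports/10.0.0.1@tcp" "$exports/10.0.0.2@tcp"
echo 'write_bytes 10 samples [bytes] 4096 1048576 500000' \
    > "$exports/10.0.0.1@tcp/stats"
echo 'write_bytes 90 samples [bytes] 4096 1048576 9000000' \
    > "$exports/10.0.0.2@tcp/stats"
# one line per client: total write bytes, busiest client first
for d in "$exports"/*/; do
    nid=$(basename "$d")
    bytes=$(awk '/^write_bytes/ {print $NF}' "$d/stats")
    printf '%s %s\n' "${bytes:-0}" "$nid"
done | sort -rn
```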
Re: [Lustre-discuss] Max bandwidth through a single 4xQDR IB link?
Hello Ashley, hello Kevin, I really see no point in using disks to benchmark performance when lnet_selftest exists. The benchmark order should be:
- test how much the disks can provide
- test the network with lnet_selftest
=> make sure Lustre performance is not much below min(disks, lnet_selftest)
Cheers, Bernd On Tuesday, June 29, 2010, Kevin Van Maren wrote: DAPL is a high-performance interface that uses a small shim to provide a common DMA API on top of (in this case) the IB verbs layer. In general, there is a very small performance impact to be able to use the common API, so you will not get more large-message bandwidth using native IB verbs. I've never had enough disk bandwidth behind a node to saturate a QDR IB link, so I'm not sure how high LNET will go. If you have an IB test cluster, you should be able to measure the upper limits by creating an OST on a loopback device on tmpfs, although you have to ensure the client-side cache is not skewing your results (hint: boot clients with something like mem=1g to limit the RAM they can use for the cache). While the QDR IB link bandwidth is 4GB/s (or around 3.9GB/s with 2KB packets), the maximum HCA bandwidth is normally around 3.2GB/s (unidirectional), due to the PCIe overhead of breaking the transaction into (relatively) small packets and managing the packet flow control/credits. This is independent of the protocol, and limited by the PCIe Gen2 x8 interface. You will see somewhat higher bandwidth if your system supports and uses a 256-byte MaxPayload, rather than 128 bytes. Use lspci to see what your system is using, as in: lspci -vv -d 15b3: | grep MaxPayload Kevin Ashley Pittman wrote: Hi, could anyone confirm to me the maximum achievable bandwidth over a single 4xQDR IB link into an OSS node? I have many clients doing a write test over IB and want to know the maximum bandwidth we can expect to see for each OSS node.
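The lnet_selftest step recommended above looks roughly like the session below, following the pattern in the Lustre manual. The NIDs, group names, and test sizes are placeholders; adapt them to your fabric, and this obviously only runs on nodes with the lnet_selftest module loaded.

```
modprobe lnet_selftest
export LST_SESSION=$$
lst new_session bw_test
lst add_group clients 10.0.0.[2-5]@o2ib
lst add_group servers 10.0.0.1@o2ib
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from clients --to servers brw write size=1M
lst run bulk_rw
sleep 30; lst stat clients servers
lst end_session
```

The `lst stat` output gives the sustained LNET bandwidth between the groups, which is the number to compare against your disk and Lustre results.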
For MPI over these links we see between 3 and 3.5GB/s, but I suspect Lustre is capable of more than this because it's not using DAPL; is this correct? Ashley. -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Can't put file on specific device or see it it in lfs df -h
Hi Katya! On Tuesday, June 22, 2010, Katya Tutlyaeva wrote: Hi everybody! Of course, these devices are successfully mounted on the OSS. When I move them using hb_takeover to another OSS (even if I move all devices, including the MDT, to the second OSS, or move these non-working devices to the first OSS), the first two OSTs remain up and accessible, while the second two are still N/A in df -h and for file striping. Please tell me if I am missing something... Can you post the output of 'lfs check servers' on the client side? Looking forward to your advice! Difficult to say anything without log files. Bye, Bernd
Re: [Lustre-discuss] Using brw_stats to diagnose lustre performance
On Tuesday, 15 June 2010, Kevin Van Maren wrote: Life is much easier with a 1MB (or 512KB) native raid stripe size. It looks like most IOs are being broken into 2 pieces. See https://bugzilla.lustre.org/show_bug.cgi?id=22850 for a few tweaks that would help get IOs > 512KB to disk. See also Bug I played with a similar patch some time ago (blkdev defines), but didn't notice any performance improvements on the DDN S2A 9900. Before increasing those values I got up to 7M IOs; after doubling MAX_HW_SEGMENTS and MAX_PHYS_SEGMENTS, max IOs doubled to 14M. Unfortunately, more IOs in between the magic good IO sizes also came up (magic good here: 1, 2, 3, ..., 14), e.g. lots of 1008 or 2032, etc. Example numbers from a production system (counts in hex):

Length   Port 1            Port 2            Port 3            Port 4
Kbytes   Reads    Writes   Reads    Writes   Reads    Writes   Reads    Writes
960      1DCD     2EEB     1E44     3532     1431     1D7E     14FB     2284
976      1ACD     34AC     1A0F     48EB     12E2     24AE     11E1     257F
992      1D46     3787     1CA7     51EB     144C     2E9B     1354     3A62
1008     100A5    11B5C    10391    13765    A9B8     FBED     9E9A     D457
1024     BFD41D   111F3C4  BFBE47   11A110D  8C316B   C95178   8E5A9F   C83850
1040     583      625      538      6C3      3F3      513      413      337
...
2032     551      1260     50D      136B     3E4      1218     3C8      BA1
2048     41B85FDB213B8D1 10185731088B78E02C4A592F48
2064     FB       20       108      24       BE       19       C7       10
2080     E3       2F       E6       37       AA       44       C7       1B
...
7152     55       6C7      58       80C      60       70D      3F       3B4
7168     449F     E335     417C     E743     3332     AB34     3686     A568
7184     291      142      191      140

I don't think it matters for any storage system whether max IO is 7M or 14M, but those sizes in between are rather annoying. And from the output of brw_stats I *sometimes* have no idea how that can happen. On the particular system I took the numbers from, users mostly don't do streaming writes, so there the reason is clear. After tuning the FZJ system (Kevin should know that system), the SLES11 kernel with chained scatter-gathering (so the blkdev patch is mostly not required anymore) can do IO sizes up to 12MB. Unfortunately, there are also quite some 1008s again, out of the blue, without an obvious reason (during my streaming writes with obdecho).
Cheers, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Lustre 1.8.3 on kernel 2.6.22
Hello Jonas, On Monday 07 June 2010, Jonas Ambrus wrote: Hi guys, I tried to compile Lustre 1.8.3 on kernel 2.6.22 (vanilla config). The configure script of Lustre works fine, but when I try to build Lustre it fails for the following reason: ___ Applying ext3-big-endian-check-2.6.22-vanilla 1 out of 5 hunks FAILED -- saving rejects to file fs/ext3/super.c.rej Patch ext3-big-endian-check-2.6.22-vanilla does not apply (enforce with -f) ... ___ I already tried to force it, but after that it's just a big mess and doesn't help ;) Do you have any suggestion to solve this problem? Background: I want to compile it because I also need a KVM module in my kernel. I'm using CentOS 5.4 but I'm also open to any other solution. The CentOS 5.4 kernel (2.6.18-164) has a far more up-to-date KVM implementation than 2.6.22 has, so I really see no reason why you would want to try 2.6.22. Cheers, Bernd
Re: [Lustre-discuss] Selective e2fsprogs installation on Ubuntu
as a non-existent program dumpe2fs - useful but not essential? Please tell me if I am missing/misunderstanding something? Cheers, Andreas -- Andreas Dilger Lustre Technical Lead Oracle Corporation Canada Inc. -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Failed OST Cleanup
On Wednesday 02 June 2010, Andreas Dilger wrote: On 2010-06-02, at 11:54, Scott Barber wrote: I'm now trying to get a list of files that are now corrupt. On one of the lustre clients I'm running: lfs find --obd sanvol06-OST0013_UUID my lustre mount point It starts to list files and then a few minutes later it runs into an error and stops: cb_find_init: IOC_LOV_GETINFO on filename failed: Input/output error. In dmesg I see: LustreError: 13926:0:(file.c:1053:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO The file that gets that Input/output error cannot be deleted or removed from the file system. How can I get around this? There is a bug in lfs find in that it tries to get the file size unnecessarily. You can use lfs getstripe --obd ... instead, and it should work even if the OST is down. Hmm, yes and no. In principle I like the idea that lfs find tries to figure out the file size. A couple of years ago I had to deal with a 3-disk failure in a raid6, and although we tried to clone the 3rd failing disk, in the end we lost that OST. A stripe size of 4M and a stripe count of 4 were configured. When I then ran 'lfs find' to find files located on that OST, it reported lots of files that *would* have had data on that OST if the files had been sufficiently large. But lots of files were smaller than 1M, and so it would have been wrong to delete those files. It turned out that 'lfs find' was rather useless for us and I simply had to read each file - if the read succeeded all was fine, if it failed I moved it into a dedicated subdirectory. The missing OST was later recreated (that was easier back then with 1.4 than nowadays) and we only lost a small part of the files, definitely much fewer than what 'lfs find' suggested. So if 'lfs find' now used the file size to determine whether a file is really located on an OST, that would be an improvement.
Of course, if it fails at all with an IO error, it is also not useful ;) Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Lustre and Automount
On Thursday 27 May 2010, Fraser McCrossan wrote: David Simas wrote: On Thu, May 27, 2010 at 12:50:15PM -0600, Andreas Dilger wrote: There have been some reports of problems with automount and Lustre that have never been tracked down. If someone with automount experience and config, and time to track this down, could investigate, I'm sure we could work it out. The autofs that comes with RHEL5 won't mount Lustre. We got it working with autofs-5.0.3-36, from some version of Fedora. Later versions of autofs should also work. The fix to autofs is almost trivial. A function scans the mount command to make sure it contains just legal characters. To get Lustre to automount, it needs @ in its list of such. We did find that the automounter sometimes failed to remove the Lustre record from /etc/mtab on unmounting the file system. That would cause subsequent remounts to fail. Another easy fix: link /etc/mtab to /proc/mounts. Also, remember that you can't mount Lustre subdirectories. That is, you can mount your Lustre filesystem as, say, /home, but you can't mount /home/username. Without having checked the code, I think it should be fairly simple to add support for that in mount.lustre:
- Cut off the directory from the filesystem name
- Mount Lustre into a temporary directory
- Bind mount Lustre into the target directory
The most difficult part will be to write a single entry into /etc/mtab. An approach that we are testing (but haven't tried in production yet) was suggested by an earlier post from Andreas Dilger, and involves two automounts. The first mounts the base Lustre filesystem(s) somewhere (say, /lustre) as a direct mount; the second is /etc/auto.home and looks like this: *-bind:/lustre/ In our case we have an executable map that generates the mount line based on the contents of the description field in LDAP (which indicates which Lustre FS contains the home - we have two), but the principle is the same. That should work.
The disadvantage of bind mounts is that 'lfs' does not recognize it as type lustre and therefore all those nice lfs subcommands will not work. Cheers, Bernd -- Bernd Schubert DataDirect Networks
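The two-level automount scheme described above (a direct mount of the base Lustre filesystem plus per-user bind mounts) might be sketched with maps like the following. All names, NIDs, and paths here are hypothetical, since the exact map line in the original mail was mangled; the files are written into a scratch directory so the sketch can be inspected safely:

```shell
# Hypothetical reconstruction of the two-automount setup: a direct
# mount of the base Lustre fs, then per-user bind mounts out of it.
# Server NID (mds1@tcp0), fs name (scratch) and paths are made up.
dir=$(mktemp -d)

cat > "$dir/auto.master" <<'EOF'
/-      /etc/auto.lustre
/home   /etc/auto.home
EOF

cat > "$dir/auto.lustre" <<'EOF'
/lustre -fstype=lustre  mds1@tcp0:/scratch
EOF

# '&' expands to the matched key, i.e. the username being looked up
cat > "$dir/auto.home" <<'EOF'
*       -fstype=bind    :/lustre/&
EOF

grep -c bind "$dir/auto.home"   # -> 1
```

In production the maps would of course live in /etc (or come from LDAP/NIS, as in Fraser's executable-map variant).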
Re: [Lustre-discuss] Future of lustre 1.8.3+
On Wednesday 26 May 2010, Guy Coates wrote: On 26/05/10 17:25, Ramiro Alba Queipo wrote: On Wed, 2010-05-26 at 16:48 +0100, Guy Coates wrote: One thing to watch out for in your kernel configs is to make sure that: CONFIG_SECURITY_FILE_CAPABILITIES=N OK. But the question is if this issue still applies for lustre-1.8.3 and SLES kernel linux-2.6.27.39-0.3.1.tar.bz2. I mean, it is quite surprising that, if this problem persists, Oracle is offering lustre packages for SLES11 with CONFIG_SECURITY_FILE_CAPABILITIES=y ??? I am just about to start testing, so I'd like to clarify this. The binary SLES packages are fine; it is the source packages that may be problematic, depending on your config. There is a bug filed against this. Sorry Guy. Maybe there is something I am missing, but the SLES11 rpm kernel server packages for lustre-1.8.3 are created using a config with CONFIG_SECURITY_FILE_CAPABILITIES=y (see the attachment). You are entirely correct. To be clear here, this is a CLIENT side issue, so whatever you set on the server side is irrelevant. Oracle cannot set kernel options for clients, as those are upstream kernels and lustre is compiled patchless against them. Applying this Lustre patch from bugzilla#15587 should solve the issue without the need to recompile the kernel: https://bugzilla.lustre.org/attachment.cgi?id=29116 Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Lustre and Automount
On Thursday 27 May 2010, David Simas wrote: On Thu, May 27, 2010 at 12:50:15PM -0600, Andreas Dilger wrote: There have been some reports of problems with automount and Lustre that have never been tracked down. If someone with automount experience and config, and time to track this down, could investigate, I'm sure we could work it out. The autofs that comes with RHEL5 won't mount Lustre. We got it working with autofs-5.0.3-36, from some version of Fedora. Later versions of autofs should also work. The fix to autofs is almost trivial. A function scans the mount command to make sure it contains just legal characters. To get Lustre to automount, it needs @ in its list of such. I doubt that you cannot get it working. At my previous job we used a NIS-based automounter for almost everything, including Lustre. All based on Debian, so I'm not absolutely sure about RedHat. However, it already worked with the very old Debian Sarge. The simple trick we had to do was to escape the @, i.e. to use \@. We did find that the automounter sometimes failed to remove the Lustre record from /etc/mtab on unmounting the file system. That would cause subsequent remounts to fail. Another easy fix: link /etc/mtab to /proc/mounts. That also happens sometimes without the automounter. Cheers, Bernd -- Bernd Schubert DataDirect Networks
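A map entry with the escaping Bernd describes might look like the following (the NID and filesystem name are invented for illustration; only the \@ trick is from the thread):

```shell
# Hypothetical automounter map entry: the '@' in the Lustre NID is
# escaped as '\@' so autofs's legal-character check accepts it.
entry='scratch -fstype=lustre mds1\@tcp0:/scratch'
echo "$entry"
```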
Re: [Lustre-discuss] Future of lustre 1.8.3+
On Wednesday 19 May 2010, Heiko Schröter wrote: On Wednesday, 19 May 2010 at 10:33:04, you wrote: On 2010-05-19, at 01:40, Heiko Schröter wrote: we would like to know which way lustre is heading. From the s/w repository we see that only the Red Hat and SUSE distros seem to be supported. Is this the official policy of the lustre development to stick to (only) these two distros? On the client side, we will support the main distros that our customers are using, namely RHEL/OEL/CentOS 5.x (and 6.x after release), and SLES 10/11. We make a best-effort attempt to have the client work with all client kernels, but since our resources are limited we cannot test kernels other than the supported ones. I don't see any huge demand for e.g. an officially-supported Ubuntu client kernel, but there has long been an unofficial Debian lustre package. On the server side, we will continue to support RHEL 5.x and SLES 10/11 for the Lustre 1.8 release, and RHEL 5.x (6.x is being worked on) for the Lustre 2.x release. Since maintaining kernel patches for other kernels is a lot of work, we do not attempt to provide patches for other than official kernels. However, there have in the past been ports of the kernel patches to other kernels by external contributors (e.g. FC11, FC12, etc.) and this will hopefully continue in the future. The server side is the more critical part, as we are using gentoo+lustre running a vanilla kernel 2.6.22.19 with the lustre patches version 1.6.6. As far as we are concerned it would be nice to have the patches for the vanilla kernels in 1.8.3+. This would be just fine. On the other hand, if maintaining them is the key problem on your side, what would be a major argument against using a patched sles/rhel kernel on a lustre server not running the sles/rhel distro? That is what I would recommend, and what several groups do (usually with Debian, though).
I know a lot of things can happen, but do these rhel/sles patches break some key features of the kernel which would only work under that specific distro? I've positively tested a lustre client with a sles-patched kernel on a gentoo distro, but I am a bit nervous about testing it on our live lustre server system. The only thing that really might cause trouble is udev, since sysfs maintainers like to break old udev versions. I think upcoming Debian Squeeze requires 2.6.27 at a minimum. Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Compiling lustre 1.8.X on Ubuntu LTS 10.04
On Tuesday 04 May 2010, Ramiro Alba Queipo wrote: Hi everybody, I would like to know if anybody is trying to compile lustre 1.8.X on Ubuntu LTS 10.04/Debian testing (squeeze), and to hear your opinion/comments on what I've got: I've been using lustre 1.8.1.1 with the RedHat5 kernel 2.6.18-128.7.1 on Ubuntu LTS 8.04, both servers and clients, but now I would like to upgrade to the recent Ubuntu LTS 10.04. When I try to compile lustre 1.8.1.1 on Ubuntu 10.04 (gcc-4.4 and libc6 2.11.1 instead of gcc-4.2 and libc6 2.7.10 on Ubuntu 8.04), and once I suppressed all references to -Werror in the configure script (I tried --disable-werror, but it did not work), I finally got: That is bug 22729. A very simple patch (entirely untested) should be:

diff --git a/lnet/include/libcfs/linux/kp30.h b/lnet/include/libcfs/linux/kp30.h
--- a/lnet/include/libcfs/linux/kp30.h
+++ b/lnet/include/libcfs/linux/kp30.h
@@ -386,17 +386,8 @@ extern int lwt_snapshot (cycles_t *now,
 # define LPF64 l
 #endif
 
-#ifdef HAVE_SIZE_T_LONG
-# define LPSZ %lu
-#else
-# define LPSZ %u
-#endif
-
-#ifdef HAVE_SSIZE_T_LONG
-# define LPSSZ %ld
-#else
-# define LPSSZ %d
-#endif
+#define LPSZ %zd
+#define LPSSZ %zd
 
 #ifndef LPU64
 # error No word size defined

Please note that I did not test this patch at all yet. Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Compiling lustre 1.8.X on Ubuntu LTS 10.04
On Tuesday 04 May 2010, Ramiro Alba Queipo wrote: On Tue, 2010-05-04 at 14:16 +0200, Bernd Schubert wrote: That is bug 22729. A very simple patch (entirely untested) should be:

diff --git a/lnet/include/libcfs/linux/kp30.h b/lnet/include/libcfs/linux/kp30.h
--- a/lnet/include/libcfs/linux/kp30.h
+++ b/lnet/include/libcfs/linux/kp30.h
@@ -386,17 +386,8 @@ extern int lwt_snapshot (cycles_t *now,
 # define LPF64 l
 #endif
 
-#ifdef HAVE_SIZE_T_LONG
-# define LPSZ %lu
-#else
-# define LPSZ %u
-#endif
-
-#ifdef HAVE_SSIZE_T_LONG
-# define LPSSZ %ld
-#else
-# define LPSSZ %d
-#endif
+#define LPSZ %zd
+#define LPSSZ %zd
 
 #ifndef LPU64
 # error No word size defined

Thanks Bernd. Now it compiles. Please note that I did not test this patch at all yet. I'll follow the bug and test yours. Now a couple of questions: 1) I've compiled the RedHat5 2.6.18-164.11.1 kernel using the config from file config-2.6.18-164.11.1.el5_lustre.1.8.3, extracted from the package kernel-2.6.18-164.11.1.el5_lustre.1.8.3.x86_64-ext4.rpm from the Oracle server, which says: Lustre-patched kernel for ext4 (MDS/MGS/OSS only). Is it the right one? I don't think there is a right or a wrong. For RHEL 5.4 kernels, ldiskfs is based on either ext3 or ext4. The ext3-based version is better tested (with default options); the ext4-based version has more features (e.g. 16TiB OST size support). 2) By looking at the infiniband libraries in Ubuntu LTS 10.04/Debian testing I could see that they are mainly OFED 1.4.2, except the libibverbs and librdmacm packages, which seem to be from OFED 1.5.1 (libibverbs) and 1.5 (librdmacm). I suppose it has been done this way due to the 2.6.32 kernel (containing OFED 1.5.1, as you can see in docs/OFED_release_notes.txt) and openmpi 1.4.1 (coming with OFED 1.5.1), but I'm afraid of having problems when using Lustre 1.8.3 on it. I asked Debian maintainer Roland Dreier but he did not answer. Should I worry? You do not need to care about IB libraries in Lustre at all. Lustre only accesses the kernel interface.
Also, OFED is only a stable collection of different libraries and utils. Mixing versions is allowed, but is not as well tested as the combinations provided by OFED. Cheers, Bernd
Re: [Lustre-discuss] LBUG: ost_rw_hpreq_check() ASSERTION(nb != NULL) failed
Hello Erich, check out my bug report: https://bugzilla.lustre.org/show_bug.cgi?id=19992 It was closed as a duplicate of bug 16129, although that is probably not correct, as 16129 is the root cause, but not the solution. As we never observed it with 1.6.7.2, I didn't complain when bug 19992 was closed. As you now can confirm it also happens with 1.6.7.2, please re-open that bug. Thanks, Bernd On Monday 19 April 2010, Erich Focht wrote: Hi, we saw this LBUG 3 times within the past week, and are puzzled about what's going on, and how come there's no bugzilla entry for this... What happens is that on an OSS a request (must be read or write) expects (according to the content of the ioobj structure) to find an array of 22 struct niobuf_remote's (niocount), but only finds one. This is obviously corrupted. We enabled checksumming where we could, but unfortunately the request headers don't seem to be covered by any checksum check (well, the reply path possibly is). Anyway, we see no corruption/checksum failures for bulk data transfer, so it's improbable that this is a corruption on the wire that three times in a row says size 16 too small (required X) (with X being 352, 432, 4016 in our failures). Did anybody see this? Any ideas or hints? We're using Lustre 1.6.7.2 on server and client side.
The LBUG traceback is:
LustreError: 12946:0:(pack_generic.c:566:lustre_msg_buf_v2()) msg 8101d0c4aad0 buffer[3] size 16 too small (required 352)
LustreError: 12946:0:(ost_handler.c:1594:ost_rw_hpreq_check()) ASSERTION(nb != NULL) failed
LustreError: 12946:0:(ost_handler.c:1594:ost_rw_hpreq_check()) LBUG
Lustre: 12946:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 12946
ll_ost_io_135 R running task 0 12946 1 12947 12945 (L-TLB)
88574438 88abb2e0 063a 8101d0c4ac28 88abb2e0 88571c20 88574a35 88abc7e2 0016
Call Trace:
[88571c20] :libcfs:tracefile_init+0x0/0x110
[88aac641] :ost:ost_rw_hpreq_check+0x1b1/0x290
[88ab9ebf] :ost:ost_hpreq_handler+0x50f/0x7c0
[886d243b] :ptlrpc:ptlrpc_main+0xebb/0x13e0
[8008a4aa] default_wake_function+0x0/0xe
[800b4a6d] audit_syscall_exit+0x327/0x342
[8005dfb1] child_rip+0xa/0x11
[886d1580] :ptlrpc:ptlrpc_main+0x0/0x13e0
[8005dfa7] child_rip+0x0/0x11
Regards, Erich -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Lost Files - How to remove from MDT
On Sunday 18 April 2010, Charles Taylor wrote: On Apr 18, 2010, at 9:38 AM, Brian J. Murrell wrote: On Sun, 2010-04-18 at 09:30 -0400, Charles Taylor wrote: Is there some way to remove these files from the MDT - as though they never existed - without reformatting the entire file system? lfsck is the documented, supported method. Yes, but we attempted that at one time with a smaller file system (for a different reason). After letting it run for over a day, we estimated that it would have taken seven to ten days to finish. That just wasn't practical for us at the time and still isn't. This file system would probably take a couple of weeks to lfsck. I'm sorry to say we can't take the file system offline for that long. You don't need to take the filesystem offline for lfsck. Also, I have rewritten large parts of lfsck and also fixed the parallelization code. I need to review all patches again and probably also make a hg or git repository out of it. Unfortunately, I always have more tasks to do than I manage to do... But given the fact that I fixed several bugs and added safety checks, I think my version actually is better than upstream. Let me know if you are interested and I can put a tar ball of e2fsprogs-sun-ddn on my home page. Cheers, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Extremely high load and hanging processes on a Lustre client
On Friday 05 March 2010, Götz Waschk wrote: Hi everyone, I have a critical problem on one of my Lustre client machines running Scientific Linux 5.4 and the patchless Lustre 1.8.2 client. After a few days of usage, some processes like cp and kswapd0 start to use 100% CPU. Only 180k of swap space are in use though. Processes that try to access Lustre use a lot of CPU and seem to hang. There is some output in the kernel log I'll attach to this mail. Do you have any idea what to test before rebooting the machine? Don't reboot, but disable LRU resizing: for i in /proc/fs/lustre/ldlm/namespaces/*; do echo 800 > ${i}/lru_size; done At least that always helped before when we had that problem. I hoped it would be fixed in 1.8.2, but it seems it is not. Please open a bug report. Thanks, Bernd -- Bernd Schubert DataDirect Networks
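The loop above writes a fixed value into each lock namespace's lru_size file, which turns off automatic LRU resizing. Here is the same write pattern exercised against a mock directory tree, since the real files live under /proc/fs/lustre/ldlm/namespaces/*/lru_size and only exist on a Lustre client (the namespace names below are made up):

```shell
# Demonstrate the write-into-lru_size pattern against a mock tree;
# on a real client the glob would be /proc/fs/lustre/ldlm/namespaces/*.
ns=$(mktemp -d)
mkdir -p "$ns/mdc-demo" "$ns/osc-demo"

for i in "$ns"/*; do
    echo 800 > "${i}/lru_size"   # a fixed LRU size disables auto-resizing
done

cat "$ns/mdc-demo/lru_size"   # -> 800
```

On later Lustre versions the equivalent can also be done with lctl, e.g. `lctl set_param ldlm.namespaces.*.lru_size=800`.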
Re: [Lustre-discuss] Rx failures
On Thursday 11 February 2010, Ulrich Sibiller wrote: Ulrich Sibiller wrote:
Feb 10 13:33:24 hpc9master02 kernel: LustreError: 4475:0:(lib-move.c:2436:LNetPut()) Error sending PUT to 12345-192.168.60@o2ib: -113
Feb 2 16:08:19 hpc9oss1 kernel: Lustre: 7937:0:(o2iblnd_cb.c:2220:kiblnd_passive_connect()) Conn stale 192.168.60@o2ib [old ver: 12, new ver: 12]
Feb 2 15:59:27 hpc9mds1 kernel: Lustre: 5008:0:(o2iblnd_cb.c:2232:kiblnd_passive_connect()) Conn race 192.168.60@o2ib
For the record: I finally found the source of these problems: we had two IPoIB interfaces in the fabric using the same IP address (192.168.60.226)... I guess next time you should run lnet_selftest and lctl ping first. Greetings from Tübingen, Bernd -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Filesystem monitoring in Heartbeat
On Thursday 21 January 2010, Adam Gandelman wrote: Jagga Soorma wrote: Hi Guys, My MDT is set up with LVM and I was able to test failover based on the Volume Group failing on my MDS (by unplugging both fibre cables). However, for my OSTs, I have created filesystems directly on the SAN LUNs, and when I unplug the fibre cables on my OSS, heartbeat does not detect failure for the filesystem since it shows as mounted. Is there somehow we can trigger a failure based on multipath failing on the OSS? Hi- It would depend on the version of heartbeat you are using. Heartbeat v1 did not do any resource-level monitoring, and if that is what you are using you are out of luck. If using v2 CRM and/or Pacemaker, you have two options: 1. Modify the Filesystem OCF script's monitor operation to check the actual health of the filesystem and/or multipath in addition to the status of the mount, and return accordingly. The Filesystem OCF agent is located at /usr/lib/ocf/resource.d/heartbeat/Filesystem 2. Create your own resource agent that interacts with dm/multipath to start/stop/monitor it. Then constrain the resource to start before/stop after and run with the Filesystem resource. Then the filesystem will be dependent on the health of the multipath resource. I guess you want to use the pacemaker agent I posted in this bugzilla: https://bugzilla.lustre.org/show_bug.cgi?id=20807 It does not interact with multipath, but knows about several lustre details. How would you monitor multipath? If one of your several paths fails, what do you want to do? If all paths fail, it is clear, but what to do for a partial path failure? I don't think OCF defines a return code for that. I also think multipath should be a separate agent, to reduce the complexity of the script. Cheers, Bernd -- Bernd Schubert DataDirect Networks
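A monitor action along the lines discussed above might be sketched like this. The return codes follow the OCF convention; the path-health check is a placeholder hook (e.g. a `multipath -ll` grep), not the agent from bug 20807:

```shell
# Minimal OCF-style monitor sketch: report "not running" if the
# target is not mounted, a generic error if the (placeholder)
# path-health check fails, success otherwise.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

fs_monitor() {
    mnt=$1
    path_check=$2   # hypothetical hook, e.g. 'multipath -ll $dev | grep -q active'
    mountpoint -q "$mnt" || return $OCF_NOT_RUNNING
    eval "$path_check" || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

fs_monitor /definitely-not-mounted true; echo $?   # -> 7
```

A real agent would also implement start, stop, and meta-data actions, and decide (per Bernd's question) how a partial path failure should map onto these return codes.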
Re: [Lustre-discuss] Lustre claims OST is mounted when it is not
On Friday 15 January 2010, Erik Froese wrote: We had an OSS lockup and it had to be reset. Heartbeat failed to mount one of the OSTs and unmounted all of its local OSTs. I'm trying to run mount on one of the OSTs (ost08) but it claims it is mounted when it is not. I have other OSTs mounted so I can't remove the driver right now. Any ideas? Redhat 5.3
[r...@oss-0-0 ~]# uname -a
Linux oss-0-0.local 2.6.18-128.7.1.el5_lustre.1.8.1.1 #1 SMP Tue Oct 6 05:48:57 MDT 2009 x86_64 x86_64 x86_64 GNU/Linux
[r...@oss-0-0 ~]# mount | grep ost
/dev/dsk/ost12 on /mnt/scratch/ost12 type lustre (rw)
/dev/dsk/ost16 on /mnt/scratch/ost16 type lustre (rw)
/dev/dsk/ost20 on /mnt/scratch/ost20 type lustre (rw)
/dev/dsk/ost00 on /mnt/scratch/ost00 type lustre (rw)
/dev/dsk/ost04 on /mnt/scratch/ost04 type lustre (rw)
/dev/dsk/ost110 on /mnt/scratch/ost110 type lustre (rw)
[r...@oss-0-0 ~]# umount -f /mnt/scratch/ost08
umount2: Invalid argument
umount: /mnt/scratch/ost08: not mounted
[r...@oss-0-0 ~]# e2fsck -n /dev/dsk/ost08 | tee /state/partition1/e2fsck-n.ost08_`date '+%m.%d.%y-%H:%M:%S'`.log
e2fsck 1.41.6.sun1 (30-May-2009)
device /dev/sdj mounted by lustre per /proc/fs/lustre/obdfilter/scratch-OST0018/mntdev
Warning! /dev/dsk/ost08 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
see here:
https://bugzilla.lustre.org/show_bug.cgi?id=19566
https://bugzilla.lustre.org/show_bug.cgi?id=21359
-- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] Lustre claims OST is mounted when it is not
Hello Erik, unfortunately, there is no solution other than to reboot. For some unknown (yet to be debugged) reason, references could not be given up, so in order to prevent NULL pointer dereferences, Lustre did not umount. Cheers, Bernd On Friday 15 January 2010, Erik Froese wrote: Thanks Bernd. From the bug reports it looks like the OST is actually still mounted by lustre, unbeknownst to Linux and the VFS. Is there a mechanism to unmount it or do I need to reboot? Erik On Fri, Jan 15, 2010 at 3:28 PM, Bernd Schubert bs_li...@aakef.fastmail.fm wrote: On Friday 15 January 2010, Erik Froese wrote: We had an OSS lockup and it had to be reset. Heartbeat failed to mount one of the OSTs and unmounted all of its local OSTs. I'm trying to run mount on one of the OSTs (ost08) but it claims it is mounted when it is not. I have other OSTs mounted so I can't remove the driver right now. Any ideas? Redhat 5.3 [r...@oss-0-0 ~]# uname -a Linux oss-0-0.local 2.6.18-128.7.1.el5_lustre.1.8.1.1 #1 SMP Tue Oct 6 05:48:57 MDT 2009 x86_64 x86_64 x86_64 GNU/Linux [r...@oss-0-0 ~]# mount | grep ost /dev/dsk/ost12 on /mnt/scratch/ost12 type lustre (rw) /dev/dsk/ost16 on /mnt/scratch/ost16 type lustre (rw) /dev/dsk/ost20 on /mnt/scratch/ost20 type lustre (rw) /dev/dsk/ost00 on /mnt/scratch/ost00 type lustre (rw) /dev/dsk/ost04 on /mnt/scratch/ost04 type lustre (rw) /dev/dsk/ost110 on /mnt/scratch/ost110 type lustre (rw) [r...@oss-0-0 ~]# umount -f /mnt/scratch/ost08 umount2: Invalid argument umount: /mnt/scratch/ost08: not mounted [r...@oss-0-0 ~]# e2fsck -n /dev/dsk/ost08 | tee /state/partition1/e2fsck-n.ost08_`date '+%m.%d.%y-%H:%M:%S'`.log e2fsck 1.41.6.sun1 (30-May-2009) device /dev/sdj mounted by lustre per /proc/fs/lustre/obdfilter/scratch-OST0018/mntdev Warning! /dev/dsk/ost08 is mounted. Warning: skipping journal recovery because doing a read-only filesystem check.
see here: https://bugzilla.lustre.org/show_bug.cgi?id=19566 https://bugzilla.lustre.org/show_bug.cgi?id=21359 -- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] No space left on device for just one file
320625972 1% /lustre/scratch[OST:24]
scratch-OST0019_UUID 329427010 5222114 324204896 1% /lustre/scratch[OST:25]
scratch-OST001a_UUID 317921820 5115591 312806229 1% /lustre/scratch[OST:26]
scratch-OST001b_UUID 366288896 5353229 360935667 1% /lustre/scratch[OST:27]
scratch-OST001c_UUID 366288896 5383473 360905423 1% /lustre/scratch[OST:28]
scratch-OST001d_UUID 366288896 5411890 360877006 1% /lustre/scratch[OST:29]
scratch-OST001e_UUID 216236615 617 210047728 2% /lustre/scratch[OST:30]
scratch-OST001f_UUID 366288896 6465049 359823847 1% /lustre/scratch[OST:31]
filesystem summary: 1453492963 174078773 1279414190 11% /lustre/scratch
Thanks, Mike Robbert
On Jan 11, 2010, at 7:24 PM, Andreas Dilger wrote: On 2010-01-11, at 15:59, Michael Robbert wrote: The filename is not very unique. I can create a file with the same name in another directory or on another Lustre filesystem. It is just this exact path on this filesystem. The full path is: /lustre/scratch/smoqbel/Cenval/CLM/Met.Forcing/18X11/NLDAS.APCP.007100.pfb.00164 The mount point for this filesystem is /lustre/scratch/ Robert, does the same problem happen on multiple client nodes, or is it only happening on a single client? Are there any messages on the MDS and/or the OSSes when this problem is happening? This problem is somewhat unusual, since I'm not aware of any places outside the disk filesystem code that would cause ENOSPC when creating a file. Can you please do a bit of debugging on the system:
{client}# cd /lustre/scratch/smoqbel/Cenval/CLM/Met.Forcing/18X11
{mds,client}# echo -1 > /proc/sys/lustre/debug # enable full debug
{mds,client}# lctl clear # clear debug logs
{client}# touch NLDAS.APCP.007100.pfb.00164
{mds,client}# lctl dk > /tmp/debug.{mds,client} # dump debug logs
For now, please just extract the ENOSPC error from the logs; that will be much shorter, may be enough to identify where the problem is located, and will be a lot friendlier to the list.
grep -- -28 /tmp/debug.{mds,client} > /tmp/debug-28.{mds,client}
along with the lfs df and lfs df -i output. If this is only on a single client, just dropping the locks on the client might be enough to resolve the problem:
for L in /proc/fs/lustre/ldlm/namespaces/*; do echo clear > $L/lru_size; done
If, on the other hand, this same problem is happening on all clients, then the problem is likely on the MDS. On Fri, Jan 8, 2010 at 1:36 PM, Michael Robbert mrobb...@mines.edu wrote: I have a user that reported a problem creating a file on our Lustre filesystem. When I investigated I found that the problem appears to be unique to just one filename in one directory. I have tried numerous ways of creating the file, including echo, touch, and lfs setstripe; all return No space left on device. I have checked the filesystem with df and lfs df; both show that the filesystem and all OSTs are far from being full, for both blocks and inodes. Slightly changed filenames are created fine. We had a kernel panic on the MDS yesterday and it is quite possible that the user had a compute job working in this directory at the time of that problem. I am guessing we have some kind of corruption in the directory. This directory has around 1 million files, so moving the data around may not be a quick operation, but we're willing to do it. I just want to know the best way, short of taking the filesystem offline, to fix this problem. Any ideas? Thanks in advance, Mike Robbert Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc.
-- Bernd Schubert DataDirect Networks
Re: [Lustre-discuss] No space left on device for just one file
Hello Robert, could you please send a mail into our ticket system? Kit or I would then start to investigate tomorrow. Thanks, Bernd On Monday 11 January 2010, Michael Robbert wrote: The filename is not very unique. I can create a file with the same name in another directory or on another Lustre filesystem. It is just this exact path on this filesystem. The full path is: /lustre/scratch/smoqbel/Cenval/CLM/Met.Forcing/18X11/NLDAS.APCP.007100.pfb .00164 The mount point for this filesystem is /lustre/scratch/ Thanks, Mike On Jan 11, 2010, at 5:52 AM, Mag Gam wrote: Can you paste us the file name? I want to see if we can touch something like this. On Fri, Jan 8, 2010 at 1:36 PM, Michael Robbert mrobb...@mines.edu wrote: I have a user that reported a problem creating a file on our Lustre filesystem. When I investigated I found that the problem appears to be unique to just one filename in one directory. I have tried numerous ways of creating the file including echo, touch, and lfs setstripe all return No space left on device. I have checked the filesystem with df and lfs df both show that the filesystem and all OSTs are far from being full for both blocks and inodes. Slight changes in the filename are created fine. We had a kernel panic on the MDS yesterday and it was quite possible that the user had a compute job working in this directory at the time of that problem. I am guessing we have some kind of corruption with the directory. This directory has around 1 million files so moving the data around may not be a quick operation, but we're willing to do it. I just want to know the best way, short of taking the filesystem offline, to fix this problem. Any ideas? 
Thanks in advance, Mike Robbert ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] failover problems using separated journal disk
Hello Antonio, On Wednesday 23 December 2009, Antonio Concas wrote: Hi, all Dec 23 11:20:29 mommoti12 kernel: LDISKFS-fs: external journal has bad superblock see here: https://bugzilla.lustre.org/show_bug.cgi?id=21389 Cheers, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] lustre 1.6.7.2 client kernel panic
Hello Nick, at least I'm not aware of any drawbacks. Cheers, Bernd On Tuesday 22 December 2009, Nick Jennings wrote: Thanks for this tip Bernd. I'll be unable to upgrade for a while, so this is a very useful workaround. Does it have any drawbacks I should be aware of? On 12/22/2009 12:35 AM, Bernd Schubert wrote: On Monday 21 December 2009, Andreas Dilger wrote: On 2009-12-21, at 11:15, Nick Jennings wrote: I had another instance of the client kernel panic which I first encountered a few months ago. This time I managed to get a shot of the console. Attached is the dmesg output from ssn1(OSS) dbn1(MDS) and the JPG is from the console of wsn1(client). I see bug 19841, which has at least part of this stack (ldlm_cli_pool_shrink) and that is marked a duplicate of 17614. The latter bug is marked landed for 1.8.0 and later releases. Nick, if you do not want to upgrade or patch your Lustre version, the workaround for this is to disable lockless truncates. # on all clients for i in /proc/fs/lustre/llite/*; do echo 0 > ${i}/lockless_truncate; done Cheers, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
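The one-liner quoted above writes 0 into each client's /proc tunable. A slightly more defensive version of the same workaround can be sketched as a function (the helper name and the existence/writability guard are mine, not from the thread):

```shell
#!/bin/sh
# disable_lockless_truncate DIR: write 0 to every */lockless_truncate
# tunable under DIR, skipping entries that are absent or not writable.
# On a real Lustre client, DIR is /proc/fs/lustre/llite.
disable_lockless_truncate() {
    for f in "$1"/*/lockless_truncate; do
        [ -w "$f" ] || continue
        echo 0 > "$f"
    done
}

# On every client one would run:
#   disable_lockless_truncate /proc/fs/lustre/llite
```

The guard makes the script a harmless no-op on machines that have no Lustre filesystem mounted.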
Re: [Lustre-discuss] lustre 1.6.7.2 client kernel panic
On Tuesday 22 December 2009, Nick Jennings wrote: On 12/21/2009 07:36 PM, Brian J. Murrell wrote: Photographs of 25 line console screens are not very often suitable substitutes for real console logging, unfortunately. Seriously, if you really want to pursue this issue, you are going to have to set up some form of console logging. I think netconsole is usually fairly successful at capturing kernel oops dumps. Maybe that's an option. ISTR mentioning netconsole the last time though. Maybe that was another thread. You're right, I just hadn't gotten around to getting netconsole set up like I planned. *blush* :) Most servers nowadays have IPMI and an IPMI SOL is much better. Cheers, Bernd ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] lustre 1.6.7.2 client kernel panic
On Tuesday 22 December 2009, David Dillow wrote: On Tue, 2009-12-22 at 18:09 +0100, Bernd Schubert wrote: On Tuesday 22 December 2009, Nick Jennings wrote: On 12/21/2009 07:36 PM, Brian J. Murrell wrote: Photographs of 25 line console screens are not very often suitable substitutes for real console logging, unfortunately. Seriously, if you really want to pursue this issue, you are going to have to set up some form of console logging. I think netconsole is usually fairly successful at capturing kernel oops dumps. Maybe that's an option. ISTR mentioning netconsole the last time though. Maybe that was another thread. You're right, I just hadn't gotten around to getting netconsole set up like I planned. *blush* :) Most servers nowadays have IPMI and an IPMI SOL is much better. Heh, I'd like to know what servers you are running. Our experience with IPMI SOL on a variety of systems has been anything but reliable. It has a notorious habit of dropping out under any sort of load, such as during an oops where you need it the most. It's still better than nothing, but it's a crapshoot. Yes, I know about IPMI issues, of course. In my experience, SuperMicro IPMI with an additional NIC port works perfectly. I don't know about their most recent mainboards and BMCs, though. The first week I started for DDN I learned that Dell-DRAC5 has a bug and does not send a break (sysrq). According to Dell, this is fixed in their recent firmware released 7 days ago (I opened the 'priority' call on March 6th), but I could not check yet. Also working rather well is HP iLO, although not with SOL, but with its built-in vsp. The problem with vsp is that cursor keys do not work and navigating through the grub menu is a pain, unless you know emacs shortcuts inside out (I'm a vi user...). Cheers, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Implementing MMP correctly
Michael, to answer your question on the pacemaker mailing list, if you use an agent that also checks for all umount bugs, it might work without MMP, but you still remove a very useful protection. And the situation didn't change since October when you asked a similar question last time ;) On Tuesday 22 December 2009, Jim Garlick wrote: On Tue, Dec 22, 2009 at 02:12:44PM +0100, Michael Schwartzkopff wrote: Hi, I am trying to understand how to implement MMP correctly in a lustre failover cluster. As far as I understood, MMP protects the same filesystem from being mounted by different nodes (OSS) of a failover cluster. So far so good. If a node was shut down uncleanly it will still occupy its filesystems via MMP and thus prevent a clean failover to another node. How did you get this idea at all? Hi, ldiskfs (or e2fsck) will poll the MMP block to see if the other side is still updating it before starting. If updates have ceased, the mount or fsck will start. So the workarounds below are unnecessary. Now I want to implement a clean failover into the Filesystem Resource Agent of pacemaker. Is there a good way to solve the problem with MMP? Possible solutions are: - Disable the MMP feature in a cluster at all, since the resource manager takes care that the same resource is only mounted once in the cluster - Do a tunefs -O ^mmp device and a tunefs -O mmp device before every mounting of a resource? tune2fs -E clear_mmp is a faster alternative. Doing that for each and every mount would basically remove the MMP protection. I think Michael wants to write an agent that does that automatically... And again, I submitted and updated a suitable agent in Lustre bugzilla 20807. It is almost ready to be submitted to heartbeat/pacemaker, I only need to clean up some comments and slightly simplify some umount checks. Should only be necessary if e2fsck is interrupted. 
(e2fsck does not regularly update the MMP block like the file system does) - Do a sleep 10 before mounting a resource? But the manual says the file system mount may require additional time if the file system was not cleanly unmounted. It will require more time for a journal replay, I guess. - Check if the file system is in use by another OSS through MMP and wait a little bit longer? How do I do this? Not necessary. All DDN Lustre installations in Europe are now based on pacemaker, without any ugly workarounds. But then as I told you before, our releases also fix bug 19566 already. -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] lustre 1.6.7.2 client kernel panic
On Monday 21 December 2009, Andreas Dilger wrote: On 2009-12-21, at 11:15, Nick Jennings wrote: I had another instance of the client kernel panic which I first encountered a few months ago. This time I managed to get a shot of the console. Attached is the dmesg output from ssn1(OSS) dbn1(MDS) and the JPG is from the console of wsn1(client). I see bug 19841, which has at least part of this stack (ldlm_cli_pool_shrink) and that is marked a duplicate of 17614. The latter bug is marked landed for 1.8.0 and later releases. Nick, if you do not want to upgrade or patch your Lustre version, the workaround for this is to disable lockless truncates. # on all clients for i in /proc/fs/lustre/llite/*; do echo 0 > ${i}/lockless_truncate; done Cheers, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
[Lustre-discuss] async journals
Hello, I'm presently a bit puzzled about asynchronous journal patches. While I was just reading the jbd-journal-chksum-rhel53.patch patch, I noticed it also adds a new option and feature journal_async_commit. But then ever since lustre-1.8.0 there is also a patch included for async journals from obdfilter. This patch is presently disabled, since it could cause data corruption on failover. I now wonder how these two patches/features are related, i.e. jbd/ldiskfs/ext4 (journal_async_commit) vs. obdfilter (obdfilter.*.sync_journal=0). When I did some tests with lctl set_param obdfilter.*.sync_journal=0, it even slightly reduced performance. So I wonder if one additionally needs to enable jbd async journals? Thanks, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] OST I/O problems
On Friday 04 December 2009, Heiko Schröter wrote: Hello, we do see those messages (see below) on our OSTs when under heavy _read_ load (or when 60+ jobs are trying to read data at approx the same time). The OSTs freeze and even console output is down to a few bytes per minute. After some time the OSTs do recover. ler.c:882:ost_brw_read()) @@@ timeout on bulk PUT after 100+0s r...@81007efa7e00 x7869690/t0 This error message means you have a flaky network. For example it comes up if you set a high MTU, but your switch does not support it. Cheers, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
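One quick sanity check for the MTU mismatch mentioned above is to compare the MTU of every interface on clients and servers against the switch configuration. A small sketch (the helper name is mine; on a live node one would pass /sys/class/net):

```shell
#!/bin/sh
# show_mtus DIR: print "<iface> <mtu>" for every interface directory
# under DIR that exposes an mtu file. On a live Linux node, DIR is
# /sys/class/net; run it on every node and compare with the switch.
show_mtus() {
    for dev in "$1"/*; do
        [ -r "$dev/mtu" ] || continue
        printf '%s %s\n' "$(basename "$dev")" "$(cat "$dev/mtu")"
    done
}

# Usage on a live node:
#   show_mtus /sys/class/net
```

Any node whose data interface reports a jumbo MTU that the switch does not actually support is a candidate for the bulk timeouts described above.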
Re: [Lustre-discuss] how to define 60 failnodes
On Monday 09 November 2009, Brian J. Murrell wrote: Theoretically. I had discussed this briefly with another engineer a while ago and IIRC, the result of the discussion was that there was nothing inherent in the configuration logic that would prevent one from having more than two (primary and failover) OSSes providing service to an OST. Two nodes per OST is how just about everyone that wants failover configures Lustre. Not everyone ;) And especially it doesn't make sense to have a 2 node failover scheme with pacemaker: https://bugzilla.lustre.org/show_bug.cgi?id=20964 -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Support for vanilla kernels in lustre servers
On Saturday 31 October 2009, Mag Gam wrote: if I were to deploy a system now and I want to do the kernel compile way, what kernel do you recommend? I prefer using 1.6.7.2 because of its stability... Sun is very helpful and provides distribution kernels as tar.bz2 on their download page: http://downloads.lustre.org/public/kernels/ So instead of going through the pain to get that yourself from the vendor's src.rpm, Sun already greatly helps (so far I have not found an easy way to do that myself, any hint from the guy providing the tar files would be highly appreciated). In a perfect world, this page would also state which kernel is suitable for which Lustre version, e.g. lustre-1.6.7.2 linux-2.6.18-92.1.10.el5.tar.bz2 lustre-1.8.1.1 linux-2.6.18-128.7.1.el5.tar.bz2 Also missing are the .config files. I usually extract these from the kernel binary packages - I need to download 150MB to get the 4KB config file *sigh*. And better don't try to change options in non-vanilla kernels, this very often fails, because the vendor doesn't support it and so also doesn't test it. Btw, of course these kernels also work for different distributions, so instead of going through the pain to port Lustre to Ubuntu or Debian kernels, I simply started to create Debian packages for the RHEL5 kernels http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/lustre/1.6/debs/lustre-clients/1.6.7.2-ddn2/ linux-image-2.6.18-128.7.1.el5_1_amd64.deb linux-headers-2.6.18-128.7.1.el5_1_amd64.deb Cheers, Bernd PS: Disclaimer: Whatever packages you may find on my home page, I won't provide support for these! -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] intrepid kernel on jaunty
On Thursday 29 October 2009, Ralf Utermann wrote: Papp Tamás schrieb: Papp Tamás wrote, On 2009. 10. 28. 22:01: Brian J. Murrell wrote, On 2009. 10. 28. 20:26: Additionally, b1_8 (aka 1.8.2) also has debian packaging support (on an unsupported basis at this point) in /debian. Here are the installed packages, built from the /debian in b1_8 from my client: ii lustre-client-modules-2.6.28-11-generic1.8.1.50-2 Lustre Linux kernel module (kernel 2.6.28-11 ii lustre-tests 1.8.1.50-2 Test suite for the Lustre filesystem ii lustre-utils 1.8.1.50-2 Userspace utilities for the Lustre filesyste Somehow I could make the debs, but not the modules, only these: liblustre_1.8.1.50-1_amd64.deb lustre-dev_1.8.1.50-1_amd64.deb lustre-tests_1.8.1.50-1_amd64.deb linux-patch-lustre_1.8.1.50-1_all.deb lustre-source_1.8.1.50-1_all.deb lustre-utils_1.8.1.50-1_amd64.deb then install lustre-source_1.8.1.50-1_all.deb, and build the modules using 'm-a build lustre ' . You might need to chmod +x /usr/src/modules/lustre/debian/rules . I might be the only one complaining, but I think the method I proposed with attachment 24529 (https://bugzilla.lustre.org/attachment.cgi?id=24529) was more convenient, at least when you just want to build packages, e.g. in a chroot. Just autotools support was missing to automatically do the sed steps. And no, when Brian was working on this, I really didn't have the time to step in, I had already been far above 16 hours per day at that time. -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] more on MGS and MDT separation
On Thursday 22 October 2009, Ms. Megan Larko wrote: Greetings, I am deactivating an older Lustre filesystem in favor of a newer one (already up and stable). The message in Lustre-discuss Digest Vol 45 Issue 22 stated (with two of my comments in-line): Message: 2 Date: Sun, 11 Oct 2009 19:05:07 +0200 From: Bernd Schubert bs_li...@aakef.fastmail.fm Subject: Re: [Lustre-discuss] Moving MGS to separate device To: lustre-discuss@lists.lustre.org Message-ID: 200910111905.08076.bs_li...@aakef.fastmail.fm Content-Type: Text/Plain; charset=iso-8859-15 Hello Wojciech, I already did this several times, here are the steps I so far used: 1) Remove MGS from MDT-device tunefs.lustre --nogms /dec/mdt_device Megan: I am assuming --nomgs here. Yes, sorry a typo. 2) Create new MGS mkfs.lustre --mgs /dev/mgs_device 3) Make sure OSTs and MDTs re-register with the MGS: tunefs.lustre --writeconf /dev/device Megan: Do I need to do this even if the MGS is being moved from a shared device with an MDT to its own device/hard drive on the same physical server (same MAC addr, IP, hostname etc.)? I did it to make sure MDT and OSTs re-register with the MGS, so to make sure the MGS really knows about them. In the end the MGS is only there to know about OSTs and MDTs and to provide those information to clients. I think on mounting of MDTs and OSTs they always contact the MGS, but if the MGS is not available they will give up, so it **might** happen that they wouldn't contact the MGS and so wouldn't be registered. --writeconf will ensure that. As I also wrote, it might work to simply copy the CONFIGS directory, but I didn't test that yet. I'm not sure if writeconf is really necessary, but so far I always did it to make sure everything goes smoothly (clients shouldn't have the filesystem mounted at this step). 5) Mount MGS, MDT, OSTs 4) Re-apply settings done with lctl. Megan: Why are the above ordered the way they are? Shouldn't I mount first and then apply the settings? 
(I didn't think I could lctl an unmounted OST/MDT etc.) And another typo, actually even two. lctl settings can be applied only with the filesystem being mounted. So 4) Mount MGS, MDT, OSTs 5) Re-apply settings done with lctl. Cheers, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
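With both typos corrected, the whole procedure can be put together as a dry-run shell function (the function, device paths, and mount points are illustrative, not from the thread; by default it only prints the commands):

```shell
#!/bin/sh
# move_mgs MDT_DEV MGS_DEV [RUN]: command sequence for moving the MGS
# off a combined MGS/MDT device. RUN defaults to "echo", so this is a
# dry run; pass an empty third argument on the real servers to execute.
# The --writeconf step must also be run on every OST device.
move_mgs() {
    mdt=$1; mgs=$2; run=${3-echo}
    $run tunefs.lustre --nomgs "$mdt"        # 1) remove MGS from the MDT
    $run mkfs.lustre --mgs "$mgs"            # 2) create the new MGS
    $run tunefs.lustre --writeconf "$mdt"    # 3) force re-registration
    $run mount -t lustre "$mgs" /mnt/mgs     # 4) mount MGS first,
    $run mount -t lustre "$mdt" /mnt/mdt     #    then MDT (and OSTs)
}                                            # 5) re-apply lctl settings

# Dry run, printing the five commands:
#   move_mgs /dev/mdt_device /dev/mgs_device
```

The dry-run default keeps the sketch safe to paste; only after reviewing the printed commands would one re-run it with an empty RUN argument on the servers.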
Re: [Lustre-discuss] 1.8.1 test setup achieved, what about maximum mdt size
On Tuesday 20 October 2009, Andreas Dilger wrote: On 18-Oct-09, at 16:04, Piotr Wadas wrote: Now, I did a simple count of MDT size as described in lustre 1.8.1 manual, and setup mdt as recommended. The question is, no matter I did right count or not, what actually will happen, if MDT partition runs out of space? Any chances to dump the whole MGS+MDT combined fs, supply a bigger block device, or extend partition size with some e2fsprogs/tune2fs trick ? This assumes, that no matter how big MDT is, it will be exhausted someday. It is true that the MDT device can become full at some point, but this happens fairly rarely given that most Lustre HPC users have very large files, and the size of the MDT is MUCH smaller than the space needed for the file data. The maximum size of MDT is 8TB, and if you format the Is that still true with recent kernels such as the one from SLES11? I thought ldiskfs is based on ext4 there? So we should have at least 16TiB and I'm not sure if all the e2fsprogs patches already have been landed to get 64-bit max sizes? Thanks, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Understanding of MMP
On Monday 19 October 2009, Andreas Dilger wrote: On 19-Oct-09, at 08:46, Michael Schwartzkopff wrote: perhaps I have a problem understanding multiple mount protection MMP. I have a cluster. When a failover happens sometimes I get the log entry: Oct 19 15:16:08 sososd7 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect: Device is already active on another node. Oct 19 15:16:08 sososd7 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect: MMP failure info: last update time: 1255958168, last update node: sososd3, last update device: dm-2 Does the second line mean that my node (sososd7) tried to mount /dev/ dm-2 but MMP prevented it from doing so because the last update from the old node (sososd3) was too recent? The update time stored in the MMP block is purely for informational purposes. It actually uses a sequence counter that has nothing to do with the system clock on either of the nodes (since they may not be in sync). What that message actually means is that sososd7 tried to mount the filesystem on dm-2 (which likely has another LVM name that the kernel doesn't know anything about) but the MMP block on the disk was modified by sososd3 AFTER sososd7 first looked at it. Probably, bug#19566. Michael, which Lustre version do you exactly use? Thanks, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Understanding of MMP
On Monday 19 October 2009, Michael Schwartzkopff wrote: Am Montag, 19. Oktober 2009 20:42:19 schrieben Sie: On Monday 19 October 2009, Andreas Dilger wrote: On 19-Oct-09, at 08:46, Michael Schwartzkopff wrote: perhaps I have a problem understanding multiple mount protection MMP. I have a cluster. When a failover happens sometimes I get the log entry: Oct 19 15:16:08 sososd7 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect: Device is already active on another node. Oct 19 15:16:08 sososd7 kernel: LDISKFS-fs warning (device dm-2): ldiskfs_multi_mount_protect: MMP failure info: last update time: 1255958168, last update node: sososd3, last update device: dm-2 Does the second line mean that my node (sososd7) tried to mount /dev/ dm-2 but MMP prevented it from doing so because the last update from the old node (sososd3) was too recent? The update time stored in the MMP block is purely for informational purposes. It actually uses a sequence counter that has nothing to do with the system clock on either of the nodes (since they may not be in sync). What that message actually means is that sososd7 tried to mount the filesystem on dm-2 (which likely has another LVM name that the kernel doesn't know anything about) but the MMP block on the disk was modified by sososd3 AFTER sososd7 first looked at it. Probably, bug#19566. Michael, which Lustre version do you exactly use? Thanks, Bernd I got version 1.8.1.1 which was published last week. Is the fix included or only in 1.8.2? According to the bugzilla (https://bugzilla.lustre.org/show_bug.cgi?id=19566) not yet in 1.8.1.1. Our ddn internal releases of course do have it. And from my point of view this is a really important fix. Ever since 1.6.7 there is also no chance anymore to figure out the unsuccessful umount from the resource agent (up to 1.6.6 /proc/fs/lustre/.../mntdev would tell you the device is still mounted). To be sure this really your issue, do you see this in your kernel logs? 
CERROR("Mount %p is still busy (%d refs), giving up.\n", mnt, atomic_read(&mnt->mnt_count)); -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
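To check whether this symptom is what actually happened, one can grep the kernel log for that message; a trivial sketch (the helper name is mine):

```shell
#!/bin/sh
# busy_umount_msgs LOGFILE: print kernel-log lines showing the failed
# umount from bug 19566. Point it at /var/log/kern.log, or at the
# output of dmesg saved to a file.
busy_umount_msgs() {
    grep -F 'is still busy' "$1"
}
```

If the message appears, the device was never cleanly released and MMP is right to block the second mount.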
Re: [Lustre-discuss] Problem re-mounting Lustre on an other node
On Wednesday 14 October 2009, Michael Schwartzkopff wrote: Hi, we have a Lustre 1.8 Cluster with openais and pacemaker as the cluster manager. When I migrate one lustre resource from one node to an other node I get an error. Stopping lustre on one node is no problem, but the node where lustre should start says: Oct 14 09:54:28 sososd6 kernel: kjournald starting. Commit interval 5 seconds Oct 14 09:54:28 sososd6 kernel: LDISKFS FS on dm-4, internal journal Oct 14 09:54:28 sososd6 kernel: LDISKFS-fs: recovery complete. Oct 14 09:54:28 sososd6 kernel: LDISKFS-fs: mounted filesystem with ordered data mode. Oct 14 09:54:28 sososd6 multipathd: dm-4: umount map (uevent) Oct 14 09:54:39 sososd6 kernel: kjournald starting. Commit interval 5 seconds Oct 14 09:54:39 sososd6 kernel: LDISKFS FS on dm-4, internal journal Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: mounted filesystem with ordered data mode. Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: file extents enabled Oct 14 09:54:39 sososd6 kernel: LDISKFS-fs: mballoc enabled Oct 14 09:54:39 sososd6 kernel: Lustre: mgc134.171.16@tcp: Reactivating [...] These log continue until the cluster software times out and the resource tells me about the error. Any help understanding these logs? Thanks. What is your start timeout? Do you see mount in the process list? I guess you just need to increase the timeout, I usually set at least 10 minutes, sometimes even 20 minutes. Also see my bug report and if possible add further information yourself. https://bugzilla.lustre.org/show_bug.cgi?id=20402 Thanks, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
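On the pacemaker side, the start/stop timeouts recommended above are set per operation on the resource. A sketch in crm shell syntax (the resource name, device, and mount point are placeholders, and ocf:heartbeat:Filesystem stands in for whichever agent is used, e.g. the one from bug 20807):

```
primitive resOST1 ocf:heartbeat:Filesystem \
    params device="/dev/dm-4" directory="/mnt/ost1" fstype="lustre" \
    op start timeout="600s" \
    op stop timeout="600s" \
    op monitor interval="120s" timeout="120s"
```

The 600s values reflect the 10-minute minimum suggested in the message above; recovery after an unclean shutdown can take considerably longer.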
Re: [Lustre-discuss] Problem re-mounting Lustre on an other node
On Wednesday 14 October 2009, Michael Schwartzkopff wrote: We have timeouts of 60 seconds. But we will move to 300. Thanks for the hint. Check out my bug report, that might not be sufficient. -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Is there a way to set lru_size and have it stick?
On Tuesday 13 October 2009, Andreas Dilger wrote: On 12-Oct-09, at 12:11, Lundgren, Andrew wrote: I have tried using: # lctl conf_param content-MDT.osc.lru_size=800 Seen this in the log: Oct 12 18:35:36 abcd0202 kernel: Lustre: Modifying parameter content-MDT-mdc.osc.lru_size in log content-client Oct 12 18:35:36 abcd0202 kernel: Lustre: Skipped 1 previous similar message But then on the clients, the lru_size doesn't seem to change: OSS # cat ./fs/lustre/ldlm/namespaces/*/lru_size 33 0 0 0 1 0 1 200 I have also set it for the OST individually from the MDS. It doesn't seem to do anything for the other machines. Is this a permanently tunable parameter, or am I just specifying the wrong setting? My apologies. Any parameter settable in a /proc/fs/lustre/ file can usually be specified as obd|fsname.obdtype.proc_file_name=value, e.g.: Thanks, I think this should go into the man page of lctl. * tunefs.lustre --param mdt.group_upcall=NONE /dev/sda1 * lctl conf_param testfs-MDT.mdt.group_upcall=NONE * lctl conf_param testfs.llite.max_read_ahead_mb=16 * ... testfs-MDT.lov.stripesize=2M * ... testfs-OST.osc.max_dirty_mb=29.15 * ... testfs-OST.ost.client_cache_seconds=15 * ... testfs.sys.timeout=40 However, it isn't currently possible to specify a conf_param tunable for ldlm settings, since they do not have their own OBD device and the tunable code is (unfortunately) slightly different than other parts of the Lustre proc tunables. Ages and ages ago, this was done because the externally-contributed lprocfs code was very buggy and we wanted to make sure that the ldlm /proc tunables (which were, at the time, the only ones that were actually required for Lustre functionality) would continue working while lprocfs was disabled until fixed. Until now, there was no reason to change that code, but it makes sense to fix that now... Could you file a bug on this? 
Done, bug 21084 Cheers, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Is there a way to set lru_size and have it stick?
On Saturday 10 October 2009, Andreas Dilger wrote: On 8-Oct-09, at 22:28, Lundgren, Andrew wrote: Is there a way to set the lru_size to a fixed value and have it stay that way across mounts? I know it can be set using: $ lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100)) But that isn’t retained across a reboot. lctl set_param is only for temporary tunable settings. You can use lctl conf_param to set a permanent tunable. Would you mind providing an example line? I never understood the logic of lctl conf_param. This fails: lctl conf_param ldlm.namespaces.testfs-OST0002-osc.lru_size=800 error: conf_param: Invalid argument Thanks, Bernd ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Setup mail cluster
On Monday 12 October 2009, Michael Schwartzkopff wrote: On Monday, 12 October 2009 15:54:04, Vadym wrote: Hello, I'm designing a mail service, so I have only one question: Can Lustre provide me a full automatic failover solution? No. See the lustre manual for this. You need a cluster solution for this. The manual is *hopelessly* outdated at this point. Do NOT use heartbeat any more. Use pacemaker as the cluster manager. See www.clusterlabs.org. When I find some time I want to write a HOWTO about setting up a Lustre cluster with pacemaker and OpenAIS. Also see bug 20807 (https://bugzilla.lustre.org/show_bug.cgi?id=20807) for a pacemaker agent. Cheers, Bernd ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Is there a way to set lru_size and have it stick?
On Saturday 10 October 2009, Andreas Dilger wrote: On 8-Oct-09, at 22:28, Lundgren, Andrew wrote: Is there a way to set the lru_size to a fixed value and have it stay that way across mounts? I know it can be set using: $ lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100)) But that isn’t retained across a reboot. lctl set_param is only for temporary tunable settings. You can use lctl conf_param to set a permanent tunable. Would you mind providing an example line? I never understood the logic of lctl conf_param. This fails: lctl conf_param ldlm.namespaces.testfs-OST0002-osc.lru_size=800 error: conf_param: Invalid argument And this as well lctl conf_param testfs-MDT.ldlm.namespaces.testfs-OST0002-osc.lru_size=800 Thanks, Bernd -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
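Since conf_param cannot (yet) reach the ldlm namespace tunables, the value has to be re-applied with set_param after each mount, e.g. from an init script. The arithmetic from the recipe above can be factored out as a helper (the function name is mine):

```shell
#!/bin/sh
# lru_size_for_cpus N: the lru_size recommended in this thread for a
# client with N CPUs (N * 100).
lru_size_for_cpus() {
    echo $(( $1 * 100 ))
}

# On a client one would then run, after each mount:
#   lctl set_param ldlm.namespaces.*osc*.lru_size=$(lru_size_for_cpus "$(getconf _NPROCESSORS_ONLN)")
```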
Re: [Lustre-discuss] Moving MGS to separate device
Hello Wojciech, I already did this several times, here are the steps I so far used: 1) Remove MGS from MDT-device tunefs.lustre --nogms /dec/mdt_device 2) Create new MGS mkfs.lustre --mgs /dev/mgs_device 3) Make sure OSTs and MDTs re-register with the MGS: tunefs.lustre --writeconf /dev/device I'm not sure if writeconf is really necessary, but so far I always did it to make sure everything goes smoothly (clients shouldn't have the filesystem mounted at this step). 5) Mount MGS, MDT, OSTs 4) Re-apply settings done with lctl. As you also wrote (private mail), it might be possible to just copy over the CONFIGS directory, but I never tried to do that. Hope it helps, Bernd On Saturday 10 October 2009, Wojciech Turek wrote: Hi, I am very interested in finding out how to move co-located MGS to separate disk. I will be moving my MDTs to new hardware soon and I would like to separate MGS from MDT. I will be grateful for some info on this subject please. Many thanks, Wojciech 2008/6/23 Andreas Dilger adil...@sun.com On Jun 17, 2008 12:40 -0700, Klaus Steden wrote: I have a question ... if the MGS is used so infrequently relative to the use of the MDS, why is it (is it?) problematic to locate it on the same volume as the MDT? If you have multiple MDTs on the same MDS node (i.e. multiple Lustre filesystems) then it is difficult to start up the MGS separately from the MDT if it is co-located with one of the MDTs. It isn't impossible (with some manual mounting of the underlying filesystems) to move a co-located MGS to a separate filesystem if needed. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss -- Bernd Schubert DataDirect Networks ___ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss
Re: [Lustre-discuss] Client complaining about duplicate inode entry after luster recovery
Hello Wojciech,

Bug 17485 has a patch that landed in 1.8 to prevent duplicate references to OST objects from coming up after MDS failover. But if you create duplicate entries yourself, it won't help, of course. Bug 20412 has such a valid use case for duplicate MDT files and also lots of patches for lfsck, since the default way to fix such issues wasn't suitable for us.

Hmm, bug 18748 came up when I tested lustre-1.6.7 + the patch from bug 17485 at CIEMAT, and somehow filesystem corruption came up. I'm still not sure what the main culprit for the corruption was, either the initial patch of bug 17485 or the MDS issue with 1.6.7. Unfortunately I still didn't get a test system with at least 100 clients to reproduce the test. So in principle you shouldn't run into it, at least not with corrupted objects. I guess it will be fixed once you fix the filesystem with e2fsck and lfsck. I'm only surprised that vanilla 1.6.6 works for you, it has so many bugs...

Cheers,
Bernd

On Sunday 11 October 2009, Wojciech Turek wrote:

Hi Bernd, many thanks for your reply. I found this bug last night, and as far as I can see there is no fix for it yet? I am preparing the dbs to run lfsck on the affected file systems. I also found bug 18748 and I must say we have exactly the same problems; it just looks like we ran into that problem a few months after CIEMAT did. As far as I know, if we can see this message it means that there are files with missing objects. The worst part is that we don't know when and why files lose their objects. It just happens spontaneously and there aren't any lustre messages that could give us a clue. Users run jobs, and some time after their files were written some of these files get corrupted/lose objects (?-); trying to access these files for the first time triggers the 'lvbo' message. We have a third lustre file system which runs on different hardware but the same lustre and RHEL versions as the affected ones. I cannot see any problems on the third file system.
Wojciech

2009/10/10 Bernd Schubert bs_li...@aakef.fastmail.fm

ASSERTION(old_inode->i_state & I_FREEING) is the infamous bug 17485. You will need to run lfsck to fix it.

On Saturday 10 October 2009, Wojciech Turek wrote:

Hi, did you get to the bottom of this? We are having exactly the same problem with our lustre-1.6.6 (rhel4) file systems. Recently it got worse and the MDS crashes quite frequently; when we run e2fsck there are errors that get fixed. However, after some time we still see the same errors in the logs about missing objects, and files get corrupted (?---). Also, clients LBUG quite frequently with this message: (osc_request.c:2904:osc_set_data_with_check()) LBUG. This looks like a serious lustre problem, but so far I didn't find any clues on it even after a long search through the lustre bugzilla. Our MDSs and OSSs are on UPSes, the RAID is behaving OK, and we don't see any errors in the syslog. I would be grateful for some hints on this one.

Wojciech

2009/8/24 rishi pathak mailmaverick...@gmail.com

Hi, our lustre fs comprises 15 OST/OSS and 1 MDS with no failover. Clients as well as servers run lustre-1.6 and kernel 2.6.9-18.
Doing an ls -ltr for a directory in the lustre fs throws the following errors (as taken from the lustre logs) on the client:

0008:0002:0:1251099455.304622:0:724:0:(osc_request.c:2898:osc_set_data_with_check()) ### inconsistent l_ast_data found ns: scratch-OST0005-osc-81201e8dd800 lock: 811f9af04000/0xec0d1c36da6992fd lrc: 3/1,0 mode: PR/PR res: 570622/0 rrc: 2 type: EXT [0-18446744073709551615] (req 0-18446744073709551615) flags: 10 remote: 0xb79b445e381bc9e6 expref: -99 pid: 22878
0008:0004:0:1251099455.337868:0:724:0:(osc_request.c:2904:osc_set_data_with_check()) ASSERTION(old_inode->i_state & I_FREEING) failed: Found existing inode 811f2cf693b8/197272544/1895600178 state 0 in lock: setting data to 8118ef8ed5f8/207519777/1771835328
:0004:0:1251099455.360090:0:724:0:(osc_request.c:2904:osc_set_data_with_check()) LBUG

On the scratch-OST0005 OST it shows:

Aug 24 10:22:53 yn266 kernel: LustreError: 3023:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for resource 569204: rc -2
Aug 24 10:22:53 yn266 kernel: LustreError: 3023:0:(ldlm_resource.c:851:ldlm_resource_add()) Skipped 19 previous similar messages
Aug 24 12:40:43 yn266 kernel: LustreError: 2737:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for resource 569195: rc -2
Aug 24 12:44:59 yn266 kernel: LustreError: 2835:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for resource 569198: rc -2

We are getting these kinds of errors for many clients.

## History ## Prior
Re: [Lustre-discuss] Client complaining about duplicate inode entry after lustre recovery
ASSERTION(old_inode->i_state & I_FREEING) is the infamous bug 17485. You will need to run lfsck to fix it.

On Saturday 10 October 2009, Wojciech Turek wrote:

Hi, did you get to the bottom of this? We are having exactly the same problem with our lustre-1.6.6 (rhel4) file systems. Recently it got worse and the MDS crashes quite frequently; when we run e2fsck there are errors that get fixed. However, after some time we still see the same errors in the logs about missing objects, and files get corrupted (?---). Also, clients LBUG quite frequently with this message: (osc_request.c:2904:osc_set_data_with_check()) LBUG. This looks like a serious lustre problem, but so far I didn't find any clues on it even after a long search through the lustre bugzilla. Our MDSs and OSSs are on UPSes, the RAID is behaving OK, and we don't see any errors in the syslog. I would be grateful for some hints on this one.

Wojciech

2009/8/24 rishi pathak mailmaverick...@gmail.com

Hi, our lustre fs comprises 15 OST/OSS and 1 MDS with no failover. Clients as well as servers run lustre-1.6 and kernel 2.6.9-18.
Doing an ls -ltr for a directory in the lustre fs throws the following errors (as taken from the lustre logs) on the client:

0008:0002:0:1251099455.304622:0:724:0:(osc_request.c:2898:osc_set_data_with_check()) ### inconsistent l_ast_data found ns: scratch-OST0005-osc-81201e8dd800 lock: 811f9af04000/0xec0d1c36da6992fd lrc: 3/1,0 mode: PR/PR res: 570622/0 rrc: 2 type: EXT [0-18446744073709551615] (req 0-18446744073709551615) flags: 10 remote: 0xb79b445e381bc9e6 expref: -99 pid: 22878
0008:0004:0:1251099455.337868:0:724:0:(osc_request.c:2904:osc_set_data_with_check()) ASSERTION(old_inode->i_state & I_FREEING) failed: Found existing inode 811f2cf693b8/197272544/1895600178 state 0 in lock: setting data to 8118ef8ed5f8/207519777/1771835328
:0004:0:1251099455.360090:0:724:0:(osc_request.c:2904:osc_set_data_with_check()) LBUG

On the scratch-OST0005 OST it shows:

Aug 24 10:22:53 yn266 kernel: LustreError: 3023:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for resource 569204: rc -2
Aug 24 10:22:53 yn266 kernel: LustreError: 3023:0:(ldlm_resource.c:851:ldlm_resource_add()) Skipped 19 previous similar messages
Aug 24 12:40:43 yn266 kernel: LustreError: 2737:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for resource 569195: rc -2
Aug 24 12:44:59 yn266 kernel: LustreError: 2835:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for resource 569198: rc -2

We are getting these kinds of errors for many clients.

## History ##

Prior to these occurrences, our MDS showed signs of failure in that the cpu load was shooting above 100 (on a quad-core, quad-socket system) and users were complaining about slow storage performance. We took it offline and ran fsck on the unmounted MDS and OSTs. fsck on the OSTs went fine, but it showed some errors which were fixed. For a data integrity check, the mdsdb and ostdb databases were built and lfsck was run on a client (the client was mounted with abort_recov).
lfsck was run in the following order:

1) lfsck with no fix - reported dangling inodes and orphaned objects
2) lfsck with -l (back up orphaned objects)
3) lfsck with -d and -c (delete orphaned objects and create missing OST objects referenced by the MDS)

After the above operations, on the clients we were seeing files in red and blinking; doing a stat on them came back with the error 'no such file or directory'. My question is whether the order in which lfsck was run (should lfsck be run multiple times?) and the errors we are getting are related or not.

--
Regards
Rishi Pathak
National PARAM Supercomputing Facility
Center for Development of Advanced Computing (C-DAC)
Pune University Campus, Ganesh Khind Road
Pune, Maharashtra

--
Bernd Schubert
DataDirect Networks
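The three-pass lfsck run described above can be sketched as follows. The database paths, mount point, and exact option spellings are assumptions (check the lfsck manual for your release); with DRY_RUN=1 (the default here) the commands are only printed, not executed:

```shell
#!/bin/sh
# Sketch of the lfsck pass order described above: read-only check, back up
# orphans, then delete orphans and recreate missing objects.

MDSDB=${MDSDB:-/tmp/mdsdb}                      # built with e2fsck --mdsdb
OSTDBS=${OSTDBS:-"/tmp/ostdb.0 /tmp/ostdb.1"}   # one db per OST (--ostdb)
MNT=${MNT:-/mnt/lustre}                         # Lustre client mount point

# Print each command instead of executing it when DRY_RUN=1 (the default).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

lfsck_passes() {
    # Pass 1: read-only, reports dangling inodes and orphaned objects
    # ($OSTDBS is intentionally unquoted so it splits into one arg per db)
    run lfsck -n --mdsdb "$MDSDB" --ostdb $OSTDBS "$MNT"
    # Pass 2: back up orphaned objects (-l) before anything is deleted
    run lfsck -l --mdsdb "$MDSDB" --ostdb $OSTDBS "$MNT"
    # Pass 3: delete orphans (-d) and create missing OST objects (-c)
    run lfsck -d -c --mdsdb "$MDSDB" --ostdb $OSTDBS "$MNT"
}

lfsck_passes
```

The dry-run pass makes it easy to confirm the order (check first, back up, then fix) before letting lfsck modify anything.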
Re: [Lustre-discuss] Is there a way to set lru_size and have it stick?
On Friday 09 October 2009, Lundgren, Andrew wrote:

Is there a way to set the lru_size to a fixed value and have it stay that way across mounts? I know it can be set using:

$ lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))

But that isn't retained across a reboot.

Even worse: if for some reason the connection to the OSTs gets lost, e.g. through evictions, it will also reset to the default. For now, we are compiling our packages with LRU resize disabled.

--
Bernd Schubert
DataDirect Networks
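Short of rebuilding packages, one common workaround (a sketch, not an official persistence mechanism) is to re-apply the setting from a boot or mount script on each client. With DRY_RUN=1 (the default here) the lctl call is only printed:

```shell
#!/bin/sh
# Sketch: recompute and re-apply lru_size after every client mount, since
# `lctl set_param` survives neither reboots nor OST reconnects.

NR_CPU=$(getconf _NPROCESSORS_ONLN)   # online CPUs on this client
LRU_SIZE=$((NR_CPU * 100))            # NR_CPU*100 formula quoted above

# Print the command instead of executing it when DRY_RUN=1 (the default).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Quoting keeps the shell from glob-expanding the parameter pattern;
# lctl does its own wildcard matching on the namespace names.
run lctl set_param "ldlm.namespaces.*osc*.lru_size=$LRU_SIZE"
```

Because the value also resets after evictions, such a script would need to run not just at boot but whenever the client remounts or reconnects.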
[Lustre-discuss] 1.8 download link outdated
Hello,

this link still points to the alpha version; I guess it should be updated the same way as the v1.6 one: http://downloads.lustre.org/public/lustre/v1.8/

Cheers,
Bernd
--
Bernd Schubert
DataDirect Networks
Re: [Lustre-discuss] mds server crashing
Hello Mag,

sorry for my late reply. I think there is a misunderstanding: the bug I'm talking about is triggered if you export Lustre via knfsd. It does not matter whether you use any other NFS services on your MDS/OSS systems. But if you do export Lustre over NFS using the in-kernel NFS daemon, try disabling that.

Cheers,
Bernd

On Sunday 15 March 2009, Mag Gam wrote:

This happened again :-( Basically, there is a process called ll_mdt30 which is taking up 100% of the CPU. I am not sure what it is doing, but I can't even reboot the system; I have to hard reboot. Also, I checked my other OSTs and MDS and I don't have anything special for NFS in /etc/modules.conf.

On Sat, Mar 14, 2009 at 8:35 AM, Mag Gam magaw...@gmail.com wrote:

Hey Bernd, thanks for the reply. Interesting, we are using it with NFS too. Is there something in particular we need to do, like enabling port 988 in /etc/modules.conf, which I think I am already doing?

Any chance you can send traces with line wrap disabled? With line wrapping it is quite hard to read.

Of course! I even posted a bug report with the /tmp/lustre.log: https://bugzilla.lustre.org/show_bug.cgi?id=18802 Let me know if you need anything else. TIA

On Sat, Mar 14, 2009 at 7:35 AM, Bernd Schubert bernd.schub...@fastmail.fm wrote:

On Saturday 14 March 2009, Mag Gam wrote:

We are having a problem with an MDS server (which also has 1 OST on the box). When the server boots up, we notice there is an ll_mdt process running at 100%, and we keep on waiting close to 10-15 mins. We only have 8 clients. (I assume this is the normal recovery process.) However, if I manually mount the mdt without any recovery, everything is fine.

Hmm, I have seen that with 1.6.4.3 and NFS exports, but that should be fixed in 1.6.5. Although I'm not sure, since we switched NFS exports to unfs3 when the problem came up.
Mar 12 10:11:02 protected_host_01 kernel: Pid: 10375, comm: ll_mdt_10 Tainted: G 2.6.18-92.1.17.el5_lustre.1.6.7smp #1
Mar 12 10:11:02 protected_host_01 kernel: RIP: 0010:[888ed8df] [888ed8df] :ldiskfs:do_split+0x3ef/0x560
Mar 12 10:11:02 protected_host_01 kernel: RSP: 0018:8103d2a5f460 EFLAGS: 0216
Mar 12 10:11:02 protected_host_01 kernel: RAX: RBX: 0080 RCX:
Mar 12 10:11:02 protected_host_01 kernel: RDX: 0080 RSI: 8103cd52177c RDI: 8103cd52176c

Any chance you can send traces with line wrap disabled? With line wrapping it is quite hard to read.

Cheers,
Bernd
Re: [Lustre-discuss] mds server crashing
On Saturday 14 March 2009, Mag Gam wrote:

We are having a problem with an MDS server (which also has 1 OST on the box). When the server boots up, we notice there is an ll_mdt process running at 100%, and we keep on waiting close to 10-15 mins. We only have 8 clients. (I assume this is the normal recovery process.) However, if I manually mount the mdt without any recovery, everything is fine.

Hmm, I have seen that with 1.6.4.3 and NFS exports, but that should be fixed in 1.6.5. Although I'm not sure, since we switched NFS exports to unfs3 when the problem came up.

Mar 12 10:11:02 protected_host_01 kernel: Pid: 10375, comm: ll_mdt_10 Tainted: G 2.6.18-92.1.17.el5_lustre.1.6.7smp #1
Mar 12 10:11:02 protected_host_01 kernel: RIP: 0010:[888ed8df] [888ed8df] :ldiskfs:do_split+0x3ef/0x560
Mar 12 10:11:02 protected_host_01 kernel: RSP: 0018:8103d2a5f460 EFLAGS: 0216
Mar 12 10:11:02 protected_host_01 kernel: RAX: RBX: 0080 RCX:
Mar 12 10:11:02 protected_host_01 kernel: RDX: 0080 RSI: 8103cd52177c RDI: 8103cd52176c

Any chance you can send traces with line wrap disabled? With line wrapping it is quite hard to read.

Cheers,
Bernd
[Lustre-discuss] source download doesn't work
Hello,

since the end of last week I have been trying to download the sources of 1.6.7, but I always get:

We are sorry ... General Error
We are sorry, but the download system cannot process your request at this time. Please try again later. If the problem persists, please report it to Customer Service.

I'm going to report it to Customer Service now and will also try to find the proper cvs branch.

Thanks,
Bernd