[lustre-discuss] Collecting/analyzing logs for client evictions related lock timeouts

2020-12-22 Thread Oleg Drokin
Hello! There’s been an inrush of tickets in Jira recently about clients being evicted for being unresponsive to lock callbacks, which is just a symptom of potentially many different underlying problems. Having gone through several of those, I did a writeup on what logs are needed
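For reference, collecting lock-related logs usually means enabling DLM tracing on the affected client and server before reproducing the eviction; a minimal sketch (log path arbitrary):

  lctl set_param debug=+dlmtrace   # add lock-traffic tracing to the debug mask
  # ... reproduce the eviction, then dump the kernel debug buffer on each node:
  lctl dk > /tmp/lustre-debug.log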

Re: [lustre-discuss] Meaning of 'slow creates' messages on MDS

2017-05-30 Thread Oleg Drokin
On May 28, 2017, at 3:09 PM, Russell Dekema wrote: > Greetings, > > We have been having various kinds of trouble with our Lustre > filesystem lately; right now the main problem we are having is > intermittent severe slowness (such as 30 seconds for an 'ls' of a > directory containing 100 files

Re: [lustre-discuss] lost files on ZFS

2016-11-06 Thread Oleg Drokin
Hello! On Oct 30, 2016, at 8:33 AM, Thomas Roth wrote: > Hi all, > > we have a larger amount of files that give ??? on 'ls' and the error "Cannot > allocate memory" > The corresponding error on the OSS is > "lvbo_init failed for resource ... rc = -2" > > This seems similar to LU-5457

Re: [lustre-discuss] Filesystem hanging....

2016-08-26 Thread Oleg Drokin
On Aug 14, 2016, at 1:13 PM, Phill Harvey-Smith wrote: > On 14/08/2016 03:09, Stephane Thiell wrote: >> Hi Phil, > > Phill :) > >> I understand that you’re running master on your clients (tag v2_8_56 >> was created 4 days ago) and 2.1 on the servers? Running master in >> production is already

Re: [lustre-discuss] Does an updated version exist?

2016-08-26 Thread Oleg Drokin
On Aug 16, 2016, at 6:55 AM, E.S. Rosenberg wrote: > I just found this paper: > http://wiki.lustre.org/images/d/da/Understanding_Lustre_Filesystem_Internals.pdf > > It looks interesting but it deals with lustre 1.6 so I am not sure how > relevant it still is. Well, I believe it deals with

Re: [Lustre-discuss] question about dcache revalidate

2012-01-19 Thread Oleg Drokin
the case, the check in d_compare was added a few years later and before then it was perfectly possible to find invalid dentries and that's why we had this d_revalidate check hitting. Bye, Oleg

Re: [Lustre-discuss] question about dcache revalidate

2012-01-12 Thread Oleg Drokin

Re: [Lustre-discuss] Finding bugs in Lustre with Coccinelle

2012-01-12 Thread Oleg Drokin
of smaller parts and attach it to http://bugs.whamcloud.com/browse/LU-871, the bug that I previously opened to track defects found by Clang/LLVM. Also would be great to run something like this on the 2.2 codebase, I guess. Bye, Oleg

Re: [Lustre-discuss] Lustre v2.1 RHEL 6.1 build does not work

2011-12-07 Thread Oleg Drokin
On Sun, Oct 2, 2011 at 4:31 PM, Jon Zhu jon@gmail.com wrote: Thanks a lot, the workaround works. -Jon. On Sun, Oct 2, 2011 at 3:47 PM, Oleg Drokin gr...@whamcloud.com wrote: Hello! Last time I hit this (some years ago), a simple touch ldiskfs

Re: [Lustre-discuss] Lustre v2.1 RHEL 6.1 build does not work

2011-10-02 Thread Oleg Drokin
make: *** [rpms] Error 2 Thanks, -Jon. On Thu, Sep 29, 2011 at 11:34 PM, Oleg Drokin gr...@whamcloud.com wrote: Hello! There is nothing special, same as rhel6.1: unpack the lustre source, run autogen.sh, run configure and provide the path to the linux kernel source for your

Re: [Lustre-discuss] Lustre v2.1 RHEL 6.1 build does not work

2011-09-29 Thread Oleg Drokin
a procedure on how to build v2.1 GA code on CentOS 5.6 (xen)? On whamcloud wiki I can only find build v2.1 on RHEL 6.1 or build v1.8 on CentOS 5.6. BTW, congratulations on the 2.1 release! Regards, Jon Zhu Sent from Google Mail On Fri, Jun 24, 2011 at 2:43 PM, Oleg Drokin gr

Re: [Lustre-discuss] Slow NFS/CIFS when exporting Lustre

2011-09-26 Thread Oleg Drokin
, that would probably have a pretty negative impact too. Bye, Oleg

Re: [Lustre-discuss] a version of lustre that works with RHEL55

2011-09-19 Thread Oleg Drokin
These are just warnings, I guess you skipped the errors. I would expect the patches won't apply, and thus it fails. Bye, Oleg

Re: [Lustre-discuss] Lustre client server on the same node, NFSv4 reexport

2011-08-18 Thread Oleg Drokin
does not work properly in 1.8 in that case. It does work with 2.1 clients. Bye, Oleg

Re: [Lustre-discuss] New to Lustre, test install.

2011-08-09 Thread Oleg Drokin
setting the DEBUG_SIZE environment variable to something bigger, like double your CPU core count (this is in megabytes for the debug buffers; if you are not interested in the debug buffers, set it to 0). Bye, Oleg
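A sketch of applying this advice (assuming a test setup whose scripts read DEBUG_SIZE; debug_mb is the equivalent tunable on an already-running node, values illustrative):

  export DEBUG_SIZE=0          # MB of debug buffers; 0 disables them
  lctl set_param debug_mb=0    # same effect on a live node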

Re: [Lustre-discuss] potential issue with data corruption

2011-07-14 Thread Oleg Drokin
to try 1.8.6 and see if it improves things. Bye, Oleg

Re: [Lustre-discuss] potential issue with data corruption

2011-07-14 Thread Oleg Drokin
the clients one way or the other. Bye, Oleg

Re: [Lustre-discuss] Lustre v2.1 RHEL 6.1 build does not work

2011-06-24 Thread Oleg Drokin
never seen anything like that in rhel5 xen kernels, perhaps it's something with rhel6.1 xen? Bye, Oleg

Re: [Lustre-discuss] Lustre v2.1 RHEL 6.1 build does not work

2011-06-24 Thread Oleg Drokin
during the test so that's why the other client cannot list files inside it? I guess so, after I stopped the fileop test program I can get into the directory and there is nothing in it. Thanks, -Jon. On Fri, Jun 24, 2011 at 5:11 PM, Oleg Drokin gr...@whamcloud.com wrote: Did it delete

Re: [Lustre-discuss] Lustre v2.1 RHEL 6.1 build does not work

2011-06-24 Thread Oleg Drokin
check out the latest 2.1 and build it against your kernel-devel package (pass --with-linux=/lib/modules/`uname -r`/build to configure while booted into the rhel6.1 kernel from RH). Bye, Oleg
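The full sequence would look roughly like this (branch name and paths assumed, not from the original message):

  git checkout b2_1    # 2.1 maintenance branch
  sh autogen.sh
  ./configure --with-linux=/lib/modules/$(uname -r)/build
  make rpms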

Re: [Lustre-discuss] What exactly is punch statistic?

2011-06-17 Thread Oleg Drokin

Re: [Lustre-discuss] MDT error messages

2011-06-07 Thread Oleg Drokin
Hello! On Jun 7, 2011, at 7:49 AM, Thomas Roth wrote: there are some new error messages on our MDT, haven't seen these before and according to Google nobody else has... The usual question: what does it mean? Something to worry about? Jun 7 06:23:53 lxmds kernel: [4565451.097596]

Re: [Lustre-discuss] df and du difference on lustre fs

2011-05-26 Thread Oleg Drokin
orphan objects. Alternatively, the next time your MDS restarts, such orphaned objects should also be destroyed. Bye, Oleg

Re: [Lustre-discuss] Information regarding the FILE HANDLE

2011-05-26 Thread Oleg Drokin
of EA, i.e. the extended attributes? There are multiple things that could be called a file handle, so it would be great if you explain a little bit about what it is you are actually looking for. Bye, Oleg

Re: [Lustre-discuss] Checksums of files on disk

2011-05-25 Thread Oleg Drokin
Lustre by Sun Microsystems which is now somewhat stale.) Bye, Oleg

Re: [Lustre-discuss] Question on path name resolution in Lustre

2011-05-05 Thread Oleg Drokin
Hello! On May 5, 2011, at 2:37 AM, vilobh meshram wrote: I have noticed that for file or directory kind of operation in Lustre, the Lock Manager grabs an EX (Exclusive lock) on the parent directory and then creates a directory or file inside it. Is there a specific reason behind this logic

Re: [Lustre-discuss] Lustre 2.0 client cache size

2011-03-17 Thread Oleg Drokin
Hello! On Mar 17, 2011, at 5:44 PM, Andreas Dilger wrote: I did not find if this was removed or this was partially included in Lustre 2.0. What's the current status of this and how can I tell to my client to avoid caching too many data? The client VM usage was one of the areas that was

Re: [Lustre-discuss] clean unmounting of OST with external journal

2011-03-06 Thread Oleg Drokin
Hello! On Mar 6, 2011, at 8:43 PM, Samuel Aparicio wrote: now an attempt to re-mount the OST fails with LDISKFS-fs (md14): failed to open journal device unknown-block(152,225): -6 an e2fsck fixes this external superblock [root@OST2 ~]# e2fsck -j /dev/etherd/e9.24p1 /dev/md14 e2fsck
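If the external journal superblock keeps getting corrupted, one illustrative recovery path (not from the original thread; run only while the OST is unmounted) is to drop and recreate the external journal:

  tune2fs -f -O ^has_journal /dev/md14              # detach the broken external journal
  tune2fs -J device=/dev/etherd/e9.24p1 /dev/md14   # re-attach the journal device
  e2fsck -fy -j /dev/etherd/e9.24p1 /dev/md14       # full check using the external journal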

Re: [Lustre-discuss] osc_brw_redo_request error on clients

2011-02-09 Thread Oleg Drokin
Hello! On Feb 9, 2011, at 4:35 PM, James Robnett wrote: Normally I've had no problems but recently I have multiple clients reporting the following error: LustreError: 3935:0:(osc_request.c:1629:osc_brw_redo_request()) @@@ redo for recoverable error req@8101ae084000

Re: [Lustre-discuss] LustreError

2011-01-25 Thread Oleg Drokin
Hello! It's not necessarily missing; some other factors might be in play. E.g. if you have a somewhat older version of Lustre and export it via NFS from this node, I think there was a bug leading to such messages. If it is indeed missing, e2fsck should fix a case where a directory entry

Re: [Lustre-discuss] write RPC congestion

2010-12-21 Thread Oleg Drokin
Hello! I guess I am a little bit late to the party, but I was just reading comments in bug 16900 and have this question I really need to ask. On Aug 23, 2010, at 10:58 PM, Jeremy Filizetti wrote: The larger RPCs from bug 16900 offered some significant performance when working over the WAN.

Re: [Lustre-discuss] write RPC congestion

2010-12-21 Thread Oleg Drokin
Hello! On Dec 22, 2010, at 12:43 AM, Jeremy Filizetti wrote: In the attachment I created that Andreas posted at https://bugzilla.lustre.org/attachment.cgi?id=31423 if you look at graph 1 and 2 they are both using larger than default max_rpcs_in_flight. I believe the data without the
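For reference, the tunable under discussion is per-OSC on the client; a quick sketch of checking and raising it (value illustrative):

  lctl get_param osc.*.max_rpcs_in_flight
  lctl set_param osc.*.max_rpcs_in_flight=32   # default is 8; higher values help on high-latency WAN links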

Re: [Lustre-discuss] OST errors caused by residual client info?

2010-12-06 Thread Oleg Drokin
Hello! On Dec 6, 2010, at 6:50 PM, Jeff Johnson wrote: Previous test incarnations of this filesystem were built where ost name was not assigned (e.g.: OST) and was assigned upon first mount and connection to the mds. Is it possible that some clients have residual pointers or config

Re: [Lustre-discuss] OST errors caused by residual client info?

2010-12-06 Thread Oleg Drokin
Hello! On Dec 6, 2010, at 7:05 PM, Jeff Johnson wrote: Previous test incarnations of this filesystem were built where ost name was not assigned (e.g.: OST) and was assigned upon first mount and connection to the mds. Is it possible that some clients have residual pointers or config data

Re: [Lustre-discuss] target_send_reply_msg errors

2010-12-02 Thread Oleg Drokin
Hello! Essentially your client(s) got disconnected from the MGS for some reason (somewhere earlier in the MGS logs you should see something about that). The clients did not know they were disconnected and discover this sad fact the next time they try to talk to the MGS (sending their periodic PINGs

Re: [Lustre-discuss] [Lustre-Discuss]-

2010-11-22 Thread Oleg Drokin
Actually note that it is conflicting with existing ext4progs, not ext2, so should not be all that hard. Besides, lustre-enabled ext2 should have all the stuff that's already in ext4progs I would imagine. On Nov 22, 2010, at 11:05 PM, Alexey Lyashkov wrote: removing e2fsprogs from live system

Re: [Lustre-discuss] Broken client

2010-11-18 Thread Oleg Drokin
Hello! On Nov 18, 2010, at 7:18 AM, Herbert Fruchtl wrote: Rebooting the client doesn't change anything. Is it broken, or is there some persistent information that I need to flush? When I do an ls on a partially broken directory, I get the following two lines in /var/log/messages: Nov 18

Re: [Lustre-discuss] [Fwd: Re: Broken client]

2010-11-18 Thread Oleg Drokin
Hello! So are there any other complaints on the OSS node when you mount that OST? Did you try to run e2fsck on the OST disk itself (while unmounted)? I assume one of the possible problems is just on-disk fs corruption (and it might show unhealthy due to that right after mount too). Bye,

Re: [Lustre-discuss] mds/t bug

2010-09-06 Thread Oleg Drokin
Hello! On Sep 3, 2010, at 4:52 PM, John White wrote: Can someone help me out figuring out what's wrong here? We have an MDS/T that keeps causing problems. I have 2 dozen or so dumps from threads crashing. The threads in question all appear to be either ll_mdt_rdpg_ ? or

Re: [Lustre-discuss] Samba and file locking

2010-08-30 Thread Oleg Drokin
samba do that is different? We are using lustre to replace our old nfs server for serving up home directories in our cluster and the rest of our systems. On Fri, Aug 27, 2010 at 6:15 PM, Oleg Drokin oleg.dro...@oracle.com wrote: Hello! On Aug 27, 2010, at 6:41 PM, David Noriega wrote: But I

Re: [Lustre-discuss] Disabling locks in Lustre

2010-08-27 Thread Oleg Drokin
Hello! On Aug 26, 2010, at 1:07 PM, Dulcardo Arteaga Clavijo wrote: I am trying to compare the performance of Lustre for parallel write to a shared file with locks and without locks. But after doing some experiments I didn't see any performance improvement when I run without locks. It all

Re: [Lustre-discuss] Samba and file locking

2010-08-27 Thread Oleg Drokin
Hello! On Aug 27, 2010, at 6:41 PM, David Noriega wrote: But I also found out about the flock option for lustre. Should I set flock on all clients? or can I just use localflock option on the fileserver? It depends. If you are 100% sure none of your other clients use flocks in a way similar to
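The two options being weighed are client mount options; a sketch (server and fs names made up):

  mount -t lustre -o flock mds01@tcp0:/lustre /mnt/lustre        # coherent cluster-wide flock
  mount -t lustre -o localflock mds01@tcp0:/lustre /mnt/lustre   # node-local flock only; safe only if no other client locks the same files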

Re: [Lustre-discuss] Enabling async journals while the filesystem is active

2010-08-21 Thread Oleg Drokin
Hello! You would need to upgrade your clients to at least 1.8.2, otherwise you might hit bug 19128 during replay that would lead to losing some of the data being replayed. Version on the routers is not important for async journals feature. Bye, Oleg On Aug 21, 2010, at 10:06 AM, Erik

Re: [Lustre-discuss] Fwd: Lustre and Large Pages

2010-08-19 Thread Oleg Drokin
Hello! On Aug 19, 2010, at 7:07 PM, Andreas Dilger wrote: If you want to flush all the memory used by a Lustre client between jobs, you can do lctl set_param ldlm.namespaces.*.lru_size=clear. Unlike Kevin's suggestion it is Lustre-specific, while drop_caches will try to flush memory from
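Both commands mentioned here, side by side; the first is Lustre-specific, the second asks the kernel to shrink all caches:

  lctl set_param ldlm.namespaces.*.lru_size=clear   # drop the client's DLM locks (and the cached data they protect)
  echo 3 > /proc/sys/vm/drop_caches                 # drop pagecache, dentries and inodes system-wide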

Re: [Lustre-discuss] Client directory entry caching

2010-08-04 Thread Oleg Drokin
Hello! On Aug 4, 2010, at 3:41 AM, Andreas Dilger wrote: mkdir(/mnt/lustre/blah2/b/c/d/e/f/g, 040755) = 0 +1 RPC lstat(/mnt/lustre/blah2/b/c/d/e/f/g, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 +1 RPC If we do the mkdir(), the client does not cache the entry? No. mkdir cannot return

Re: [Lustre-discuss] Client directory entry caching

2010-08-04 Thread Oleg Drokin
Hello! On Aug 4, 2010, at 2:04 PM, Daire Byrne wrote: Hm, initially I was going to say that find is not open-intensive so it should not benefit from opencache at all. But then I realized if you have a lot of dirs, then indeed there would be a positive impact on subsequent reruns. I assume

Re: [Lustre-discuss] Client directory entry caching

2010-08-03 Thread Oleg Drokin
Hello! On Aug 3, 2010, at 12:49 PM, Daire Byrne wrote: So even with the metadata going over NFS the opencache in the client seems to make quite a difference (I'm not sure how much the NFS client caches though). As expected I see no mdt activity for the NFS export once cached. I think it would

Re: [Lustre-discuss] Client directory entry caching

2010-08-03 Thread Oleg Drokin
like opencache isn't generally useful unless enabled on every node. Is there an easy way to force files out of the cache (i.e., echo 3 > /proc/sys/vm/drop_caches)? Kevin On Aug 3, 2010, at 11:50 AM, Oleg Drokin oleg.dro...@oracle.com wrote: Hello! On Aug 3, 2010, at 12:49 PM, Daire

Re: [Lustre-discuss] Client directory entry caching

2010-08-03 Thread Oleg Drokin
Hello! On Aug 3, 2010, at 10:59 PM, Jeremy Filizetti wrote: Another consideration for WAN performance when creating files is the stripe count. When you start writing to a file, the first RPC to each OSC requests the lock, rather than requesting the lock from all OSCs when the first lock is

Re: [Lustre-discuss] Client directory entry caching

2010-08-02 Thread Oleg Drokin
Hello! On Jul 30, 2010, at 7:20 AM, Daire Byrne wrote: Ah yes... that makes sense. I recall the opencache gave a big boost in performance for NFS exporting but I wasn't sure if it had become the default. I haven't been keeping up to date with Lustre developments. It was default for NFS for

Re: [Lustre-discuss] lock callback timer expired, lock on destroyed export, locks stolen, busy with active RPCs, operation 400 on unconnected MDS

2010-05-03 Thread Oleg Drokin
Hello! On May 3, 2010, at 11:49 AM, Thomas Roth wrote: We found a user job submission script that probably caused all this by starting - several hundred (900) jobs simultaneously - all of them opening one and the same file for batch system errors and one and the same file for its output.

Re: [Lustre-discuss] Newbie w/issues

2010-04-28 Thread Oleg Drokin
Hello! On Apr 27, 2010, at 7:29 PM, Brian Andrus wrote: Apr 27 16:15:19 nas-0-1 kernel: LustreError: 4133:0:(ldlm_lib.c:1848:target_send_reply_msg()) @@@ processing error (-107) r...@810669d35c50 x1334203739385128/t0 o400-?@?:0/0 lens 192/0 e 0 to 0 dl 1272410135 ref 1 fl

Re: [Lustre-discuss] Newbie w/issues

2010-04-28 Thread Oleg Drokin
Hello! On Apr 27, 2010, at 9:38 PM, Brian Andrus wrote: Odd, I just went through the log on the MDT and basically it has been repeating those errors for over 24 hours (not spewing, but often enough). only ONE other line on an ost: Each such message means there was an attempt to send a ping

Re: [Lustre-discuss] Running fsck on disabled ost?

2010-03-18 Thread Oleg Drokin
Hello! On Mar 18, 2010, at 1:36 PM, Roy Dragseth wrote: Is it possible to fsck on a disabled and drained OST that is mounted readonly? We need to fsck an OST and would like to avoid a lengthy downtime while doing it. My plan is to disable and drain the files from the OST and then remount
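A hedged sketch of the drain-then-check sequence being proposed (device and fs names are made up, not from the message):

  # on the MDS: stop new object allocations on that OST
  lctl --device lustre-OST0005-osc deactivate
  # on a client: locate files with objects on it so they can be copied elsewhere
  lfs find --obd lustre-OST0005_UUID /mnt/lustre
  # on the OSS, once drained and unmounted: read-only check first
  e2fsck -fn /dev/sdX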

Re: [Lustre-discuss] RPC limitation

2010-03-09 Thread Oleg Drokin
Hello! This only works if all the requests are for the same file; then it is done for you automatically (assuming that these are write requests and there is no sync in between). It's impossible to do for reads, for the obvious reason that a read is a synchronous operation and by the time we

Re: [Lustre-discuss] Question regarding caution statement in 1.8 manual for the consistent mode flock option

2010-03-05 Thread Oleg Drokin
Hello! On Mar 5, 2010, at 5:25 PM, Andreas Dilger wrote: On 2010-03-05, at 15:18, Jagga Soorma wrote: Is there an impact if the option is turned on, or only if it is turned on and used? Is the impact local to the file being locked, the machine on which that file is locked, or the entire

Re: [Lustre-discuss] One or two OSS, no difference?

2010-03-03 Thread Oleg Drokin
Hello! On Mar 3, 2010, at 6:35 PM, Jeffrey Bennett wrote: We are building a very small Lustre cluster with 32 clients (patchless) and two OSS servers. Each OSS server has 1 OST with 1 TB of Solid State Drives. All is connected using dual-port DDR IB. For testing purposes, I am

Re: [Lustre-discuss] High Load and high system CPU for mds

2010-03-01 Thread Oleg Drokin
Hello! On Feb 28, 2010, at 9:31 PM, huangql wrote: We have a problem where the MDS has a high load value and system CPU is up to 60% when running a chown command on a client. It's strange that the load value and system CPU didn't decrease to the normal level after getting high. Even we

Re: [Lustre-discuss] renamed directory retains a dentry under its old name?

2009-11-19 Thread Oleg Drokin
Hello! On Nov 19, 2009, at 7:06 AM, Phil Schwan wrote: Hello old friends! I return with a gift, like an almost-forgotten uncle visiting from a faraway land. Long time no see! ;) I have an interesting issue, on 1.6.6: # cat /proc/fs/lustre/version lustre: 1.6.6 kernel: patchless build:

Re: [Lustre-discuss] Soft CPU Lockup

2009-10-05 Thread Oleg Drokin
Hello! On Oct 5, 2009, at 4:40 PM, Hendelman, Rob wrote: It looks like the threads finally died. The 2 CPU cores that were pegged at 100% are idle again. That seems like one heck of a timeout... Was there a client eviction right before this message? The watchdog trace from your previous

Re: [Lustre-discuss] Lustre client kernel panic

2009-09-26 Thread Oleg Drokin
Hello! On Sep 26, 2009, at 1:57 AM, Nick Jennings wrote: About an hour ago the client completely hung. The hosting co. says it was a kernel panic. I got no useful feedback in /var/log/messages from the client or the MDS. However, from the OST I got several complaints (below). Does anyone

Re: [Lustre-discuss] Lustre client kernel panic

2009-09-26 Thread Oleg Drokin
Hello! On Sep 26, 2009, at 9:37 AM, Brian J. Murrell wrote: Unfortunately that was the only info I could get. The client had no information in the logs about what happened. They usually don't when they panic. Right. RHEL is configured to panic on oops too; if you disable that (in /

Re: [Lustre-discuss] Large directories optimization

2009-09-23 Thread Oleg Drokin
Hello! On Sep 23, 2009, at 7:47 AM, Lukas Hejtmanek wrote: I limit oss_num_threads instead? Yes. (they are the same thing anyway) Thanks Oleg. One more question, this limit is per kernel module or per OST mount? E.g., I have 1 physical server that hosts 2 OST servers - OST0, OST1. This
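For reference, the parameter is an option of the ost kernel module, set at module load time (value illustrative); whether the limit is shared across OSTs on one server is exactly the question being asked here:

  # /etc/modprobe.d/lustre.conf
  options ost oss_num_threads=128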

Re: [Lustre-discuss] Large directories optimization

2009-09-22 Thread Oleg Drokin
Hello! On Sep 22, 2009, at 7:10 AM, Lukas Hejtmanek wrote: On Thu, Sep 17, 2009 at 04:17:54PM -0400, Oleg Drokin wrote: If you bring down the load on the OSTs (read this list, recently there were several methods discussed like bringing down number of service threads) that should help

Re: [Lustre-discuss] What are the current striping limits?

2009-09-20 Thread Oleg Drokin
Hello! On Sep 20, 2009, at 3:51 PM, Geoff Lustre wrote: Dear List The excellent wiki: http://www.inter-mezzo.org/index.php?title=MDS_striping_format states that Note: Limits for stripe settings are: • Maximum striping count for a single file is 160. This is still current (work in

Re: [Lustre-discuss] Large directories optimization

2009-09-17 Thread Oleg Drokin
Hello! On Sep 17, 2009, at 7:28 AM, Lukas Hejtmanek wrote: LustreError: 11-0: an error occurred while communicating with x.x@tcp. The mds_connect operation failed with -16 Lustre: Request x112815827 sent from stable-OST0001-osc- 8802855b7800 to NID x.x@tcp 100s ago has timed

Re: [Lustre-discuss] Kernel Panic in MDS

2009-09-11 Thread Oleg Drokin
Hello! On Sep 11, 2009, at 1:17 AM, Muruga Prabu M wrote: I have a small java application that uploads images into the lustre filesystem. When I try to upload images from the application, the MDS server crashes and a kernel panic happens. I have attached the output of 'dmesg', and the

Re: [Lustre-discuss] WARNING: data corruption issue found in 1.8.x releases

2009-09-11 Thread Oleg Drokin
Hello! On Sep 11, 2009, at 9:33 AM, Aaron Knister wrote: Is the read cache corruption actually causing on-disk corruption? Or just in-memory corruption? I'm assuming the write cache corruption would end up causing the file to become corrupt on disk, but if a node crashes during a write

Re: [Lustre-discuss] OSTs hanging while running IOR

2009-09-09 Thread Oleg Drokin
Hello! On Sep 9, 2009, at 1:31 PM, Rafael David Tinoco wrote: One of my OSSs crashes, sometimes one, sometimes another, with the following error: That's not a crash. That's a watchdog timeout, indicative of Lustre spending too much time waiting on I/O. As such you need to somehow decrease the

Re: [Lustre-discuss] WARNING: data corruption issue found in 1.8.x releases

2009-09-09 Thread Oleg Drokin
Hello! On Sep 9, 2009, at 2:07 PM, Charles A. Taylor wrote: Anyway, your email concerned us so we issued the recommended commands on our OSSs to disable the caching. That promptly crashed two of our OSSs. We got the servers back up and after fsck'ing (fsck.ext4) all the OSTs and

Re: [Lustre-discuss] Bad read performance

2009-08-20 Thread Oleg Drokin
Hello! Any chance you can use a more modern release like 1.8.1? A number of bugs were fixed, including some in the readahead logic that could impede read performance. Bye, Oleg On Aug 20, 2009, at 10:38 PM, Alvaro Aguilera wrote: Thanks for pointing that out. I was using the

Re: [Lustre-discuss] MDS refuses connections (no visible reason)

2009-08-18 Thread Oleg Drokin
Hello! On Aug 18, 2009, at 4:27 AM, Patricia Santos Marco wrote: Our MDT has lustre 1.6.7. I see in this message http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010167.html that this version has a bug that causes directory corruption on the MDT. Can this bug produce this

Re: [Lustre-discuss] MDS refuses connections (no visible reason)

2009-08-18 Thread Oleg Drokin
Hello! On Aug 18, 2009, at 8:23 AM, Mag Gam wrote: just curious, if you didn't compile your own kernel, how do you apply this patch? Is our only option to upgrade via RPMS or is there another way to apply the patch? This patch is to lustre itself, not to a kernel. So you just need lustre

Re: [Lustre-discuss] MDS refuses connections (no visible reason)

2009-08-17 Thread Oleg Drokin
Hello! On Aug 17, 2009, at 2:14 PM, Patricia Santos Marco wrote: The last day our MDS was refusing connections too. The logs are the same, and we had to reboot the MDS server. What is the reason for this? That means some requests from this client are still being processed and the server has a

Re: [Lustre-discuss] Lustre playground in VirtualBox?

2009-08-10 Thread Oleg Drokin
Hello! On Aug 10, 2009, at 9:39 AM, Wolfgang Stief wrote: Before I start installing and fiddling around: Are there any reasons AGAINST setting up a Lustre playground in a VirtualBox environment? I just want to play around w/ recovery and debugging situations and upgrades. No performance

Re: [Lustre-discuss] Large scale delete results in lag on clients

2009-08-10 Thread Oleg Drokin
Hello! On Aug 10, 2009, at 11:03 PM, Jim McCusker wrote: On Monday, August 10, 2009, Oleg Drokin oleg.dro...@sun.com wrote: What lustre version is it now? We used to have uncontrolled unlinking where OSTs might get swamped with unlink requests. Now we limit to 8 unlinks to OST at any

Re: [Lustre-discuss] Inode errors at time of job failure

2009-08-06 Thread Oleg Drokin
Hello! On Aug 6, 2009, at 12:57 PM, Thomas Roth wrote: Hi, these ll_inode_revalidate_fini errors are unfortunately quite known to us. So what would you guess if that happens again and again, on a number of clients - MDT softly dying away? No, I do not think this is MDT problem of any

Re: [Lustre-discuss] Inode errors at time of job failure

2009-08-05 Thread Oleg Drokin
Hello! On Aug 5, 2009, at 3:12 PM, Daniel Kulinski wrote: What would cause the following error to appear? Typically this is some sort of a race where you presume an inode exists (because you have some traces of it in memory), but it is not there anymore (on the MDS, anyway). So when the client comes to

Re: [Lustre-discuss] 1.8 : recurrent LBUG's on clients

2009-08-04 Thread Oleg Drokin
Hello! On Jul 31, 2009, at 3:15 AM, Guillaume Demillecamps wrote: All servers and clients are having Lustre 1.8, on SLES 10 SP2. Clients use patchless kernels, using same base revision as the ones for the patched kernel servers. We recurrently encounter this error : Chances are you are

Re: [Lustre-discuss] odd file entries

2009-07-24 Thread Oleg Drokin
Hello! On Jul 24, 2009, at 7:04 PM, Andreas Dilger wrote: On Jul 24, 2009 15:29 -0700, John White wrote: So we have a new file system set up: beefy OSTs, but certainly under-sized metadata (we're still figuring out what we'll use in the end). We've just started to do friendly-user

Re: [Lustre-discuss] lustre-discuss-list - 10 new messages in 5 topics - digest

2009-07-19 Thread Oleg Drokin
Hello! On Jul 17, 2009, at 2:01 PM, Ettore Enrico Delfino Ligorio wrote: In my experince, the integration between most recent kernels with glusterfs and patches of Xen hypervisor works well. The same with Lustre is harder to do. Works out of the box for me both with rhel5 kernels (that

Re: [Lustre-discuss] NFS export problem

2009-05-15 Thread Oleg Drokin
Hello! On May 15, 2009, at 7:39 AM, Ralf Utermann wrote: so now I am sure to have libcfs-* enabled modules (probably the Debian packages also had it, it's not disabled in the configure call) and did this test again, however I still do not get any debug lines after accessing the NFS

Re: [Lustre-discuss] NFS export problem

2009-05-14 Thread Oleg Drokin
Hello! On May 14, 2009, at 4:05 AM, Ralf Utermann wrote: Hm, that's really strange. I hope you did not build your Lustre with --disable-libcfs-* configure options? How can I check this? The modules have been built with Debian utilities (m-a build ...) I suppose you can take a look at the

Re: [Lustre-discuss] NFS export problem

2009-05-13 Thread Oleg Drokin
Hello! On May 13, 2009, at 7:53 AM, Ralf Utermann wrote: What might be useful is if you can reproduce this quickly on as few Lustre nodes as possible. Remember your current /proc/sys/lnet/debug value. On the lustre-client/nfs-server and on the MDS, echo -1 > /proc/sys/lnet/debug, then do lctl
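Spelled out, the procedure looks like this (log path arbitrary):

  cat /proc/sys/lnet/debug            # note the current debug mask
  echo -1 > /proc/sys/lnet/debug      # enable all debug flags
  # ... access the NFS export to reproduce, then on each node:
  lctl dk > /tmp/lustre-nfs-debug.log
  # finally restore the mask noted above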

Re: [Lustre-discuss] NFS export problem

2009-05-13 Thread Oleg Drokin
Hello! On May 13, 2009, at 10:48 AM, Ralf Utermann wrote: Oleg Drokin wrote: [...] Either Lustre never got any control at all and your problem is unrelated to lustre and related to something else in your system or the logging is broken somewhat. The way to test it is to do ls -la /mnt

Re: [Lustre-discuss] mmap()

2009-05-13 Thread Oleg Drokin
Hello! On May 13, 2009, at 8:35 AM, Mag Gam wrote: I have an application which I would like to use Lustre as the backing storage. However, the application (MonetDB) uses mmap(). Would the application have any problems if using Lustre as its backing storage? There should be no problems in

Re: [Lustre-discuss] Lustre 1.8.0.50 + Xen + kernel 2.6.22.17

2009-05-11 Thread Oleg Drokin
Hello! On Apr 20, 2009, at 6:04 PM, Lukas Hejtmanek wrote: On Mon, Apr 20, 2009 at 02:42:40PM -0600, Andreas Dilger wrote: The core looks like this: #1 0x2b7d5ff825a2 in DumpModeDecode (tif=0x58cdd0, buf=0xf7f7f7f5f5f5f6f6 Address 0xf7f7f7f5f5f5f6f6 out of bounds, cc=76800, s=2016)

Re: [Lustre-discuss] Solid State MDT

2009-04-13 Thread Oleg Drokin
Hello! On Apr 13, 2009, at 12:55 PM, Jim Garlick wrote: 2) some quick tests of MDS create rates (through lustre now) on the SSD and DDN hardware where we seemed to get about 2350 creates/sec no matter what hardware we used, and posts from Oleg on this mailing list indicating that

Re: [Lustre-discuss] WARNING: Potential directory corruptions on the MDS with 1.6.7

2009-04-06 Thread Oleg Drokin
Hello! On Apr 6, 2009, at 10:41 AM, Peter Kjellstrom wrote: On Friday 03 April 2009, Thomas Wakefield wrote: Any idea on the timeline for 1.6.7.1 ? Will it be out today, or just sometime soon? Knowing if it's hours away or awaiting a complete qa-cycle would be nice. That would decide if

Re: [Lustre-discuss] OSS Cache Size for read optimization

2009-04-03 Thread Oleg Drokin
Yes, it is for dirty cache limiting on a per-osc basis. There is also /proc/fs/lustre/llite/*/max_cached_mb that regulates how much cached data per client you can have. (default is 3/4 of RAM) On Apr 3, 2009, at 2:52 PM, Lundgren, Andrew wrote: The parameter is called dirty, is that write
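Both knobs side by side, for reference (values illustrative):

  lctl set_param osc.*.max_dirty_mb=32         # per-OSC limit on dirty (write) cache
  lctl set_param llite.*.max_cached_mb=1024    # per-client limit on total cached data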

Re: [Lustre-discuss] LustreError: lock callback timer expired after

2009-03-30 Thread Oleg Drokin
Hello! On Mar 30, 2009, at 7:06 AM, Simon Latapie wrote: I currently have a lustre system with 1 MDS, 2 OSS with 2 OSTs each, and 37 lustre clients (1 login and 36 compute nodes), all using infiniband as lustre network (o2ib). All nodes are on 1.6.5.1 patched kernel. For the past two

Re: [Lustre-discuss] Stupid Question Time: flock

2009-03-24 Thread Oleg Drokin
Hello! On Mar 24, 2009, at 3:19 PM, Jay Christopherson wrote: If I have 5 clients, two of which are running an app which requires fcntl style file locking, do I need to mount lustre with -o flock on all five clients, or just the two that are using fcntl? Just two would be fine. What

Re: [Lustre-discuss] Group Lock in Lustre: write call is blocking

2009-03-18 Thread Oleg Drokin
Hello! On Mar 16, 2009, at 5:41 AM, pascal.dev...@bull.net wrote: Could anyone tell me if I made a mistake, if Lustre does not support the group lock or if it is a bug in Lustre ? Thank you for bringing this to our attention. Please file a bug. This is a bug in lustre introduced by lockless

Re: [Lustre-discuss] lock timeouts and OST evictions on 1.4 server - 1.6 client system.

2009-02-10 Thread Oleg Drokin
Hello! On Feb 10, 2009, at 12:11 PM, Simon Kelley wrote: We are also seeing some userspace file operations fail with the error No locks available. These don't generate any logging on the client so I don't have exact timing. It's possible that they are associated with further ###

Re: [Lustre-discuss] lock timeouts and OST evictions on 1.4 server - 1.6 client system.

2009-02-10 Thread Oleg Drokin
Hello! On Feb 10, 2009, at 12:46 PM, Simon Kelley wrote: If, by the complete event you mean the received cancel for unknown cookie, there's not much more to tell. Grepping through the last month's server logs shows that there are bursts of typically between 3 and 7 messages, at the same

Re: [Lustre-discuss] How to change inode capacity

2009-01-30 Thread Oleg Drokin
Hello! On Jan 29, 2009, at 9:58 PM, Satoshi Isono wrote: http://wiki.lustre.org/index.php?title=Lustre_FAQ * What is the maximum number of files in a single file system? In a single directory? So, if we use the current Lustre 1.6.x on EXT3, we can only support a single MDT. Then, according

Re: [Lustre-discuss] simultaneous export of lustre fs via NFS and Samba

2009-01-25 Thread Oleg Drokin
Hello! On Jan 24, 2009, at 9:08 PM, Craig Prescott wrote: * any problem (from Lustre's perspective) to run the NFS server and Samba server from the same client? No. * on the NFS/Samba server host, shoud I mount with certain options, such as -oflock? If you mount with -o flock and plan

Re: [Lustre-discuss] OSS Service Thread Count

2009-01-25 Thread Oleg Drokin
Hello! On Jan 25, 2009, at 6:56 PM, Wojciech Turek wrote: In my particular case it gives 512 ost_num_threads, which is the Lustre maximum for this parameter. The manual says that each thread actually uses 1.5MB of RAM, so 768MB of RAM will be consumed on each of my OSSs for

Re: [Lustre-discuss] kernel panic with 1.6.5rc2 on mds

2008-05-18 Thread Oleg Drokin
Hello! On May 16, 2008, at 6:45 AM, Patrick Winnertz wrote: As I wrote in #11742 [1], I experienced a kernel panic after doing heavy I/O on the 1.6.5rc2 cluster, on the mds. Since nobody has answered this bug until now (and I think in other cases the lustre team is _really_ fast (thanks for
