On Mar 21, 2008  19:15 +0100, Dilling wrote:
> some days ago one of my users started a lot of MATLAB jobs, flooding all
> processors on our 40-node, 2-CPU cluster (4Core System). More details and
> log files can be found in the appendix. As a result of this I observed
> strange behavior from Lustre. Ptlrpcd used 100% of one CPU, and the second
> CPU was completely occupied by pwd. Pwd was a child of the MATLAB process
> invoked by the user. I/O on Lustre was partly possible, but df reported
> access denied. A recovery with the MDT started after lustre.timeout=300 but
> did not complete. I had to reboot all nodes which showed this behavior. The
> OST showed the message:
>
> Mar 14 16:47:05 cn46 kernel: LustreError: 138-a: lustre-OST0003: A client
> on nid [EMAIL PROTECTED] was evicted due to a lock glimpse callback to
> [EMAIL PROTECTED] timed out: rc -110
>
> The client kernels reported soft lockup on all available cores.
>
> Does anyone have an idea how to prevent such behavior? Thanks for your help.

You missed an important detail right at the beginning of your woes:

    ll_sai_entry_set()) ASSERTION(entry->se_stat == SA_ENTRY_UNSTATED) failed

This is a bug in the "statahead" code. Statahead is a new feature which
detects apps doing "readdir + sequential stat" operations on a directory
and starts multiple concurrent metadata RPCs in order to hide the network
latency of the serialized "stat" operations.

This is known bug 15175 in our bugzilla and is being worked on. You can
disable statahead on the clients until this is resolved:

    echo 0 > /proc/fs/lustre/llite/*/statahead_count

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
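
The statahead workaround above redirects through a shell glob. On a client with more than one Lustre filesystem mounted, the pattern matches several files, and some shells reject a redirection whose target expands to more than one name. A minimal sketch of a loop that writes each file separately follows. The function name disable_statahead and its optional directory argument are hypothetical and added only for illustration; the default path and the statahead_count file name are taken from the message above.

```shell
#!/bin/sh
# Sketch only: write 0 to every statahead_count file under an llite directory.
# disable_statahead and its argument are illustrative names (assumptions),
# not part of Lustre. The default path comes from the message above.

disable_statahead() {
    llite_dir="${1:-/proc/fs/lustre/llite}"
    changed=0
    for f in "$llite_dir"/*/statahead_count; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -e "$f" ] || continue
        if echo 0 > "$f"; then
            changed=$((changed + 1))
        else
            echo "could not write $f" >&2
        fi
    done
    # Print how many files were updated.
    echo "$changed"
}
```

Run as root on each client, calling disable_statahead with no argument would touch the default /proc path and print the number of files it updated; a result of 0 suggests that no matching files were found or that none could be written.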