Re: [Lustre-discuss] MGS disk size and activity

2009-10-09 Thread Wojciech Turek
Hi,

I am very interested in finding out how to move a co-located MGS to a
separate disk. I will be moving my MDTs to new hardware soon and I would
like to separate the MGS from the MDT. I would be grateful for some info on
this subject, please.
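
(For reference, a rough sketch of what separating a co-located MGS usually
involves on 1.6/1.8 using the standard tools; device names, mount points and
the new MGS NID are placeholders, and the exact tunefs.lustre options should
be verified against the manual for your release:)

  # 1. Format the new dedicated MGS device (on the new MGS node)
  mkfs.lustre --mgs /dev/new_mgs_dev

  # 2. With the filesystem stopped, mount both devices as ldiskfs and copy
  #    the configuration logs from the old combined MGS/MDT to the new MGS
  mount -t ldiskfs /dev/old_mdt_dev /mnt/mdt
  mount -t ldiskfs /dev/new_mgs_dev /mnt/mgs
  cp -a /mnt/mdt/CONFIGS/* /mnt/mgs/CONFIGS/
  umount /mnt/mdt /mnt/mgs

  # 3. Tell the MDT it no longer hosts the MGS, point it at the new MGS,
  #    and regenerate the config logs on the next mount
  tunefs.lustre --nomgs --mgsnode=<new_mgs_nid> --writeconf /dev/old_mdt_dev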

Many thanks,

Wojciech

2008/6/23 Andreas Dilger 

> On Jun 17, 2008  12:40 -0700, Klaus Steden wrote:
> > I have a question ... if the MGS is used so infrequently relative to the
> use
> > of the MDS, why is it (is it?) problematic to locate it on the same
> volume
> > as the MDT?
>
> If you have multiple MDTs on the same MDS node (i.e. multiple Lustre
> filesystems) then it is difficult to start up the MGS separately from
> the MDT if it is co-located with one of the MDTs.  It isn't impossible
> (with some manual mounting of the underlying filesystems) to move a
> co-located MGS to a separate filesystem if needed.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>



-- 
--
Wojciech Turek

Assistant System Manager

High Performance Computing Service
University of Cambridge
Email: wj...@cam.ac.uk
Tel: (+)44 1223 763517


Re: [Lustre-discuss] Client complaining about duplicate inode entry after luster recovery

2009-10-09 Thread Wojciech Turek
Hi,

Did you get to the bottom of this?

We are having exactly the same problem with our lustre-1.6.6 (rhel4) file
systems. Recently it got worse and the MDS crashes quite frequently; when we
run e2fsck there are errors that get fixed, but after some time we are still
seeing the same errors in the logs about missing objects, and files get
corrupted (?---). Clients also LBUG quite frequently with this message:
(osc_request.c:2904:osc_set_data_with_check()) LBUG
This looks like a serious Lustre problem, but so far I haven't found any
clues even after a long search through the Lustre bugzilla.

Our MDSs and OSSs are on UPS, the RAID is behaving OK, and we don't see any
errors in the syslog.
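
(For reference, the usual full consistency check on 1.6 when e2fsck alone
keeps "fixing" things is the mdsdb/ostdb + lfsck pass; a hedged sketch, with
device names, database paths and the client mount point as placeholders:)

  # 1. On the MDS, with the MDT unmounted, build the MDS database
  e2fsck -n -v --mdsdb /tmp/mdsdb /dev/mdt_dev

  # 2. On each OSS, with the OST unmounted, build an OST database
  #    (the mdsdb file has to be copied to each OSS, then to the client)
  e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb-0 /dev/ost_dev

  # 3. On a client with the filesystem mounted, run lfsck read-only first,
  #    then with the fix options only if the report looks sane
  lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb-0 /mnt/lustre
  lfsck -l -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb-0 /mnt/lustre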

I will be grateful for some hints on this one

Wojciech

2009/8/24 rishi pathak 

> Hi,
>
> Our Lustre fs comprises 15 OSTs/OSSs and 1 MDS with no failover. Clients as
> well as servers run lustre-1.6 and kernel 2.6.9-18.
>
> Doing an ls -ltr on a directory in the Lustre fs throws the following errors
> (as taken from the Lustre logs) on the client
>
> 0008:0002:0:1251099455.304622:0:724:0:(osc_request.c:2898:osc_set_data_with_check())
> ### inconsistent l_ast_data found ns: scratch-OST0005-osc-81201e8dd800
> lock: 811f9af04
> 000/0xec0d1c36da6992fd lrc: 3/1,0 mode: PR/PR res: 570622/0 rrc: 2 type:
> EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 10
> remote: 0xb79b445e381bc9e6 expref: -99 p
> id: 22878
> 0008:0004:0:1251099455.337868:0:724:0:(osc_request.c:2904:osc_set_data_with_check())
> ASSERTION(old_inode->i_state & I_FREEING) failed:Found existing inode
> 811f2cf693b8/1972725
> 44/1895600178 state 0 in lock: setting data to
> 8118ef8ed5f8/207519777/1771835328
> :0004:0:1251099455.360090:0:724:0:(osc_request.c:2904:osc_set_data_with_check())
> LBUG
>
>
> On scratch-OST0005 OST it shows
>
> Aug 24 10:22:53 yn266 kernel: LustreError:
> 3023:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for resour
> ce 569204: rc -2
> Aug 24 10:22:53 yn266 kernel: LustreError:
> 3023:0:(ldlm_resource.c:851:ldlm_resource_add()) Skipped 19 previous similar
>  messages
> Aug 24 12:40:43 yn266 kernel: LustreError:
> 2737:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for resour
> ce 569195: rc -2
> Aug 24 12:44:59 yn266 kernel: LustreError:
> 2835:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for resour
> ce 569198: rc -2
>
> These kind of errors we are getting for many clients.
>
> ## History ##
> Prior to these occurrences, our MDS showed signs of failure in that the CPU
> load was shooting above 100 (on a quad-core, quad-socket system) and users
> were complaining about slow storage performance. We took it offline and ran
> fsck on the unmounted MDS and OSTs. fsck on the OSTs went fine, but it
> showed some errors, which were fixed. For a data integrity check, the mdsdb
> and ostdb databases were built and lfsck was run on a client (the client was
> mounted with abort_recov).
>
> lfsck was run in the following order:
> lfsck with no fix - reported dangling inodes and orphaned objects
> lfsck with -l (backup orphaned objects)
> lfsck with -d and -c (delete orphaned objects and create missing OST
> objects referenced by MDS)
>
> After the above operations, on clients we were seeing files shown in red and
> blinking. Doing a stat on them came back with the error 'no such file or
> directory'.
>
> My question is whether the order in which lfsck was run (and whether lfsck
> should be run multiple times) is related to the errors we are getting.
>
>
>
>
> --
> Regards--
> Rishi Pathak
> National PARAM Supercomputing Facility
> Center for Development of Advanced Computing(C-DAC)
> Pune University Campus,Ganesh Khind Road
> Pune-Maharastra
>


-- 
--
Wojciech Turek

Assistant System Manager

High Performance Computing Service
University of Cambridge
Email: wj...@cam.ac.uk
Tel: (+)44 1223 763517


Re: [Lustre-discuss] Is there a way to set lru_size and have it stick?

2009-10-09 Thread Andreas Dilger
On 8-Oct-09, at 22:28, Lundgren, Andrew wrote:
> Is there a way to set the lru_size to a fixed value and have it stay  
> that way across mounts?
>
> I know it can be set using:
> $ lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))
> But that isn’t retained across a reboot.


"lctl set_param" is only for temporary tunable setting.  You can use
"lctl conf_param" to set a permanent tunable.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



Re: [Lustre-discuss] Is there a way to set lru_size and have it stick?

2009-10-09 Thread Bernd Schubert
On Friday 09 October 2009, Lundgren, Andrew wrote:
> Is there a way to set the lru_size to a fixed value and have it stay that
>  way across mounts?
> 
> I know it can be set using:
> $ lctl set_param ldlm.namespaces.*osc*.lru_size=$((NR_CPU*100))
> But that isn't retained across a reboot.

Even worse, if for some reason the connection to the OSTs gets lost (e.g. due
to evictions), it will also reset to the default. For now we are compiling
our packages with the LRU disabled.

-- 
Bernd Schubert
DataDirect Networks


Re: [Lustre-discuss] MDS overload, why?

2009-10-09 Thread Arne Brutschy
Hi,

OK, thanks then. I was hoping for something like "doing this or that
operation in abundance will cause this sort of behavior". :)

Well then, I will add a serial console and the Lustre monitoring. I will
then try to send the MDS into overload, as we have by now managed to
isolate the user who is apparently causing it.

Thanks for now and have a nice weekend
Arne


On Fr, 2009-10-09 at 10:23 -0400, Brian J. Murrell wrote:
> On Fri, 2009-10-09 at 16:15 +0200, Arne Brutschy wrote:
> > Hi,
> 
> Hi,
> 
> > thanks for replying!
> 
> NP.
> 
> > I understand that without further information we can't do much about the
> > oopses.
> 
> Not just the oopses.  It's pretty hard to diagnose any problem without
> data.
> 
> > I was more hoping for some information regarding possible
> > sources of such an overload.
> 
> The possibilities are endless.  Lustre is a complex system with many,
> many interactions with many other components.  Sure, I could make
> stab-in-the-dark guesses like too little memory, too slow disk, etc., but
> none of those are actually useful for anything more than a shopping trip
> and even then new hardware might still not solve your problem.
> 
> > How can I find the source of the problem?
> 
> Examine the logs.  Lustre logs a lot of information when things are
> going wrong.
> 
> b.
> 
-- 
Arne Brutschy
Ph.D. StudentEmailarne.brutschy(AT)ulb.ac.be
IRIDIA CP 194/6  Web  iridia.ulb.ac.be/~abrutschy
Universite' Libre de Bruxelles   Tel  +32 2 650 3168
Avenue Franklin Roosevelt 50 Fax  +32 2 650 2715
1050 Bruxelles, Belgium  (Fax at IRIDIA secretary)



[Lustre-discuss] e2fsck: undefined symbol: ext2_attr_index_prefi

2009-10-09 Thread Ken Hornstein
I have Lustre 1.8.1 running on a bunch of SLES 11/x86_64 systems.  I'm
using the stock binaries from www.sun.com.  Everything is fine ... except
that some of the e2fsprogs utilities are unhappy.  Specifically, if I try
to run e2fsck, I get:

# e2fsck /dev/sdb
e2fsck: symbol lookup error: e2fsck: undefined symbol: ext2_attr_index_prefix

I have, of course, the latest e2fsprogs that were released with 1.8.1:

# rpm -q -a | grep e2fsprogs
e2fsprogs-1.41.6.sun1-0suse

(Occasionally tunefs.lustre complains about a missing symbol as well, but
that one has "mmp" in the name.  It doesn't always happen.)

What am I doing wrong?  I was not involved with the installation of the
SLES 11 system, but I was under the impression it was pretty vanilla.
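
(One thing worth checking: an undefined symbol at runtime usually means e2fsck
is resolving a different libext2fs than the Sun-patched one it was built
against. A hedged way to see which library the dynamic linker picks up; the
library path below is just an example taken from typical ldd output:)

  # which shared libraries does e2fsck actually load?
  ldd $(which e2fsck) | grep -i ext2fs

  # which package owns the library it resolved to?
  rpm -qf /lib64/libext2fs.so.2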

--Ken


Re: [Lustre-discuss] MDS overload, why?

2009-10-09 Thread Michael Kluge
LMT (http://code.google.com/p/lmt) might be able to give some hints on
whether users are using the FS in a 'wild' fashion. For the question "what
can cause this behaviour of my MDS" I guess the answer is: a million
things ;) There is no way of being more specific without more input about
the problem itself.

Michael

Am Freitag, den 09.10.2009, 16:15 +0200 schrieb Arne Brutschy:
> Hi,
> 
> thanks for replying!
> 
> I understand that without further information we can't do much about the
> oopses. I was more hoping for some information regarding possible
> sources of such an overload. Is it normal that an MDS gets overloaded
> like this, while the OSTs have nothing to do, and what can I do about
> it? How can I find the source of the problem?
> 
> More specifically, what are the operations that lead to a lot of MDS
> load and none for the OSTs? Although our MDS (8GB ram, 2x4core, SATA) is
> not a top-notch server, it's fairly recent and I feel the load we're
> experiencing is not handleable by a single MDS.
> 
> My problem is that I can't make out major problems in the users' jobs
> running on the cluster, and I can't quantify or track down the problem
> because I don't know what behavior might have caused it.
> 
> As I said, oopses appeared only twice, and all other problems were only
> apparent from a non-responsive MDS.
> 
> Thanks,
> Arne
> 
> 
> On Fr, 2009-10-09 at 07:44 -0400, Brian J. Murrell wrote:
> > On Fri, 2009-10-09 at 10:26 +0200, Arne Brutschy wrote:
> > > 
> > > The clients showed the following error:
> > > > Oct  8 09:58:55 majorana kernel: LustreError: 
> > > > 3787:0:(events.c:66:request_out_callback()) @@@ type 4, status -5  
> > > > r...@f6222800 x8702488/t0 o250->m...@10.255.255.206@tcp:26/25 lens 
> > > > 304/456 e 0 to 1 dl 1254988740 ref 2 fl Rpc:N/0/0 rc 0/0
> > > > Oct  8 09:58:55 majorana kernel: LustreError: 
> > > > 3787:0:(events.c:66:request_out_callback()) Skipped 33 previous similar 
> > > > messages
> > > 
> > > So, my question is: what could cause such a load? The cluster was not
> > > excessively used... Is this a bug or a user's job that creates the load?
> > > How can I protect lustre against this kind of failure?
> > 
> > Without any more information we could not possibly know.  If you really
> > are getting oopses then you will need console logs (i.e. serial console)
> > so that we can see the stack trace.
> > 
> > b.
> > 
-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih




Re: [Lustre-discuss] MDS overload, why?

2009-10-09 Thread Brian J. Murrell
On Fri, 2009-10-09 at 16:15 +0200, Arne Brutschy wrote:
> Hi,

Hi,

> thanks for replying!

NP.

> I understand that without further information we can't do much about the
> oopses.

Not just the oopses.  It's pretty hard to diagnose any problem without
data.

> I was more hoping for some information regarding possible
> sources of such an overload.

The possibilities are endless.  Lustre is a complex system with many,
many interactions with many other components.  Sure, I could make
stab-in-the-dark guesses like too little memory, too slow disk, etc., but
none of those are actually useful for anything more than a shopping trip
and even then new hardware might still not solve your problem.

> How can I find the source of the problem?

Examine the logs.  Lustre logs a lot of information when things are
going wrong.
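
(For the archive, a minimal sketch of where to start looking on a struggling
MDS; the paths are the usual defaults and the debug flag line is an optional
extra to verify against your release:)

  # messages Lustre already pushed through syslog
  grep -i lustre /var/log/messages | tail -200

  # dump the in-memory Lustre debug buffer for closer inspection
  lctl dk /tmp/lustre-debug.log

  # optionally widen the debug mask first (more detail, more overhead)
  lctl set_param debug=+rpctrace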

b.





Re: [Lustre-discuss] OST retirement

2009-10-09 Thread Arne Wiebalck

Hi Ramiro,

I did the retirement on a test setup, just to familiarize myself
with the procedure of draining and retiring/replacing OSSs.
The test installation is a Lustre 1.8.0.1 setup with 40 active and
12 inactive OSTs ;-)
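
(In case it is useful to others reading the archive, a hedged sketch of the
draining step as it is commonly done on 1.6/1.8; "testfs", the OST index, the
device number and the mount point are placeholders, and the lfs options
should be checked against your client version:)

  # 1. On the MDS only, stop new object allocations on the OST, but keep it
  #    active on the clients so its existing data can still be read
  lctl dl                          # note the device number of testfs-OST000a-osc
  lctl --device <devno> deactivate

  # 2. On a client, find files with objects on that OST and rewrite them so
  #    their data lands on the remaining OSTs
  lfs find --obd testfs-OST000a_UUID /mnt/testfs > /tmp/ost000a-files
  while read f; do
      cp -a "$f" "$f.tmp" && mv "$f.tmp" "$f"
  done < /tmp/ost000a-files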

Cheers,
 Arne




Ramiro Alba Queipo wrote:

Arne,

This is the same question I asked two months ago, with no answer. What
kind of installation do you have?

Cheers

On Wed, 2009-10-07 at 15:21 +0200, Arne Wiebalck wrote:

Dear list,

is there a way to see the difference whether I deactivated
an OST filesystem-wide by

lctl --device 15 conf_param pps-OST000a.osc.active=0

   or only locally by

lctl set_param osc.pps-OST000a-osc.active=0?

And: After deactivation, I see the OSTs still on the device
list (as inactive): is there a way to completely remove them,
so that new clients would have no idea they ever existed (and
the uuids get eventually reused)?

TIA,
  Arne





Re: [Lustre-discuss] MDS overload, why?

2009-10-09 Thread Arne Brutschy
Hi,

thanks for replying!

I understand that without further information we can't do much about the
oopses. I was more hoping for some information regarding possible
sources of such an overload. Is it normal that an MDS gets overloaded
like this, while the OSTs have nothing to do, and what can I do about
it? How can I find the source of the problem?

More specifically, what are the operations that lead to a lot of MDS
load and none for the OSTs? Although our MDS (8GB ram, 2x4core, SATA) is
not a top-notch server, it's fairly recent and I feel the load we're
experiencing is not handleable by a single MDS.

My problem is that I can't make out major problems in the users' jobs
running on the cluster, and I can't quantify or track down the problem
because I don't know what behavior might have caused it.

As I said, oopses appeared only twice, and all other problems were only
apparent from a non-responsive MDS.

Thanks,
Arne


On Fr, 2009-10-09 at 07:44 -0400, Brian J. Murrell wrote:
> On Fri, 2009-10-09 at 10:26 +0200, Arne Brutschy wrote:
> > 
> > The clients showed the following error:
> > > Oct  8 09:58:55 majorana kernel: LustreError: 
> > > 3787:0:(events.c:66:request_out_callback()) @@@ type 4, status -5  
> > > r...@f6222800 x8702488/t0 o250->m...@10.255.255.206@tcp:26/25 lens 
> > > 304/456 e 0 to 1 dl 1254988740 ref 2 fl Rpc:N/0/0 rc 0/0
> > > Oct  8 09:58:55 majorana kernel: LustreError: 
> > > 3787:0:(events.c:66:request_out_callback()) Skipped 33 previous similar 
> > > messages
> > 
> > So, my question is: what could cause such a load? The cluster was not
> > excessively used... Is this a bug or a user's job that creates the load?
> > How can I protect lustre against this kind of failure?
> 
> Without any more information we could not possibly know.  If you really
> are getting oopses then you will need console logs (i.e. serial console)
> so that we can see the stack trace.
> 
> b.
> 
-- 
Arne Brutschy
Ph.D. StudentEmailarne.brutschy(AT)ulb.ac.be
IRIDIA CP 194/6  Web  iridia.ulb.ac.be/~abrutschy
Universite' Libre de Bruxelles   Tel  +32 2 650 3168
Avenue Franklin Roosevelt 50 Fax  +32 2 650 2715
1050 Bruxelles, Belgium  (Fax at IRIDIA secretary)



Re: [Lustre-discuss] MDS overload, why?

2009-10-09 Thread Michael Kluge
Hmm. That should be enough. I guess you need to set up a loghost for syslog
then, and a reliable serial console to get stack traces. Everything else
would be just a wild guess (as the question about the RAM size was).
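
(A minimal sketch of both pieces for a RHEL4/CentOS-era node with sysklogd
and grub; the loghost name, baud rate and device paths are placeholders:)

  # /etc/syslog.conf on the MDS -- forward everything to a central loghost
  *.*    @loghost.example.com

  # grub kernel line on the MDS -- mirror console output to the first serial
  # port so oopses can be captured by a console server
  kernel /vmlinuz-<version> ro root=<rootdev> console=tty0 console=ttyS0,115200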

Michael

> Hi,
> 
> 8GB of ram, 2x 4core Intel Xeon E5410 @ 2.33GHz
> 
> Arne
> 
> On Fr, 2009-10-09 at 12:16 +0200, Michael Kluge wrote:
> > Hi Arne,
> > 
> > Could be memory pressure and the OOM killer running and shooting at
> > things. How much memory does your server have?
> > 
> > 
> > Michael
> > 
> > Am Freitag, den 09.10.2009, 10:26 +0200 schrieb Arne Brutschy:
> > > Hi everyone,
> > > 
> > > 2 months ago, we switched our ~80 node cluster from NFS to lustre. 1
> > > MDS, 4 OSTs, lustre 1.6.7.2 on a rocks 4.2.1/centos 4.2/linux
> > > 2.6.9-78.0.22.
> > > 
> > > We were quite happy with lustre's performance, especially because
> > > bottlenecks caused by /home disk access were history.
> > > 
> > > Saturday, the cluster went down (= was inaccessible). After some
> > > investigation I found out that the reason seems to be an overloaded MDS.
> > > Over the following 4 days, this happened multiple times and could only
> > > be resolved by 1) killing all user jobs and 2) hard-resetting the MDS.
> > > 
> > > The MDS did not respond to any command; when I managed to get a video
> > > signal (not often), the load was >170. Additionally, a kernel oops was
> > > displayed twice, but unfortunately I have no record of them.
> > > 
> > > The clients showed the following error:
> > > > Oct  8 09:58:55 majorana kernel: LustreError: 
> > > > 3787:0:(events.c:66:request_out_callback()) @@@ type 4, status -5  
> > > > r...@f6222800 x8702488/t0 o250->m...@10.255.255.206@tcp:26/25 lens 
> > > > 304/456 e 0 to 1 dl 1254988740 ref 2 fl Rpc:N/0/0 rc 0/0
> > > > Oct  8 09:58:55 majorana kernel: LustreError: 
> > > > 3787:0:(events.c:66:request_out_callback()) Skipped 33 previous similar 
> > > > messages
> > > 
> > > So, my question is: what could cause such a load? The cluster was not
> > > excessively used... Is this a bug or a user's job that creates the load?
> > > How can I protect lustre against this kind of failure?
> > > 
> > > Thanks in advance,
> > > Arne 
> > > 
-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih




Re: [Lustre-discuss] MDS overload, why?

2009-10-09 Thread Arne Brutschy
Hi,

8GB of ram, 2x 4core Intel Xeon E5410 @ 2.33GHz

Arne

On Fr, 2009-10-09 at 12:16 +0200, Michael Kluge wrote:
> Hi Arne,
> 
> Could be memory pressure and the OOM killer running and shooting at things.
> How much memory does your server have?
> 
> 
> Michael
> 
> Am Freitag, den 09.10.2009, 10:26 +0200 schrieb Arne Brutschy:
> > Hi everyone,
> > 
> > 2 months ago, we switched our ~80 node cluster from NFS to lustre. 1
> > MDS, 4 OSTs, lustre 1.6.7.2 on a rocks 4.2.1/centos 4.2/linux
> > 2.6.9-78.0.22.
> > 
> > We were quite happy with lustre's performance, especially because
> > bottlenecks caused by /home disk access were history.
> > 
> > Saturday, the cluster went down (= was inaccessible). After some
> > investigation I found out that the reason seems to be an overloaded MDS.
> > Over the following 4 days, this happened multiple times and could only
> > be resolved by 1) killing all user jobs and 2) hard-resetting the MDS.
> > 
> > The MDS did not respond to any command; when I managed to get a video
> > signal (not often), the load was >170. Additionally, a kernel oops was
> > displayed twice, but unfortunately I have no record of them.
> > 
> > The clients showed the following error:
> > > Oct  8 09:58:55 majorana kernel: LustreError: 
> > > 3787:0:(events.c:66:request_out_callback()) @@@ type 4, status -5  
> > > r...@f6222800 x8702488/t0 o250->m...@10.255.255.206@tcp:26/25 lens 
> > > 304/456 e 0 to 1 dl 1254988740 ref 2 fl Rpc:N/0/0 rc 0/0
> > > Oct  8 09:58:55 majorana kernel: LustreError: 
> > > 3787:0:(events.c:66:request_out_callback()) Skipped 33 previous similar 
> > > messages
> > 
> > So, my question is: what could cause such a load? The cluster was not
> > excessively used... Is this a bug or a user's job that creates the load?
> > How can I protect lustre against this kind of failure?
> > 
> > Thanks in advance,
> > Arne 
> > 
-- 
Arne Brutschy
Ph.D. StudentEmailarne.brutschy(AT)ulb.ac.be
IRIDIA CP 194/6  Web  iridia.ulb.ac.be/~abrutschy
Universite' Libre de Bruxelles   Tel  +32 2 650 3168
Avenue Franklin Roosevelt 50 Fax  +32 2 650 2715
1050 Bruxelles, Belgium  (Fax at IRIDIA secretary)



Re: [Lustre-discuss] OST retirement

2009-10-09 Thread Ramiro Alba Queipo
Arne,

This is the same question I asked two months ago, with no answer. What
kind of installation do you have?

Cheers

On Wed, 2009-10-07 at 15:21 +0200, Arne Wiebalck wrote:
> Dear list,
> 
> is there a way to see the difference whether I deactivated
> an OST filesystem-wide by
> 
> lctl --device 15 conf_param pps-OST000a.osc.active=0
> 
>    or only locally by
> 
> lctl set_param osc.pps-OST000a-osc.active=0?
> 
> And: After deactivation, I see the OSTs still on the device
> list (as inactive): is there a way to completely remove them,
> so that new clients would have no idea they ever existed (and
> the uuids get eventually reused)?
> 
> TIA,
>   Arne
> 
-- 
Ramiro Alba

Centre Tecnològic de Tranferència de Calor
http://www.cttc.upc.edu


Escola Tècnica Superior d'Enginyeries
Industrial i Aeronàutica de Terrassa
Colom 11, E-08222, Terrassa, Barcelona, Spain
Tel: (+34) 93 739 86 46





Re: [Lustre-discuss] 1.8.0 Losing connection to the MDT for several minutes and then recovering.

2009-10-09 Thread Brian J. Murrell
On Fri, 2009-10-09 at 11:25 +0100, Christopher J.Walker wrote:
> 
> I've just seen it with a 1.8.1 server (and 1.6 clients):
> 
> Oct  4 18:21:30 sn01 kernel: Pid: 831, comm: ll_mdt_39 Tainted: G 
> 2.6.18-128.1.14.el5_lustre.1.8.1 #1
> Oct  4 18:21:30 sn01 kernel: RIP: 0010:[] 
> [] :obdclass:lustre_hash_for_each_empty+0x220/0x2b0

This is bug 19557.

b.





Re: [Lustre-discuss] MDS overload, why?

2009-10-09 Thread Brian J. Murrell
On Fri, 2009-10-09 at 10:26 +0200, Arne Brutschy wrote:
> 
> The clients showed the following error:
> > Oct  8 09:58:55 majorana kernel: LustreError: 
> > 3787:0:(events.c:66:request_out_callback()) @@@ type 4, status -5  
> > r...@f6222800 x8702488/t0 o250->m...@10.255.255.206@tcp:26/25 lens 304/456 
> > e 0 to 1 dl 1254988740 ref 2 fl Rpc:N/0/0 rc 0/0
> > Oct  8 09:58:55 majorana kernel: LustreError: 
> > 3787:0:(events.c:66:request_out_callback()) Skipped 33 previous similar 
> > messages
> 
> So, my question is: what could cause such a load? The cluster was not
> excessively used... Is this a bug or a user's job that creates the load?
> How can I protect lustre against this kind of failure?

Without any more information we could not possibly know.  If you really
are getting oopses then you will need console logs (i.e. serial console)
so that we can see the stack trace.

b.





Re: [Lustre-discuss] 1.8.0 Losing connection to the MDT for several minutes and then recovering.

2009-10-09 Thread Christopher J.Walker
Lundgren, Andrew wrote:
> I haven’t done anything yet.  The machines seem to reconnect without 
> intervention.   But the problem occurs again later.
> 
>  
> 
> *From:* Đào Thị Thảo [mailto:tha...@isds.vn]
> *Sent:* Tuesday, October 06, 2009 9:43 PM
> *To:* Lundgren, Andrew
> *Cc:* lustre-discuss
> *Subject:* Re: [Lustre-discuss] 1.8.0 Loosing connection to the MDT for 
> several minutes and then recovering.
> 
>  
> 
> hi,
> I have the same problem as Lundgren.
> I don't understand why it happens, while my network is still stable.
> The problem keeps repeating. Sometimes a client can't get its connection
> restored to a service (OSTs or MDS). My provisional measure is rebooting
> the node. Andrew, can you explain in more detail and guide me on how to fix it?
> 
> On Wed, Oct 7, 2009 at 12:20 AM, Lundgren, Andrew
> <andrew.lundg...@level3.com> wrote:
> 
> Oh man, that should have read LOSING! 
> 
>  
> 
> *From:* lustre-discuss-boun...@lists.lustre.org
> [mailto:lustre-discuss-boun...@lists.lustre.org] *On Behalf Of* Lundgren, Andrew
> *Sent:* Tuesday, October 06, 2009 11:14 AM
> *To:* lustre-discuss
> *Subject:* [Lustre-discuss] 1.8.0 Loosing connection to the MDT for 
> several minutes and then recovering.
> 
>  
> 
> We have a few 1.8.0 clusters running.  We have seen multiple instances 
> now where the clients lose connectivity to the MDT.  The MDS logs 
> indicate that there is some sort of problem on the MDT.
> 
>  
> 
> The following is a typical output:
> 
>  
> 
> Oct  6 02:56:08 mint1502 kernel: LustreError: 
> 28999:0:(handler.c:161:mds_sendpage()) @@@ bulk failed: timeout 0(4096), 
> evicting 7523f416-2579-5f49-cd3f-54d2438b8...@net_0x2ce213b0b_uuid
> 
> Oct  6 02:56:08 mint1502 kernel:   r...@8100ac9f4000 
> x1314647461000449/t0 
> o37->7523f416-2579-5f49-cd3f-54d2438b8...@net_0x2ce213b0b_uuid:0/0 
> lens 408/360 e 1 to 0 dl 1254797793 ref 1 fl Interpret:/0/0 rc 0/0
> 
> Oct  6 02:56:16 mint1502 kernel: Lustre: Request x1312747398000879 sent 
> from content-OST001d-osc to NID 207.123.49...@tcp 7s ago has timed out 
> (limit 7s).
> 
> Oct  6 02:56:16 mint1502 kernel: LustreError: 166-1: 
> content-OST001d-osc: Connection to service content-OST001d via nid 
> 207.123.49...@tcp was lost; in progress operations using this service 
> will fail.
> 
> Oct  6 02:56:16 mint1502 kernel: LustreError: Skipped 1 previous similar 
> message
> 
> Oct  6 02:56:17 mint1502 kernel: LustreError: 166-1: 
> content-OST001c-osc: Connection to service content-OST001c via nid 
> 207.123.49...@tcp was lost; in progress operations using this service 
> will fail.
> 
> Oct  6 02:56:17 mint1502 kernel: LustreError: Skipped 1 previous similar 
> message
> 
> Oct  6 02:56:18 mint1502 kernel: LustreError: 138-a: content-MDT: A 
> client on nid 207.123.49...@tcp was evicted due to a lock blocking 
> callback to 207.123.49...@tcp timed out: rc -107
> 
> Oct  6 02:56:18 mint1502 kernel: LustreError: 138-a: content-MDT: A 
> client on nid 207.123.4...@tcp was evicted due to a lock blocking 
> callback to 207.123.4...@tcp timed out: rc -107
> 
> Oct  6 02:56:18 mint1502 kernel: BUG: soft lockup - CPU#2 stuck for 10s! 
> [ll_mdt_rdpg_04:28999]
> 
> Oct  6 02:56:18 mint1502 kernel: CPU 2:
> 
> Oct  6 02:56:18 mint1502 kernel: Modules linked in: mds(U) 
> fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) 
> mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) 
> lvfs(U) libcfs(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) sunrpc(U) 
> bonding(U) ip_conntrack_netbios_ns(U) ipt_REJECT(U) xt_tcpudp(U) 
> xt_state(U) ip_conntrack(U) nfnetlink(U) iptable_filter(U) ip_tables(U) 
> x_tables(U) dm_round_robin(U) dm_rdac(U) dm_multipath(U) video(U) sbs(U) 
> backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) 
> acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) 
> i2c_i801(U) e1000e(U) sr_mod(U) i2c_core(U) i5000_edac(U) cdrom(U) 
> pata_acpi(U) shpchp(U) edac_mc(U) serio_raw(U) sg(U) pcspkr(U) 
> dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) usb_storage(U) lpfc(U) 
> scsi_transport_fc(U) ahci(U) ata_piix(U) libata(U) mptsas(U) mptscsih(U) 
> mptbase(U) scsi_transport_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) 
> uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
> 
> Oct  6 02:56:18 mint1502 kernel: Pid: 28999, comm: ll_mdt_rdpg_04 
> Tainted: G  2.6.18-92.1.17.el5_lustre.1.8.0smp #1
> 
> Oct  6 02:56:18 mint1502 kernel: RIP: 0010:[]  
> [] :obdclass:lustre_hash_for_each_empty+0x1f0/0x290
> 
> Oct  6 02:56:18 mint1502 kernel: RSP: 0018:8104c02ad850  EFLAGS: 
> 0206
> 
> Oct  6 02:56:18 mint1502 kernel: RAX: 810448dfd200 RBX: 
> 328c RCX: ba75
> 
> Oct  6 02:56:18 mint1502 kernel: RDX: 5e7d RSI: 
> 802f0d80 RDI: c200109d78cc
> 
> Oct  6 02:56:18 mint1502 kernel: RBP: 8860c8f2 R08: 
> 810001016e60 R09: 000

Re: [Lustre-discuss] MDS overload, why?

2009-10-09 Thread Michael Kluge
Hi Arne,

Could be memory pressure and the OOM killer running and shooting at things.
How much memory does your server have?


Michael

Am Freitag, den 09.10.2009, 10:26 +0200 schrieb Arne Brutschy:
> Hi everyone,
> 
> 2 months ago, we switched our ~80 node cluster from NFS to lustre. 1
> MDS, 4 OSTs, lustre 1.6.7.2 on a rocks 4.2.1/centos 4.2/linux
> 2.6.9-78.0.22.
> 
> We were quite happy with lustre's performance, especially because
> bottlenecks caused by /home disk access were history.
> 
> Saturday, the cluster went down (= was inaccessible). After some
> investigation I found out that the reason seems to be an overloaded MDS.
> Over the following 4 days, this happened multiple times and could only
> be resolved by 1) killing all user jobs and 2) hard-resetting the MDS.
> 
> The MDS did not respond to any command; when I managed to get a video
> signal (not often), the load was >170. Additionally, a kernel oops was
> displayed twice, but unfortunately I have no record of them.
> 
> The clients showed the following error:
> > Oct  8 09:58:55 majorana kernel: LustreError: 
> > 3787:0:(events.c:66:request_out_callback()) @@@ type 4, status -5  
> > r...@f6222800 x8702488/t0 o250->m...@10.255.255.206@tcp:26/25 lens 304/456 
> > e 0 to 1 dl 1254988740 ref 2 fl Rpc:N/0/0 rc 0/0
> > Oct  8 09:58:55 majorana kernel: LustreError: 
> > 3787:0:(events.c:66:request_out_callback()) Skipped 33 previous similar 
> > messages
> 
> So, my question is: what could cause such a load? The cluster was not
> excessively used... Is this a bug or a user's job that creates the load?
> How can I protect lustre against this kind of failure?
> 
> Thanks in advance,
> Arne 
> 
-- 

Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih




[Lustre-discuss] MDS overload, why?

2009-10-09 Thread Arne Brutschy
Hi everyone,

2 months ago, we switched our ~80 node cluster from NFS to lustre. 1
MDS, 4 OSTs, lustre 1.6.7.2 on a rocks 4.2.1/centos 4.2/linux
2.6.9-78.0.22.

We were quite happy with lustre's performance, especially because
bottlenecks caused by /home disk access were history.

Saturday, the cluster went down (= was inaccessible). After some
investigation I found out that the reason seems to be an overloaded MDS.
Over the following 4 days, this happened multiple times and could only
be resolved by 1) killing all user jobs and 2) hard-resetting the MDS.

The MDS did not respond to any command; when I managed to get a video
signal (not often), the load was >170. Additionally, a kernel oops was
displayed twice, but unfortunately I have no record of them.

The clients showed the following error:
> Oct  8 09:58:55 majorana kernel: LustreError: 
> 3787:0:(events.c:66:request_out_callback()) @@@ type 4, status -5  
> r...@f6222800 x8702488/t0 o250->m...@10.255.255.206@tcp:26/25 lens 304/456 e 
> 0 to 1 dl 1254988740 ref 2 fl Rpc:N/0/0 rc 0/0
> Oct  8 09:58:55 majorana kernel: LustreError: 
> 3787:0:(events.c:66:request_out_callback()) Skipped 33 previous similar 
> messages

So, my question is: what could cause such a load? The cluster was not
excessively used... Is this a bug or a user's job that creates the load?
How can I protect lustre against this kind of failure?

Thanks in advance,
Arne 

-- 
Arne Brutschy
Ph.D. StudentEmailarne.brutschy(AT)ulb.ac.be
IRIDIA CP 194/6  Web  iridia.ulb.ac.be/~abrutschy
Universite' Libre de Bruxelles   Tel  +32 2 650 3168
Avenue Franklin Roosevelt 50 Fax  +32 2 650 2715
1050 Bruxelles, Belgium  (Fax at IRIDIA secretary)
