[Lustre-discuss] ptlrpcd

2008-02-29 Thread Thomas Roth
Hi all, does anybody know what ptlrpcd is good for and what it might be doing? I've seen it eating 100% CPU on an OSS where I reformatted an OST, but also in other circumstances. Regards, Thomas

[Lustre-discuss] Quota setup fails because of OST ordering

2008-03-03 Thread Thomas Roth
Hi all, after installing a Lustre test file system consisting of 34 OSTs, I encountered a strange error when trying to set up quotas: lfs quotacheck gave me an Input/Output error, while in /var/log/kern.log I found a Lustre error LustreError: 20807:0:(quota_check.c:227:lov_quota_check()) lov
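A minimal sketch of the quota setup being attempted here, assuming a 1.6-era client with the filesystem mounted at /lustre (mount point and user name are placeholders):
  ~# lfs quotacheck -ug /lustre     # scan MDT and all OSTs, build user and group quota files
  ~# lfs quota -u someuser /lustre  # verify that usage is now reported
quotacheck has to walk every OST in turn, so a problem on a single OST (or an unexpected OST ordering, as in this thread) can fail the whole command with EIO.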

Re: [Lustre-discuss] Identifier removed on Lustre 1.6.4.3

2008-04-25 Thread Thomas Roth
Thanks for your advice! Indeed, it was the lack of NIS on the MDS that caused that error. Cheers, Thomas

Re: [Lustre-discuss] Disappearing OSTs

2008-05-05 Thread Thomas Roth

Re: [Lustre-discuss] MDS Fail-Over planning.

2008-05-07 Thread Thomas Roth

[Lustre-discuss] free space of Lustre fs: unknown quantity?

2008-05-07 Thread Thomas Roth
Hi all, on our test cluster I observe a rather large difference between df and lfs df: lfs df /lustre UUID 1K-blocks Used Available Use% Mounted on ... filesystem summary: 137454163496 7056126648
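The comparison can be reproduced on any client, assuming the mount point is /lustre:
  # df -h /lustre      # kernel statfs view of the mounted filesystem
  # lfs df -h /lustre  # per-MDT and per-OST breakdown as Lustre sees it
Differences between the two usually come from how free space is aggregated (for example, space already granted to clients but not yet written), rather than from data actually missing.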

[Lustre-discuss] Size of MDT, used space

2008-05-13 Thread Thomas Roth
Hi all, I'm still in trouble with numbers: the available, used and necessary space on my MDT: According to lfs df, I have now filled my file system with 115.3 TB. All of these files are sized 5 MB. That should be roughly 24 million files. For the MDT, lfs df reports 28.2 GB used. Now I believed
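As a rough consistency check (the per-file overhead here is an estimate, not a measured value):
  115.3 TB / 5 MB per file  ≈ 24 million files
  28.2 GB / 24 million files ≈ 1.2 kB of MDT space per file
That is about what an inode plus the striping EA costs; the ~4 kB per file often quoted for MDT sizing is a planning rule of thumb rather than the actual consumption.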

[Lustre-discuss] no file creation on not yet full Lustre

2008-07-02 Thread Thomas Roth
Hi all, on a Lustre FS v1.6.5 with Debian Etch kernel 2.6.22 that is 94% full, I can't create any more files. Each OST still has room for ~2 GB: # lfs df UUID 1K-blocks Used Available Use% Mounted on MDT_UUID 495497804 669376 466511924 0% /lustre[MDT:0]

Re: [Lustre-discuss] no file creation on not yet full Lustre

2008-07-02 Thread Thomas Roth
Oh, that was one of the (many) things I skipped when reading the manual. Seems to have done the trick, thanks a lot. Thomas

[Lustre-discuss] OSS crashes

2008-07-23 Thread Thomas Roth
Hi all, I've experienced reproducible OSS crashes with 1.6.5 but also 1.6.4.3/1.6.4.2. The cluster is running Debian Etch64, kernel 2.6.22. The OSS are file servers with two OSTs. I'm now testing it by just using one OSS in the system (but encountered the problem first with 9 OSS), mounting

Re: [Lustre-discuss] OSS crashes

2008-07-23 Thread Thomas Roth
Hi, Brian J. Murrell wrote: On Wed, 2008-07-23 at 13:56 +0200, Thomas Roth wrote: Hi all, Hi, I've experienced reproducible OSS crashes with 1.6.5 but also 1.6.4.3/1.6.4.2. The cluster is running Debian Etch64, kernel 2.6.22. The OSS are file servers with two OSTs. I'm now testing

Re: [Lustre-discuss] OSS crashes

2008-07-24 Thread Thomas Roth

[Lustre-discuss] LustreError: acquire timeout exceeded

2008-07-29 Thread Thomas Roth
Hi all, I've encountered a LustreError that might have triggered an unwanted failover of an MGS/MDS HA pair of servers. I'm not sure about the latter, but at least I have not found a trace of that error via Google, so it might be worth considering. And it occurred in this form only the two

[Lustre-discuss] More: OSS crashes

2008-07-31 Thread Thomas Roth
Hi all, I'm still successful in bringing my OSSs to a standstill if not crashing them. Having reduced the number of stress jobs writing to Lustre (stress -d 2 --hdd-noclean --hdd-bytes 5M) to four, and having reduced the number of OSS threads (options ost oss_num_threads=256 in
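For reference, the two knobs mentioned above, with example values (the modprobe file name varies by distribution, and the option must be in place before the ost module loads):
  # e.g. /etc/modprobe.d/lustre.conf on the OSS
  options ost oss_num_threads=256
  # client-side load generator used for the test
  stress -d 2 --hdd-noclean --hdd-bytes 5M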

[Lustre-discuss] operation X on unconnected MDS / MGS

2008-08-01 Thread Thomas Roth
Hi all, for some time now I have been getting error messages on my MGS/MDS about operations on an unconnected MGS or MDS. Aug 1 18:08:53 kernel: LustreError: 14182:0:(mgs_handler.c:538:mgs_handle()) lustre_mgs: operation 400 on unconnected MGS Aug 1 18:09:19 kernel: LustreError:

[Lustre-discuss] Lustre directory sizes - fast du

2008-09-04 Thread Thomas Roth
Hi all, since our users have managed to write several TBs to Lustre by now, they sometimes would like to know what and how much there is in their directories. Is there any smarter way to find out than to do a du -hs dirname and wait for 30min for the 12TB-answer ? I've already told them to
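The usual alternatives to a full du, assuming quotas are enabled and a directory belongs to a single user (both are assumptions for illustration):
  # du -sh /lustre/somedir         # walks every file, slow on millions of entries
  # lfs quota -u someuser /lustre  # instantaneous, but per user rather than per directory
Anything per-directory that is faster than du needs an external database of the namespace, which is essentially what tools like Robinhood later provided.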

Re: [Lustre-discuss] lustre/drbd/heartbeat setup [was: drbd async mode]

2008-10-13 Thread Thomas Roth
Hi, I read your instructions - that's pretty much the setup we are using, too. And it works very well, drbd 0.8 notwithstanding, but on a hardware RAID. I do not quite understand your remark about not using an extra net for drbd - have you tried putting the name that's in your drbd.conf together

[Lustre-discuss] LBUG mds_reint.c, questions about recovery time

2008-10-13 Thread Thomas Roth
Hi all, I just ran into an LBUG on an MDS still running Lustre version 1.6.3 with kernel 2.6.18, Debian Etch; kern.log cf. below. You will probably tell me that this is a known bug, already fixed or to be fixed (I'm unsure how to search for such a thing in Bugzilla). But my main question concerns the

[Lustre-discuss] setquota fails

2008-11-28 Thread Thomas Roth
Hi all, on an empty and unused Lustre 1.6.5.1 system I cannot reset or set the quota: ~# lfs quota -u troth /lustre Disk quotas for user troth: Filesystem kbytes quota limit grace files quota limit grace /lustre 4 3072000 309200 1 11000
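For comparison, the general shape of a setquota call, with placeholder limits (the exact option syntax differs between 1.6.x releases - older ones take positional limits instead of -b/-B/-i/-I - so check lfs help setquota on the installed version):
  ~# lfs setquota -u troth -b 3000000 -B 3200000 -i 10000 -I 11000 /lustre
where -b/-B are the block soft/hard limits in kB and -i/-I the inode soft/hard limits.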

Re: [Lustre-discuss] MDS: lock timed out -- not entering recovery in server code, just going back to sleep

2008-12-02 Thread Thomas Roth
Brian J. Murrell wrote: On Thu, 2008-11-27 at 19:18 +0100, Thomas Roth wrote: Nov 27 17:57:41 lustre kernel: LustreError: 3974:0:(ldlm_request.c:64:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1227804060, 1001s ago); not entering recovery in server code, just going

[Lustre-discuss] More: setquota fails, mds adjust qunit failed

2008-12-04 Thread Thomas Roth
with alt-sysrq-t provided the sysctl variable kernel.sysrq equals 1)? Andrew.

[Lustre-discuss] setquota fails, mds adjust qunit failed, quota_interface.c ... quit checking

2008-12-04 Thread Thomas Roth
or mds_quota_recovery as suggested by Andrew yet found nothing. So, what should I do? Unfortunately the system is already in use by other people, so just starting fresh with a global mkfs.lustre is not an option ;-) Regards, Thomas Thomas Roth wrote: Hi, I'm still having these problems

[Lustre-discuss] Quota Info lost

2008-12-18 Thread Thomas Roth
Hi all, we somehow lost our quota info - a check of the quotas of a user gave me user quotas are not enabled This system, running Lustre 1.6.5.1, was set up end of October, by which time I had also enabled quotas there. Of course I had also run some tests then which showed that quotas were

[Lustre-discuss] Connection losses to MGS/MDS

2008-12-18 Thread Thomas Roth
Hi all, in a cluster with 375 clients, for a 12 hour period I get about 500 messages of the type Connection to service MGS via nid a.b@tcp was lost; in progress operations using this service will fail. and about 800 messages of the type Connection to service MDT via nid

Re: [Lustre-discuss] Connection losses to MGS/MDS

2008-12-19 Thread Thomas Roth
. Regards, Thomas cheers Wojciech Thomas Roth wrote: Hi all, in a cluster with 375 clients, for a 12 hour period I get about 500 messages of the type Connection to service MGS via nid a.b@tcp was lost; in progress operations using this service will fail. and about 800

[Lustre-discuss] Upgrade-Procedure

2009-01-06 Thread Thomas Roth
Hi all, I've just upgraded a 1 MDT - 2 OST - 2 client test cluster from Lustre version 1.6.5.1 to 1.6.6. However, I did not follow the manual ( http://manual.lustre.org/manual/LustreManual16_HTML/UpgradingLustre.html#50548855_pgfId-1289726): I did not use the tunefs.lustre command on MGS/MDT
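The tunefs.lustre step referred to is presumably along these lines (device path is a placeholder; whether --writeconf is needed depends on the exact upgrade path, so the cited manual page should be followed on a production system):
  # on the MGS/MDT, with the target unmounted
  ~# tunefs.lustre --writeconf /dev/mdtdev
  # likewise on each OST, then remount MGS/MDT first and the OSTs afterwards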

Re: [Lustre-discuss] 1.6.5.1 - 1.6.6

2009-01-08 Thread Thomas Roth
Hi Wojciech, thanks for your explanation of writeconf. Of course, the man pages as well as the Lustre manual state that writeconf is a potentially dangerous operation. I'm always afraid to manipulate the MDT: it might go wrong and I would end up with hundreds of TB of junk (as a restore of backups never

[Lustre-discuss] Lustre MDS Errors 1-7 and operation 101

2009-01-14 Thread Thomas Roth
Hi all, on our production cluster we have had, for a surprisingly long time (1 day), only the following two error messages (and no visible problems), although the system is under heavy load right now: Jan 14 10:44:33 server1 kernel: LustreError: 5118:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@

Re: [Lustre-discuss] Lustre MDS Errors 1-7 and operation 101

2009-01-15 Thread Thomas Roth
Lustre talks about an unconnected MGS. Thomas

Re: [Lustre-discuss] Lustre MDS Errors 1-7 and operation 101

2009-01-15 Thread Thomas Roth
things that might cause operation 101 problems. 101 = LDLM_ENQUEUE, so this is just a lock enqueue. Cheers, Andreas

Re: [Lustre-discuss] Removing an OST

2009-01-19 Thread Thomas Roth

[Lustre-discuss] root squash nodes

2009-02-19 Thread Thomas Roth
Hi all, from a prior mail here I took it that you can have only one client that is not root-squashed. Now the manual, http://manual.lustre.org/manual/LustreManual16_HTML/LustreSecurity.html#50548870_pgfId-1292749 , talks about multiple NIDs to be used here: lctl conf_param
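The manual page cited uses conf_param settings of roughly this shape (testfs, the uid:gid pair and the NIDs are placeholders, and the exact parameter prefix depends on the Lustre version):
  ~# lctl conf_param testfs.mdt.root_squash="99:99"
  ~# lctl conf_param testfs.mdt.nosquash_nids="192.168.0.[1-10]@tcp 10.1.1.5@tcp"
nosquash_nids takes a list, which is what allows more than one client to stay un-squashed.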

[Lustre-discuss] Recovery without end

2009-02-25 Thread Thomas Roth
Hi all, we have a problem with our production system (v. 1.6.5.1). It is in recovery, but recovery never finishes. The background is some unknown problem with the MDT and attempts to restart the MDS etc. The MDT would start recovery, at some point during recovery lose connection to its OSTs,

Re: [Lustre-discuss] Recovery without end

2009-02-25 Thread Thomas Roth
entries on my MGS/MDT server: mgs, mgc, mdt, lov, mds. The correct device name for the lctl command is the one after mds. Regards, Thomas

[Lustre-discuss] MDS Recovery: which client

2009-02-27 Thread Thomas Roth
Hi all, just to repeat my question without further surrounding facts and doubts: How can I find out which client is currently being recovered (if the MDS is in recovery at that moment)? How do I find out which client is not recoverable (if recovery gets stuck)? The MDS seems to know, because
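The recovery progress the MDS knows about is exposed under /proc; a typical way to watch it (the target name is an example):
  ~# cat /proc/fs/lustre/mds/lustre-MDT0000/recovery_status
It reports connected vs. expected client counts and the remaining recovery window, though only as numbers, not as the NID of the client that is still missing.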

[Lustre-discuss] MDS refuses connections (no visible reason)

2009-03-05 Thread Thomas Roth
Hi all, after running for days without any problems, our MDS has been refusing cooperation for two hours now. The log files show nothing until Mar 5 16:46:24 mds1 kernel: Lustre: 17841:0:(ldlm_lib.c:525:target_handle_reconnect()) MDT: 481fa70b-590d-31b6-f621-c6125a54bfff reconnecting Mar 5

[Lustre-discuss] Aborting recovery

2009-03-05 Thread Thomas Roth
Hi all, after the problem with our MDS I reported earlier this evening, I did indeed a complete restart of the server, so afterwards the MDS was in recovery. There were 386 clients to be recovered. After 100 min, only one of these was left, but obviously it never came back. So I aborted
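For the record, aborting recovery by hand goes through lctl (the device number is whatever lctl dl reports for the MDT on the affected server):
  ~# lctl dl                           # look up the device number of the MDT target
  ~# lctl --device <N> abort_recovery
Alternatively the target can be mounted with -o abort_recov to skip recovery entirely.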

Re: [Lustre-discuss] Aborting recovery

2009-03-06 Thread Thomas Roth
Thanks Brian. Brian J. Murrell wrote: On Thu, 2009-03-05 at 22:19 +0100, Thomas Roth wrote: My question: what happens to the one client that was not recovered? It, and all of the clients that have transactions that need to be replayed after the AWOL client's transactions are all evicted

Re: [Lustre-discuss] Aborting recovery

2009-03-06 Thread Thomas Roth
Brian J. Murrell wrote: On Fri, 2009-03-06 at 10:45 +0100, Thomas Roth wrote: Thanks Brian. NP. What I meant: the average batch job that wants to read from or write to Lustre will abort if a file cannot be accessed. The reason doesn't matter to the jobs or the user. That may be so

Re: [Lustre-discuss] Aborting recovery

2009-03-06 Thread Thomas Roth
Brian J. Murrell wrote: On Fri, 2009-03-06 at 20:09 +0100, Thomas Roth wrote: But this is not what our users observe. Even on an otherwise perfectly working system, they report I/O errors on access to some files. EIO == eviction. I can usually see something happening in the logs of OST

[Lustre-discuss] MDT connection refusal: still busy with 2 active RPCs

2009-04-09 Thread Thomas Roth
Hi all, our cluster is becoming increasingly unusable due to refused connections, with typical log entries on the MDS: ldlm_lib.c:target_handle_connect()) lustre-MDT: refuse reconnection from 77cbd453-ee72-fe75-cb06-c49179e0a...@lustre-client@tcp to 0x810111341000; still

Re: [Lustre-discuss] Has anyone had experience with heartbeat and drdb providing full redundancy on lustre clusters

2009-04-23 Thread Thomas Roth

[Lustre-discuss] quotacheck blows up MDT

2009-04-24 Thread Thomas Roth
Hi all, in a recent shutdown of our Lustre cluster (net reconfig, Version upgrade to 1.6.7_patched), I decided to try to switch on quotas - this had failed when the cluster went operational last year. Again, I suffered from the same error as last year - failure, and device/resource busy. This

Re: [Lustre-discuss] quotacheck blows up MDT

2009-04-24 Thread Thomas Roth

[Lustre-discuss] new errors w adaptive timeouts?

2009-04-24 Thread Thomas Roth
Hi all, since the upgrade to version 1.6.7_patched, our MDT prints huge numbers of messages like: 0100:0400:4:1240598372.080847:0:4842:0:(service.c:753:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (42/30), not sending early reply r...@8107faf6f450 x18776/t0

[Lustre-discuss] quota problem 1.6.7.1

2009-05-18 Thread Thomas Roth
Hi all, after several attempts to enable quotas on our 1.6.7.1 cluster - it always blows up the MDS; I've complained earlier on this list - the MDT now keeps sending these messages: lustre kernel: LustreError: 3825:0:(ldlm_lib.c:1840:target_handle_dqacq_callback()) dqacq failed! (rc:-5)

[Lustre-discuss] MDT crash: ll_mdt at 100%

2009-07-02 Thread Thomas Roth
Hi all, our MDT gets stuck and unresponsive under very high loads (Lustre 1.6.7.1, kernel 2.6.22, 8 cores, 32 GB RAM). The only thing that stands out is one ll_mdt_?? process running at 100% CPU. Nothing unusual was happening on the cluster before that. After a reboot as well as after moving the

Re: [Lustre-discuss] MDT crash: ll_mdt at 100%

2009-07-03 Thread Thomas Roth
and possibly do a writeconf everywhere. I see that a similar problem was reported by Mag in March this year, but no clues or solutions appeared. Any ideas? Yours, Thomas

Re: [Lustre-discuss] MDT crash: ll_mdt at 100%

2009-07-03 Thread Thomas Roth

[Lustre-discuss] MDT move aka backup w rsync

2009-07-15 Thread Thomas Roth
Hi all, I want to move a MDT from one server to another. After studying some mails concerning MDT backup, I've just tried (successfully, it seems) to do that on a small test system with rsync: - Stop Lustre, umount all servers. - Format a suitable disk partition on the new hardware, using the
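A sketch of the rsync variant described above (device names and mount points are placeholders; the rsync in use must be able to copy extended attributes, since the striping information lives in EAs):
  ~# mount -t ldiskfs /dev/old_mdt /mnt/old
  ~# mount -t ldiskfs /dev/new_mdt /mnt/new
  ~# rsync -a --xattrs /mnt/old/ /mnt/new/
The manual's traditional method saves the EAs separately with getfattr -R -d -m '.*' and moves the files with tar, restoring both on the new device.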

Re: [Lustre-discuss] MDT move aka backup w rsync

2009-07-16 Thread Thomas Roth
Andreas Dilger wrote: On Jul 15, 2009 18:35 +0200, Thomas Roth wrote: ... The traditional backup method of getting the EAs and tar-ing the MDT doesn't finish in finite time. It did before, and the filesystem has since grown by a mere 40GB of data, so it shouldn't take that much longer

[Lustre-discuss] client missing inodes

2009-07-21 Thread Thomas Roth
Hi all, recently, many of our clients have been reporting errors of the type: Jul 21 14:40:41 kernel: LustreError: 26871:0:(file.c:3024:ll_inode_revalidate_fini()) failure -2 inode 72692692 -2 is no such file or directory Am I right that this means some inode info got lost on the MDT? Also,

[Lustre-discuss] File sizes on MDT

2009-07-27 Thread Thomas Roth
' on the MDT has any practical implication?

Re: [Lustre-discuss] File sizes on MDT

2009-07-28 Thread Thomas Roth
Hi Andreas, Andreas Dilger wrote: On Jul 27, 2009 14:24 +0200, Thomas Roth wrote: I'm copying around data between 2 MDTs in a test system. Having mounted the partitions as 'ldiskfs', I had a look in MDT/ROOT. I found all my test data there, but I'm puzzled by the indicated file sizes

Re: [Lustre-discuss] Inode errors at time of job failure

2009-08-06 Thread Thomas Roth
Hi, these ll_inode_revalidate_fini errors are unfortunately quite known to us. So what would you guess if that happens again and again, on a number of clients - MDT softly dying away? Because we haven't seen any mass evictions (and no reasons for that) in connection with these errors. Or could the

Re: [Lustre-discuss] Inode errors at time of job failure

2009-08-07 Thread Thomas Roth
correlated with this fsck ;-)). So I'm still not reassured concerning the health of this MDT. We are running Lustre v 1.6.7.2 on the servers, the clients mainly still on 1.6.5.1. Regards, Thomas Oleg Drokin wrote: Hello! On Aug 6, 2009, at 12:57 PM, Thomas Roth wrote: Hi

[Lustre-discuss] changed server handles on OST and EIOs

2009-08-19 Thread Thomas Roth
what's going on here? TIA, Thomas

[Lustre-discuss] Lustre and kernel vulnerability CVE-2009-2692

2009-08-21 Thread Thomas Roth
Hi all, while trying to fix the recent kernel vulnerability (CVE-2009-2692) we found that in most cases, our Lustre 1.6.5.1, 1.6.6 and 1.6.7.2 clients seemed to be quite well protected, at least against the published exploit: wunderbar_emporium seems to work, but then the root shell never

Re: [Lustre-discuss] Lustre and kernel vulnerability CVE-2009-2692

2009-08-21 Thread Thomas Roth
Peter Kjellstrom wrote: On Friday 21 August 2009, Thomas Roth wrote: Hi all, while trying to fix the recent kernel vulnerability (CVE-2009-2692) we found that in most cases, our Lustre 1.6.5.1, 1.6.6 and 1.6.7.2 clients seemed to be quite well protected, at least against the published

[Lustre-discuss] LustreError: ptlrpc body, buffer size, message magic

2009-09-21 Thread Thomas Roth
Hi all, on our 1.6.7.2 system, the MDT is quite busy writing the following type of messages to the log, and I would just like to ask if somebody has an idea what they mean and if they mean harm: Sep 21 19:50:30 mds1 kernel: LustreError: 6009:0:(pack_generic.c:566:lustre_msg_buf_v2()) msg

[Lustre-discuss] Bad distribution of files among OSTs

2009-10-30 Thread Thomas Roth
Hi all, in our 196-OST cluster, the previously even distribution of files among the OSTs stopped working about two weeks ago. The fill level of most OSTs is between 57% and 62%, but some (~10) have risen to 94%. I'm trying to fix that by having these OSTs deactivated on the MDT and
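The deactivate-and-migrate procedure being used here, with placeholder device numbers and paths:
  # on the MDS: stop new object allocation on a full OST
  ~# lctl dl | grep osc                # find the osc device for lustre-OST00xx
  ~# lctl --device <N> deactivate
  # on a client: list files with objects on that OST
  ~# lfs find --obd lustre-OST00xx_UUID /lustre/somedir
Copying each listed file to a new name and renaming it back recreates its objects on the still-active OSTs.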

Re: [Lustre-discuss] Bad distribution of files among OSTs

2009-10-31 Thread Thomas Roth
quite a number of lists of users' data from running lfs find --obd OST... /lustre/..., I just haven't run these lists through ls -lh yet. Too busy moving the files instead of measuring them ;-) Regards, Thomas

Re: [Lustre-discuss] Bad distribution of files among OSTs

2009-11-01 Thread Thomas Roth
levels of the OSTs. Regards, Thomas

[Lustre-discuss] OSS extremely slow in response, ll_ost load high

2009-11-03 Thread Thomas Roth
Hi all, in our 1.6.7.2 / Debian / kernel 2.6.22 cluster, two servers with 2 and 3 OSTs respectively have started to block, in the sense that commands like lfs df have to wait ~30 s when they reach these OSTs in the list. Some of our clients do not have this problem, some have these contact(?)

[Lustre-discuss] Lots of No ctxt after OST crush

2010-04-26 Thread Thomas Roth
: LustreError: 13531:0:(mds_lov.c:960:__mds_lov_synchronize()) gsilust-OST00b3_UUID sync failed -2, deactivating

[Lustre-discuss] lock callback timer expired, lock on destroyed export, locks stolen, busy with active RPCs, operation 400 on unconnected MDS

2010-05-03 Thread Thomas Roth

[Lustre-discuss] lnet infiniband config

2010-06-22 Thread Thomas Roth
Hi all, I'm getting my feet wet in the infiniband lake and of course I run into some problems. It would seem I got the compilation part of sles11 kernel 2.6.27 + Lustre 1.8.3 + ofed 1.4.2 right, because it allows me to see and use the infiniband fabric, and because ko2iblnd loads without any
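For context, the LNET module configuration for a node with both an IB and an ethernet interface usually looks like this (interface names and the file location are examples):
  # e.g. /etc/modprobe.d/lustre.conf
  options lnet networks="o2ib0(ib0),tcp0(eth0)"
After loading, lctl list_nids on that node should show one NID per configured network.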

Re: [Lustre-discuss] lnet infiniband config

2010-06-22 Thread Thomas Roth
on distinct networks. lctl list_nids will show you the lustre nids of the node you're logged into only. lctl route_list will show you the lustre routers and the networks that they bridge. I hope this was helpful. Erik On Tue, Jun 22, 2010 at 10:19 AM, Thomas Roth t.r...@gsi.de wrote: Hi

Re: [Lustre-discuss] lnet infiniband config

2010-06-30 Thread Thomas Roth
On 22.06.2010 16:19, Thomas Roth wrote: Hi all, I'm getting my feet wet in the infiniband lake and of course I run into some problems. It would seem I got the compilation part of sles11 kernel 2.6.27 + Lustre 1.8.3 + ofed 1.4.2 right, because it allows me to see and use the infiniband fabric

[Lustre-discuss] error e2fsck run for lfsck

2010-09-10 Thread Thomas Roth
Hi all, on a 1.8.4 test system, I tried to prepare for lfsck and got an error from e2fsck: mds:~# e2fsck -n -v --mdsdb /tmp/mdsdb /dev/sdb2 e2fsck 1.41.10.sun2 (24-Feb-2010) lustre-MDT lustre database creation, check forced. Pass 1: Checking inodes, blocks, and sizes MDS: ost_idx 0 max_id
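For orientation, the sequence these database files feed into, as far as the 1.8-era lfsck procedure goes (device names and paths are placeholders; the mdsdb file has to be copied to each OSS and to the client running lfsck):
  mds:~#    e2fsck -n -v --mdsdb /tmp/mdsdb /dev/sdb2
  oss:~#    e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb /dev/ostdev
  client:~# lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb /lustre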

Re: [Lustre-discuss] error e2fsck run for lfsck

2010-09-18 Thread Thomas Roth
Thanks, Daniel. I have tried on another test system without pools, and there it worked indeed. Regards, Thomas On 09/10/2010 08:48 PM, Daniel Kobras wrote: Hi Thomas! On Fri, Sep 10, 2010 at 08:16:57PM +0200, Thomas Roth wrote: on a 1.8.4 test system, I tried prepare for lfsck and got

[Lustre-discuss] Question about adaptive timeouts, not sending early reply

2010-09-18 Thread Thomas Roth
Hi all, I'm trying to understand MDT logs and adaptive timeouts. After the upgrade to 1.8.4, and while users still believed Lustre to be in maintenance (= no activity), the MDT log just shows Lustre: 19823:0:(service.c:808:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (42/30), not sending

Re: [Lustre-discuss] ls does not work on ram disk for normal user

2010-09-22 Thread Thomas Roth

[Lustre-discuss] mkfs.lustre fails, ldiskfs: ext4 or ext3 ?

2010-11-03 Thread Thomas Roth
? Cheers, Thomas

Re: [Lustre-discuss] [robinhood-support] robinhood error messages

2010-11-24 Thread Thomas Roth
. Subject: [Lustre-discuss] robinhood error messages From: Thomas Roth t.r...@gsi.de Date: Tue, 23 Nov 2010 20:20:33 +0100 To: lustre-discuss@lists.lustre.org

Re: [Lustre-discuss] [robinhood-support] robinhood error messages

2010-11-24 Thread Thomas Roth
On 24.11.2010 15:17, LEIBOVICI Thomas wrote: Thomas Roth wrote: ListMgr | DB query failed in ListMgr_Insert line 340... and assorted messages, which seem to indicate that the new robinhood scan tries to put something into the DB that is already there, and stumbles on this. Or maybe

[Lustre-discuss] MDT raid parameters, multiple MGSes

2011-01-21 Thread Thomas Roth
Hi all, we have gotten new MDS hardware, and I've got two questions: What are the recommendations for the RAID configuration and formatting options? I was following the recent discussion about these aspects on an OST: chunk size, strip size, stride-size, stripe-width etc. in the light of the
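The arithmetic behind those formatting options, on made-up numbers (chunk size and disk count are assumptions):
  stride       = RAID chunk size / 4 kB filesystem block = 64 kB / 4 kB = 16
  stripe-width = stride x number of data disks           = 16 x 4       = 64
For an MDT, which does mostly small random I/O, the usual advice is RAID10 rather than RAID5/6, so these values matter less there than on an OST.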

[Lustre-discuss] llverfs outcome

2011-01-27 Thread Thomas Roth
Hi all, I have run llverfs (lustre-utils 1.8.4) on an OST partition as llverfs -w -v /srv/OST0002. That went smoothly until all 9759209724 kB were written, terminating with: write File name: /srv/OST0002/dir00072/file022 write complete llverfs: writing /srv/OST0002/llverfs.filecount failed: No

Re: [Lustre-discuss] llverfs outcome

2011-01-31 Thread Thomas Roth
partitions.. Cheers, Thomas On 27.01.2011 20:06, Andreas Dilger wrote: On 2011-01-27, at 04:56, Thomas Roth wrote: I have run llverfs (lustre-utils 1.8.4) on an OST partition as llverfs -w -v /srv/OST0002. That went smoothly until all 9759209724 kB were written, terminating

Re: [Lustre-discuss] Migrating MDT volume to a new location

2011-02-03 Thread Thomas Roth

[Lustre-discuss] MDT extremely slow after restart

2011-04-02 Thread Thomas Roth
Hi all, we are suffering from a severe metadata performance degradation on our 1.8.4 cluster and are pretty clueless. - We moved the MDT to new hardware, since the old hardware was failing - We increased the size of the MDT with 'resize2fs' (+ mounted it and saw all the files) - We found the

Re: [Lustre-discuss] MDT extremely slow after restart

2011-04-04 Thread Thomas Roth
change when you switched hardware? The 'still busy' message is a bug, may be fixed in 1.8.5. cliffw

Re: [Lustre-discuss] aacraid kernel panic caused failover

2011-04-06 Thread Thomas Roth

Re: [Lustre-discuss] aacraid kernel panic caused failover

2011-04-06 Thread Thomas Roth

[Lustre-discuss] high OSS load - readcache_max_filesize

2011-05-05 Thread Thomas Roth
Hi all, a recent posting here (which I can't find atm) has pointed me to http://jira.whamcloud.com/browse/LU-15, where an issue is discussed that we seem to see as well: some OSS really get overloaded, and the log says slow journal start 36s due to heavy IO load slow commitrw commit 36s due to
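A mitigation often suggested for this kind of OSS overload is to keep large files out of the server-side read cache, roughly (the 32M threshold is only an example value):
  oss:~# lctl set_param obdfilter.*.readcache_max_filesize=32M
so that bulk files bypass the OSS page cache and memory pressure stays bounded.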

[Lustre-discuss] Mount 2 clusters, different networks - LNET tcp1-tcp2-o2ib

2011-06-14 Thread Thomas Roth
Hi all, I'd like to mount two Lustre filesystems on one client. Issues with more than one MGS set aside, the point here is that one of them is an Infiniband-cluster, the other is ethernet-based. And my client is on the ethernet. I have managed to mount the o2ib-fs by setting up an LNET router,
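A sketch of the client-side LNET configuration for such a mixed setup, with entirely hypothetical addresses and interface names:
  # ethernet-only client reaching the IB filesystem through an LNET router
  options lnet networks="tcp1(eth0)" routes="o2ib 192.168.10.1@tcp1"
The ethernet filesystem's servers must then also be reachable on a network the client has configured (tcp1 here), or the client needs a matching tcp0 NID as well - exactly the complication discussed in the follow-ups.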

Re: [Lustre-discuss] Mount 2 clusters, different networks - LNET tcp1-tcp2-o2ib

2011-06-14 Thread Thomas Roth
@tcp Cheers, Thomas On 06/14/2011 07:00 PM, Michael Shuey wrote: Is your ethernet FS in tcp1, or tcp0? Your config bits indicate the client is in tcp1 - do the servers agree? -- Mike Shuey On Tue, Jun 14, 2011 at 12:23 PM, Thomas Roth t.r...@gsi.de wrote: Hi all, I'd like

Re: [Lustre-discuss] Mount 2 clusters, different networks - LNET tcp1-tcp2-o2ib

2011-06-14 Thread Thomas Roth
every 300 seconds (and re-enables it if found). Hope this helps. -- Mike Shuey On Tue, Jun 14, 2011 at 1:26 PM, Thomas Roth t.r...@gsi.de wrote: Hm, the ethernet FS is in tcp0 - MGS says its nids are MGS-IP@tcp. So not surprising it refuses that connection. On the other hand

Re: [Lustre-discuss] Mount 2 clusters, different networks - LNET tcp1-tcp2-o2ib - solved?

2011-06-14 Thread Thomas Roth
=statement seems to say: If you have data for tcp, use the Default-Router-IP and go via the interface that is on network tcp1. Oh well, I should probably take some networking lectures... Regards, Thomas

[Lustre-discuss] Emptied OSTs not empty

2011-06-27 Thread Thomas Roth
Hi all, I am currently moving files off a number of OSTs - some in a machine with a predicted hardware failure, some to decommission old hardware, etc. I'm deactivating the OSTs on the MDS, then running lfs find --obd OST_UUID /dir to create a list of files to migrate. When finished, the

Re: [Lustre-discuss] problem with clients and multiple transports

2011-11-09 Thread Thomas Roth
Your clients have both ib and tcp nids? Because I encountered a strange behavior trying to mount an ib based FS and a tcp based FS on the same (ethernet-only) client. To connect to the ib MDS it had to go through a lnet router, of course. Experimentally, I found options lnet

Re: [Lustre-discuss] Problems with lustre router setup IB - TCP

2011-12-23 Thread Thomas Roth

Re: [Lustre-discuss] removing ost

2012-03-24 Thread Thomas Roth

Re: [Lustre-discuss] Lustre on Debian

2012-04-06 Thread Thomas Roth
Hi Marinho, no problem for Lustre 1.8. All the necessary packages are here: http://pkg-lustre.alioth.debian.org/backports/lustre-1.8.7-wc1-squeeze/ We've been running Lustre on Debian since version 1.5.9 (aka beta for 1.6, on Sarge! ;-)). Now we are at 3.5 PB, on 200+ servers. No Debian-specific

Re: [Lustre-discuss] Problems getting Lustre started with ZFS

2013-10-26 Thread Thomas Roth

Re: [Lustre-discuss] how do I deactivate a very wonky OST

2015-01-22 Thread Thomas Roth

[lustre-discuss] lnet peer credits

2016-08-01 Thread Thomas Roth
8 -419 0 (The last line, the only peer that is "up", is an LNET-router) Something to worry about? Cheers, Thomas

[lustre-discuss] client server communication half-lost, read-out?

2016-08-01 Thread Thomas Roth
shark et al.? Cheers, Thomas

[lustre-discuss] ZFS not freeing disk space

2016-08-10 Thread Thomas Roth
I think I have seen this behavior before, and the "df" result shrank to an expected value after the server had been rebooted. In that case, this seems more like an overly persistent caching effect? Cheers, Thomas

Re: [lustre-discuss] MDT quota problem / MDS crash 2.5.3

2016-07-14 Thread Thomas Roth
support (using 'tune2fs -O ^quota' and a subsequent 'tunefs.lustre --quota') on the MDT, we were able to repair it. Maybe that helps in your case too... On Tue, 12 Jul 2016, Thomas Roth wrote: Hi all, we are running Lustre 2.5.3 on our servers. OSTs are on ZFS, MDS is on ldiskfs
