[Lustre-discuss] small cluster

2008-09-05 Thread Paolo Supino
Hi

  I'm new to the world of HPC and clusters :-( and I was given the task of
setting up a new (small) HPC cluster based on 36 Sun X2200 M2 systems. Each
system has one 136GB SAS HD. I've installed CentOS 5.2 as the base OS on all
the systems and through partitioning managed to leave an unused partition of
~65GB (and with 36 nodes I get over 2TB of distributed storage). The rest of
the HD is laid out as follows: 4xRAM = 64GB swap and ~10GB for the OS itself.
I want to put a clustered FS on the unused partitions, and from looking around
it seems that Lustre is king of the hill ...
  The question I have is: will I gain anything from using Lustre in such a
tight environment, or am I going to degrade the cluster's performance so much
that it is simply not worth the effort?



--
TIA
Paolo
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] small cluster

2008-09-05 Thread Kevin Van Maren
Paolo,

The biggest problem with using Lustre in this environment is that the 
Lustre _servers_ (those providing the disk space) cannot _use_ the 
filesystem (you cannot have a client and a server on the same machine).

Kevin


Paolo Supino wrote:
 Hi

   I'm new to the world of HPC and clusters :-( and I was given the task 
 of setting up a new (small) HPC cluster based on 36 Sun X2200 M2 
 systems. Each system has one 136GB SAS HD. I've installed CentOS 5.2 as 
 the base OS on all the systems and through partitioning managed to 
 leave an unused partition of ~65GB (and with 36 nodes I get over 2TB of 
 distributed storage). The rest of the HD is laid out as follows: 4xRAM = 
 64GB swap and ~10GB for the OS itself. I want to put a clustered FS on 
 the unused partitions, and from looking around it seems that Lustre is 
 king of the hill ...
   The question I have is: will I gain anything from using Lustre in 
 such a tight environment, or am I going to degrade the cluster's 
 performance so much that it is simply not worth the effort?



 --
 TIA
 Paolo
   
 

   

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre clients failing, and cant reconnect

2008-09-05 Thread Brian J. Murrell
On Fri, 2008-09-05 at 00:15 -0400, Brock Palen wrote:
 Looks like that didn't fix it.  One of the login nodes repeated the  
 behavior.

So what are the messages the client logged when the problem occurred?
And what, if anything was logged on the MDS at the same time?
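
A rough sketch of how to capture both sides (assuming the stock CentOS
syslog location; the dump file name is just an example), run on the
affected client and on the MDS around the time of the failure:

# pull the Lustre lines out of syslog
grep Lustre /var/log/messages | tail -100

# dump the in-kernel Lustre debug buffer for more detail
lctl dk /tmp/lustre-debug.log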

b.



___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre clients failing, and cant reconnect

2008-09-05 Thread Brock Palen
I had to reboot the MDS to get the problem to go away.
I will watch and see if it reappears. I screwed up and deleted the
wrong /var/log/messages, so I don't have the messages.

I am watching this issue.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[EMAIL PROTECTED]
(734)936-1985



On Sep 5, 2008, at 10:01 AM, Brian J. Murrell wrote:
 On Fri, 2008-09-05 at 00:15 -0400, Brock Palen wrote:
 Looks like that didn't fix it.  One of the login nodes repeated the
 behavior.

 So what are the messages the client logged when the problem occurred?
 And what, if anything was logged on the MDS at the same time?

 b.


___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre clients failing, and cant reconnect

2008-09-05 Thread Jerome, Ron
For what it's worth...  I've seen similar problems with clients not
being able to connect to OSSs

SERVER OS: Linux oss1 2.6.18-53.1.14.el5_lustre.1.6.5.1smp #1 SMP Thu
Jun 26 01:38:50 EDT 2008 i686 i686 i386 GNU/Linux
CLIENT OS: Linux x15 2.6.18-53.1.14.el5_lustre.1.6.5smp #1 SMP Mon May
12 22:24:24 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux


On the client side I see this...

  CLIENT LOG
=
Sep  1 15:17:22 x15 kernel: Lustre: Request x30990319 sent from
data-OST0004-osc-81022067ec00 to NID [EMAIL PROTECTED] 100s ago has
timed out (limit
100s).
Sep  1 15:17:22 x15 kernel: Lustre: Skipped 9 previous similar messages
Sep  1 15:17:22 x15 kernel: Lustre: data-OST0004-osc-81022067ec00:
Connection to service data-OST0004 via nid [EMAIL PROTECTED] was lost;
in progress
 operations using this service will wait for recovery to complete.
Sep  1 15:17:22 x15 kernel: LustreError:
3834:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -11 from cancel
RPC: canceling anyway
Sep  1 15:17:22 x15 kernel: LustreError:
3834:0:(ldlm_request.c:1575:ldlm_cli_cancel_list())
ldlm_cli_cancel_list: -11
Sep  1 15:17:22 x15 kernel: LustreError: 11-0: an error occurred while
communicating with [EMAIL PROTECTED] The ost_connect operation failed
with -16
Sep  1 15:17:22 x15 kernel: LustreError: Skipped 2 previous similar
messages
Sep  1 15:17:47 x15 kernel: Lustre:
3216:0:(import.c:395:import_select_connection())
data-OST0004-osc-81022067ec00: tried all connections, increasing
 latency to 6s
Sep  1 15:17:47 x15 kernel: Lustre:
3216:0:(import.c:395:import_select_connection()) Skipped 4 previous
similar messages
Sep  1 15:17:47 x15 kernel: LustreError: 11-0: an error occurred while
communicating with [EMAIL PROTECTED] The ost_connect operation failed
with -16
Sep  1 15:18:37 x15 last message repeated 2 times
Sep  1 15:19:02 x15 kernel: LustreError: 11-0: an error occurred while
communicating with [EMAIL PROTECTED] The ost_connect operation failed
with -16
Sep  1 15:19:27 x15 kernel: Lustre:
3216:0:(import.c:395:import_select_connection())
data-OST0004-osc-81022067ec00: tried all connections, increasing
 latency to 26s
Sep  1 15:19:27 x15 kernel: Lustre:
3216:0:(import.c:395:import_select_connection()) Skipped 3 previous
similar messages
Sep  1 15:19:27 x15 kernel: LustreError: 11-0: an error occurred while
communicating with [EMAIL PROTECTED] The ost_connect operation failed
with -16
Sep  1 15:20:17 x15 last message repeated 2 times
Sep  1 15:21:07 x15 kernel: LustreError: 11-0: an error occurred while
communicating with [EMAIL PROTECTED] The ost_connect operation failed
with -16
Sep  1 15:21:07 x15 kernel: LustreError: Skipped 1 previous similar
message
Sep  1 15:21:57 x15 kernel: Lustre:
3216:0:(import.c:395:import_select_connection())
data-OST0004-osc-81022067ec00: tried all connections, increasing
 latency to 51s
Sep  1 15:21:57 x15 kernel: Lustre:
3216:0:(import.c:395:import_select_connection()) Skipped 5 previous
similar messages
Sep  1 15:22:22 x15 kernel: LustreError: 11-0: an error occurred while
communicating with [EMAIL PROTECTED] The ost_connect operation failed
with -16
Sep  1 15:22:22 x15 kernel: LustreError: Skipped 2 previous similar
messages
Sep  1 15:24:52 x15 kernel: LustreError: 11-0: an error occurred while
communicating with [EMAIL PROTECTED] The ost_connect operation failed
with -16
Sep  1 15:24:52 x15 kernel: LustreError: Skipped 5 previous similar
messages
Sep  1 15:27:22 x15 kernel: Lustre:
3216:0:(import.c:395:import_select_connection())
data-OST0004-osc-81022067ec00: tried all connections, increasing
 latency to 51s
Sep  1 15:27:22 x15 kernel: Lustre:
3216:0:(import.c:395:import_select_connection()) Skipped 12 previous
similar messages
Sep  1 15:29:27 x15 kernel: LustreError: 11-0: an error occurred while
communicating with [EMAIL PROTECTED] The ost_connect operation failed
with -16
Sep  1 15:29:27 x15 kernel: LustreError: Skipped 10 previous similar
messages
Sep  1 15:37:47 x15 kernel: Lustre:
3216:0:(import.c:395:import_select_connection())
data-OST0004-osc-81022067ec00: tried all connections, increasing
 latency to 51s
Sep  1 15:37:47 x15 kernel: Lustre:
3216:0:(import.c:395:import_select_connection()) Skipped 24 previous
similar messages
Sep  1 15:38:12 x15 kernel: LustreError: 11-0: an error occurred while
communicating with [EMAIL PROTECTED] The ost_connect operation failed
with -16
Sep  1 15:38:12 x15 kernel: LustreError: Skipped 20 previous similar
messages
Sep  1 15:48:12 x15 kernel: Lustre:
3216:0:(import.c:395:import_select_connection())
data-OST0004-osc-81022067ec00: tried all connections, increasing
 latency to 51s
  END CLIENT LOG
=


Server log at corresponding time...

  SERVER LOG
=
Aug 31 04:02:04 oss1 syslogd 

Re: [Lustre-discuss] Starting a new MGS/MDS

2008-09-05 Thread Aaron Knister
Does the new MDS actually have an MGS running? FYI, you only need one
MGS per Lustre setup. In the commands you issued it doesn't look like
you actually set up an MGS on the host mds2. Can you run "lctl dl" on
mds2 and send the output?

On Sep 4, 2008, at 4:54 PM, Ms. Megan Larko wrote:

 Hi,

 I have a new MGS/MDS that I would like to start.  It is another box with
 the same CentOS 5 kernel 2.6.18-53.1.13.el5 and
 lustre-1.6.4.3smp as my other boxes.  Initially I had an IP number
 that was used elsewhere in our group.  I
 changed it using the tunefs.lustre command below for the new MDT.

 [EMAIL PROTECTED] ~]# tunefs.lustre --erase-params --writeconf
 [EMAIL PROTECTED] /dev/sdd1
 checking for existing Lustre data: found CONFIGS/mountdata
 Reading CONFIGS/mountdata

   Read previous values:
 Target: crew8-MDT
 Index:  unassigned
 Lustre FS:  crew8
 Mount type: ldiskfs
 Flags:  0x71
  (MDT needs_index first_time update )
 Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
 Parameters: [EMAIL PROTECTED]


   Permanent disk data:
 Target: crew8-MDT
 Index:  unassigned
 Lustre FS:  crew8
 Mount type: ldiskfs
 Flags:  0x171
  (MDT needs_index first_time update writeconf )
 Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
 Parameters: [EMAIL PROTECTED]

 Writing CONFIGS/mountdata

 Next I try to mount this new MDT onto the system
 [EMAIL PROTECTED] ~]# mount -t lustre /dev/sdd1 /srv/lustre/mds/crew8-MDT
 mount.lustre: mount /dev/sdd1 at /srv/lustre/mds/crew8-MDT failed:
 Input/output error
 Is the MGS running?

 Ummm---  yeah, I thought the MGS is running.

 [EMAIL PROTECTED] ~]# tail /var/log/messages
 Sep  4 16:28:08 mds2 kernel: LDISKFS-fs: mounted filesystem with
 ordered data mode.
 Sep  4 16:28:13 mds2 kernel: LustreError:
 3526:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at
 1220560088, 5s ago)  [EMAIL PROTECTED] x3/t0
 o250-[EMAIL PROTECTED]@o2ib_0:26 lens 240/272 ref 1 fl Rpc:/0/0 rc
 0/-22
 Sep  4 16:28:13 mds2 kernel: LustreError:
 3797:0:(obd_mount.c:954:server_register_target()) registration with
 the MGS failed (-5)
 Sep  4 16:28:13 mds2 kernel: LustreError:
 3797:0:(obd_mount.c:1054:server_start_targets()) Required registration
 failed for crew8-MDT: -5
 Sep  4 16:28:13 mds2 kernel: LustreError: 15f-b: Communication error
 with the MGS.  Is the MGS running?
 Sep  4 16:28:13 mds2 kernel: LustreError:
 3797:0:(obd_mount.c:1570:server_fill_super()) Unable to start targets:
 -5
 Sep  4 16:28:13 mds2 kernel: LustreError:
 3797:0:(obd_mount.c:1368:server_put_super()) no obd crew8-MDT
 Sep  4 16:28:13 mds2 kernel: LustreError:
 3797:0:(obd_mount.c:119:server_deregister_mount()) crew8-MDT not
 registered
 Sep  4 16:28:13 mds2 kernel: Lustre: server umount crew8-MDT  
 complete
 Sep  4 16:28:13 mds2 kernel: LustreError:
 3797:0:(obd_mount.c:1924:lustre_fill_super()) Unable to mount  (-5)

 The o2ib network is up.   It is ping-able via bash and lctl.   I can
 get to it from itself and from other computers on
 this local subnet.

 [EMAIL PROTECTED] ~]# lctl
 lctl > ping [EMAIL PROTECTED]
 [EMAIL PROTECTED]
 [EMAIL PROTECTED]
 lctl > ping [EMAIL PROTECTED]
 [EMAIL PROTECTED]
 [EMAIL PROTECTED]
 lctl > quit

 On this net, there are no firewalls as the computers are using only
 non-routable IP numbers.  So there is not a
 firewall issue of which I am aware...
 [EMAIL PROTECTED] ~]# iptables -L
 -bash: iptables: command not found

 The only oddity I have found is that the modules on my working MGS/MDS
 show higher use counts than the modules on my
 new MGS/MDT.

 Correctly functioning MGS/MDT:
 [EMAIL PROTECTED] ~]# lsmod | grep mgs
 mgs   181512  1
 mgc86744  2 mgs
 ptlrpc659512  8 osc,mds,mgs,mgc,lustre,lov,lquota,mdc
 obdclass  542200  13
 osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ptlrpc
 lvfs   84712  12
 osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ptlrpc,obdclass
 libcfs183128  14
 osc 
 ,mds 
 ,fsfilt_ldiskfs 
 ,mgs,mgc,lustre,lov,lquota,mdc,ko2iblnd,ptlrpc,obdclass,lnet,lvfs
 [EMAIL PROTECTED] ~]# lsmod | grep osc
 osc   172136  11
 ptlrpc659512  8 osc,mds,mgs,mgc,lustre,lov,lquota,mdc
 obdclass  542200  13
 osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ptlrpc
 lvfs   84712  12
 osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ptlrpc,obdclass
 libcfs183128  14
 osc 
 ,mds 
 ,fsfilt_ldiskfs 
 ,mgs,mgc,lustre,lov,lquota,mdc,ko2iblnd,ptlrpc,obdclass,lnet,lvfs
 [EMAIL PROTECTED] ~]# lsmod | grep lnet
 lnet  255656  4 lustre,ko2iblnd,ptlrpc,obdclass
 libcfs183128  14
 osc 
 ,mds 
 ,fsfilt_ldiskfs 
 ,mgs,mgc,lustre,lov,lquota,mdc,ko2iblnd,ptlrpc,obdclass,lnet,lvfs

 Failing MGS/MDT:
 [EMAIL PROTECTED] ~]# lsmod | grep mgs
 mgs   181512  0
 mgc

Re: [Lustre-discuss] Starting a new MGS/MDS

2008-09-05 Thread Andreas Dilger
On Sep 05, 2008  11:11 -0400, Aaron Knister wrote:
 Does the new MDS actually have an MGS running? FYI- you only need one  
 mgs per lustre set up. In the commands you issued it doesn't look like  
 you actually set up an MGS on the host mds2. Can you run an lctl  
 dl on mds2 and send the output?

There are tradeoffs between having a single MGS for multiple filesystems,
and having one MGS per filesystem (assuming different MDS nodes).  In
general, there isn't much benefit to sharing an MGS between multiple MDS
nodes, and the drawback is that it is a single point of failure, so you
may as well have one per MDS.
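
As a rough sketch of the two layouts (the device names, fsname and MGS NID
here are made-up examples, and mkfs.lustre reformats the device, so this
only applies to a fresh setup):

# combined MGS+MDT on one device: one MGS per filesystem
mkfs.lustre --fsname=crew8 --mgs --mdt /dev/sdd1
mount -t lustre /dev/sdd1 /srv/lustre/mds/crew8-MDT

# or a dedicated MGS device, with the MDT pointing at it
mkfs.lustre --mgs /dev/sdc1
mkfs.lustre --fsname=crew8 --mdt --mgsnode=10.10.10.1@o2ib /dev/sdd1

In the combined case, mounting the MDT also starts the MGS, so there is no
separate mgsnode parameter to get wrong.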

 On Sep 4, 2008, at 4:54 PM, Ms. Megan Larko wrote:
 
  Hi,
 
  I have a new MGS/MDS that I would like to start.  It is another box with
  the same CentOS 5 kernel 2.6.18-53.1.13.el5 and
  lustre-1.6.4.3smp as my other boxes.  Initially I had an IP number
  that was used elsewhere in our group.  I
  changed it using the tunefs.lustre command below for the new MDT.
 
  [EMAIL PROTECTED] ~]# tunefs.lustre --erase-params --writeconf
  [EMAIL PROTECTED] /dev/sdd1
  checking for existing Lustre data: found CONFIGS/mountdata
  Reading CONFIGS/mountdata
 
Read previous values:
  Target: crew8-MDT
  Index:  unassigned
  Lustre FS:  crew8
  Mount type: ldiskfs
  Flags:  0x71
   (MDT needs_index first_time update )
  Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
  Parameters: [EMAIL PROTECTED]
 
 
Permanent disk data:
  Target: crew8-MDT
  Index:  unassigned
  Lustre FS:  crew8
  Mount type: ldiskfs
  Flags:  0x171
   (MDT needs_index first_time update writeconf )
  Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
  Parameters: [EMAIL PROTECTED]
 
  Writing CONFIGS/mountdata
 
  Next I try to mount this new MDT onto the system
  [EMAIL PROTECTED] ~]# mount -t lustre /dev/sdd1 
  /srv/lustre/mds/crew8-MDT
  mount.lustre: mount /dev/sdd1 at /srv/lustre/mds/crew8-MDT failed:
  Input/output error
  Is the MGS running?
 
  Ummm---  yeah, I thought the MGS is running.
 
  [EMAIL PROTECTED] ~]# tail /var/log/messages
  Sep  4 16:28:08 mds2 kernel: LDISKFS-fs: mounted filesystem with
  ordered data mode.
  Sep  4 16:28:13 mds2 kernel: LustreError:
  3526:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at
  1220560088, 5s ago)  [EMAIL PROTECTED] x3/t0
  o250-[EMAIL PROTECTED]@o2ib_0:26 lens 240/272 ref 1 fl Rpc:/0/0 rc
  0/-22
  Sep  4 16:28:13 mds2 kernel: LustreError:
  3797:0:(obd_mount.c:954:server_register_target()) registration with
  the MGS failed (-5)
  Sep  4 16:28:13 mds2 kernel: LustreError:
  3797:0:(obd_mount.c:1054:server_start_targets()) Required registration
  failed for crew8-MDT: -5
  Sep  4 16:28:13 mds2 kernel: LustreError: 15f-b: Communication error
  with the MGS.  Is the MGS running?
  Sep  4 16:28:13 mds2 kernel: LustreError:
  3797:0:(obd_mount.c:1570:server_fill_super()) Unable to start targets:
  -5
  Sep  4 16:28:13 mds2 kernel: LustreError:
  3797:0:(obd_mount.c:1368:server_put_super()) no obd crew8-MDT
  Sep  4 16:28:13 mds2 kernel: LustreError:
  3797:0:(obd_mount.c:119:server_deregister_mount()) crew8-MDT not
  registered
  Sep  4 16:28:13 mds2 kernel: Lustre: server umount crew8-MDT  
  complete
  Sep  4 16:28:13 mds2 kernel: LustreError:
  3797:0:(obd_mount.c:1924:lustre_fill_super()) Unable to mount  (-5)
 
  The o2ib network is up.   It is ping-able via bash and lctl.   I can
  get to it from itself and from other computers on
  this local subnet.
 
  [EMAIL PROTECTED] ~]# lctl
  lctl > ping [EMAIL PROTECTED]
  [EMAIL PROTECTED]
  [EMAIL PROTECTED]
  lctl > ping [EMAIL PROTECTED]
  [EMAIL PROTECTED]
  [EMAIL PROTECTED]
  lctl > quit
 
  On this net, there are no firewalls as the computers are using only
  non-routable IP numbers.  So there is not a
  firewall issue of which I am aware...
  [EMAIL PROTECTED] ~]# iptables -L
  -bash: iptables: command not found
 
  The only oddity I have found is that the modules in my working MGS/MDS
  are used more than the modules in my
  new MGS/MDT.
 
  Correctly functioning MGS/MDT:
  [EMAIL PROTECTED] ~]# lsmod | grep mgs
  mgs   181512  1
  mgc86744  2 mgs
  ptlrpc659512  8 osc,mds,mgs,mgc,lustre,lov,lquota,mdc
  obdclass  542200  13
  osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ptlrpc
  lvfs   84712  12
  osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ptlrpc,obdclass
  libcfs183128  14
  osc 
  ,mds 
  ,fsfilt_ldiskfs 
  ,mgs,mgc,lustre,lov,lquota,mdc,ko2iblnd,ptlrpc,obdclass,lnet,lvfs
  [EMAIL PROTECTED] ~]# lsmod | grep osc
  osc   172136  11
  ptlrpc659512  8 osc,mds,mgs,mgc,lustre,lov,lquota,mdc
  obdclass  542200  13
  osc,mds,fsfilt_ldiskfs,mgs,mgc,lustre,lov,lquota,mdc,ptlrpc
  lvfs   84712  12
  

Re: [Lustre-discuss] small cluster

2008-09-05 Thread Andreas Dilger
On Sep 05, 2008  07:59 -0600, Kevin Van Maren wrote:
 The biggest problem with using Lustre in this environment is that the 
 Lustre _servers_ (those providing the disk space) cannot _use_ the 
 filesystem (you cannot have a client and a server on the same machine).

Well, it depends on how much memory pressure there is.  It definitely
works with client-on-OST (I run my home system that way, and lots of
Lustre developers do functional tests in that config) but it has the
chance of deadlock if there are memory-hungry apps running on the
same OSS.

This could be avoided by forcing all writes to be to remote OSS nodes,
possibly at the cost of some performance, though network IO on a 1GigE
is faster than to a single disk if you have a good switch.

The other (more major, IMHO) issue is that if the client/OSS node
crashes, it takes its disk with it, and now other clients cannot access
the data there.  Similarly, without RAID of the disk, any disk loss
would mean loss of 1/36 of the filesystem.

This could be avoided by mirroring each OST's disk with DRBD (RAID1) to a
remote node that is configured as the OST failover server.  There is a
page on the Lustre wiki about using DRBD.
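
As a rough sketch of the Lustre side of that setup (the DRBD resource and
device names, fsname and NIDs are hypothetical, and the DRBD configuration
itself is omitted):

# format the OST on the DRBD device, declaring both possible servers
mkfs.lustre --fsname=hpcfs --ost --mgsnode=192.168.1.1@tcp \
    --failnode=192.168.1.12@tcp /dev/drbd0
mount -t lustre /dev/drbd0 /mnt/lustre/ost0

# if the primary node dies, promote DRBD on the peer and remount there
drbdadm primary r0
mount -t lustre /dev/drbd0 /mnt/lustre/ost0

Clients will then retry the failnode NID and resume IO once recovery on
the backup node completes.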

Please keep the list informed (and wiki updated) if you decide to
use this configuration.

 Paolo Supino wrote:
    I'm new to the world of HPC and clusters :-( and I was given the task 
  of setting up a new (small) HPC cluster based on 36 Sun X2200 M2 
  systems. Each system has one 136GB SAS HD. I've installed CentOS 5.2 as 
  the base OS on all the systems and through partitioning managed to 
  leave an unused partition of ~65GB (and with 36 nodes I get over 2TB of 
  distributed storage). The rest of the HD is laid out as follows: 4xRAM = 
  64GB swap and ~10GB for the OS itself. I want to put a clustered FS on 
  the unused partitions, and from looking around it seems that Lustre is 
  king of the hill ...
    The question I have is: will I gain anything from using Lustre in 
  such a tight environment, or am I going to degrade the cluster's 
  performance so much that it is simply not worth the effort?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre directory sizes - fast du

2008-09-05 Thread Andreas Dilger
On Sep 04, 2008  18:55 +0100, Peter Grandi wrote:
  Hi all, since our users have managed to write several TBs to
  Lustre by now, they sometimes would like to know what and how
  much there is in their directories. Is there any smarter way
  to find out than to do a du -hs dirname and wait for 30min
  for the 12TB-answer ?

One possibility would be to enable quotas, with limits large enough for
every user that they won't hamper normal usage.  This will track space
usage for each user.
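
A rough sketch of that with the 1.6 quota tools (the mount point, user
name, and limits are made-up examples; quota support also has to be
enabled on the MDT/OSTs, e.g. via the quota_type parameter, first):

# one-time scan to build the quota files
lfs quotacheck -ug /mnt/lustre

# per-user block limits are in KB: no soft limit, ~10TB hard limit
lfs setquota -u alice 0 10000000000 0 0 /mnt/lustre

# afterwards, per-user usage comes back without walking the tree
lfs quota -u alice /mnt/lustre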

Depending on your coding skills it might even be possible to change
the quota code so that it only tracked space usage but didn't enforce
usage limits.  This would potentially reduce the overhead of quotas
because there is no need for OSTs to check the quota limits during IO.

 If you have any patches that speed up the fetching of (what are
 likely to be) millions of records from random places on a disk
 very quickly, and also speed up the latency of the associated
 network roundtrips please let us know :-).

That is always our goal as well :-).

 I've already told them to substitute ls -l by find -type f
 -exec ls -l {} \;, although I'm not too sure about that either.

I don't think that will help at all. ls is a crazy bunch of
code that does stat on the directory and all kinds of extra
work.  Possibly better would be:

lfs find ${dir} -type f | xargs stat -c %b
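
To get a single total out of that, the per-file block counts could be
summed, e.g. (a sketch that assumes the usual 512-byte stat blocks and
filenames without spaces):

lfs find ${dir} -type f | xargs stat -c %b |
    awk '{sum += $1} END {printf "%.1f GiB\n", sum * 512 / 1024^3}'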

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss