Re: [lustre-discuss] ldiskfs ost size limit

2015-07-21 Thread Götz Waschk
Thanks Ben,

is there a public document where I could have found this limit?

Regards, Götz

On Tue, Jul 21, 2015 at 4:33 PM, Ben Evans bev...@cray.com wrote:
 128 TB is the current limit

 You can force more than that, but it looks like you won't need to.
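 For reference, a 60 TB LUN is well inside that limit, so a plain format should
 work. A minimal sketch, assuming ldiskfs; the fsname, index, MGS NID and device
 path below are placeholders, not values from this thread:

     # format and mount a 60 TB LUN as one ldiskfs OST (all names hypothetical)
     mkfs.lustre --ost --fsname=testfs --index=12 \
         --mgsnode=10.0.0.1@o2ib /dev/mapper/ost12
     mount -t lustre /dev/mapper/ost12 /mnt/lustre/ost12

 Running tunefs.lustre --print on the device afterwards will show the parameters
 it was formatted with.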

 -Ben Evans

 -Original Message-
 From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On 
 Behalf Of Götz Waschk
 Sent: Tuesday, July 21, 2015 10:18 AM
 To: lustre-discuss@lists.lustre.org
 Subject: [lustre-discuss] ldiskfs ost size limit

 Dear Lustre experts,

 I'm in the process of installing a new Lustre file system based on version 
 2.5. What is the size limit for an OST when using ldiskfs? Can I format a 60
 TB device with ldiskfs?

 Regards,
 Götz Waschk
 ___
 lustre-discuss mailing list
 lustre-discuss@lists.lustre.org
 http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



-- 
AL I:40: Do what thou wilt shall be the whole of the Law.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Speeding up recovery

2015-07-21 Thread Indivar Nair
1) You mention they are on the same host.  Are they on separate partitions
already?
 As you have failover configured I'm assuming that both servers can see the
storage. In which case this will not be too difficult (depending on your
failover software of course) if they have separate partitions.

Yes, they are separate DRBD Devices. So mounting any one of them on the
other server is easy.
But how do I tell the OSS nodes that the MGS or MDT has moved to a new IP/Host?
And how do I reconfigure the failover on the device I move?
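I am guessing it would be something along these lines, run on each target with
the filesystem unmounted -- the NIDs and DRBD device names below are only
placeholders, so please correct me if this is wrong:

    # point a target at the relocated MGS and declare its failover pair,
    # then regenerate the configuration logs
    tunefs.lustre --erase-params \
        --mgsnode=192.168.1.21@tcp --mgsnode=192.168.1.22@tcp \
        --servicenode=192.168.1.31@tcp --servicenode=192.168.1.32@tcp \
        --writeconf /dev/drbd_ost0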

2) so today Linux clients use the native client? And you are planning on
shifting this to use the NFS service from a gateway node, is that correct?
   How do they connect to the lustre servers today? QDR IB?
 How will they reach the gateway nodes after this change? NFS over IB? NFS
over RDMA?

Yes, the Linux Hosts use Lustre Native Clients. Windows Hosts connect via
the Gateway.
The Gateway Nodes uses Infiniband+RDMA to connect to Lustre.
I am thinking of moving the Linux Native Clients to NFS, connecting them
through this Gateway.
All client nodes are on 1GbE network.
Infiniband is used only to connect the Gateway to Lustre.

Regards,


Indivar Nair


On Tue, Jul 21, 2015 at 8:29 PM, Wahl, Edward ew...@osc.edu wrote:

  1) You mention they are on the same host.  Are they on separate
 partitions already?
  As you have failover configured I'm assuming that both servers can see
 the storage. In which case this will not be too difficult (depending on
 your failover software of course) if they have separate partitions.


 2) so today Linux clients use the native client? And you are planning on
 shifting this to use the NFS service from a gateway node, is that correct?
How do they connect to the lustre servers today? QDR IB?
  How will they reach the gateway nodes after this change? NFS over IB? NFS
 over RDMA?


 Ed

  --
 *From:* lustre-discuss [lustre-discuss-boun...@lists.lustre.org] on
 behalf of Indivar Nair [indivar.n...@techterra.in]
 *Sent:* Tuesday, July 21, 2015 4:27 AM
 *To:* lustre-discuss; hpdd-discuss
 *Subject:* [lustre-discuss] Speeding up recovery

Hi ...,

  Currently, Failover and Recovery take a very long time in our
 setup; almost 20 Minutes. We would like to make it as fast as possible.

  I have two queries regarding this -

 1.
 ===
  The MGS and MDT are on the same host.

  We do however have a passive stand-by server for the MGS/MDT server,
 which only mounts these partitions in case of a failure.

  *Current Setup*
  Server A: MGS+MDT
  Server B: Failover MGS+MDT

  I was wondering whether I can now move the MGS or MDT Partition to the
 standby server (so that imperative recovery works properly) -

  *New Setup*
  Server A: MDT  *Failover MGS*
  Server B: *MGS*  Failover MDT

 *OR *
 Server A: *MGS*  Failover MDT
  Server B: MDT  *Failover MGS*

  i.e.

 *Can I separate the MDT and MGS partitions on to different machines
 without formatting or reinstalling Lustre? *
 ===

 2.
 ===
  This storage is used by around 150 Workstations and 150 Compute (Render)
 Nodes.

  Out of these 150 workstations, around 30 - 40 are MS Windows. The MS
 Windows clients access the storage through a 2-node Samba Gateway Cluster.

  The Gateway Nodes are connected to the storage through a QDR Infiniband
 Network.

  We were thinking of adding NFS Service to the Samba Gateway nodes, and
 reconfiguring the Linux clients to connect via this gateway.

  This will bring down the direct Lustre Clients to just 2 nodes.
  *So, will having only 2 clients improve the failover-recovery time?*
  ===

  Is there anything else we can do to speed up recovery?

  Regards,


  Indivar Nair

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Speeding up recovery

2015-07-21 Thread Wahl, Edward
1) You mention they are on the same host.  Are they on separate partitions 
already?
 As you have failover configured I'm assuming that both servers can see the 
storage. In which case this will not be too difficult (depending on your 
failover software of course) if they have separate partitions.


2) so today Linux clients use the native client? And you are planning on 
shifting this to use the NFS service from a gateway node, is that correct?
   How do they connect to the lustre servers today? QDR IB?
 How will they reach the gateway nodes after this change? NFS over IB? NFS over 
RDMA?


Ed


From: lustre-discuss [lustre-discuss-boun...@lists.lustre.org] on behalf of 
Indivar Nair [indivar.n...@techterra.in]
Sent: Tuesday, July 21, 2015 4:27 AM
To: lustre-discuss; hpdd-discuss
Subject: [lustre-discuss] Speeding up recovery

Hi ...,

Currently, Failover and Recovery take a very long time in our setup;
almost 20 Minutes. We would like to make it as fast as possible.

I have two queries regarding this -

1.
===
The MGS and MDT are on the same host.

We do however have a passive stand-by server for the MGS/MDT server, which only 
mounts these partitions in case of a failure.

Current Setup
Server A: MGS+MDT
Server B: Failover MGS+MDT

I was wondering whether I can now move the MGS or MDT Partition to the standby 
server (so that imperative recovery works properly) -

New Setup
Server A: MDT  Failover MGS
Server B: MGS  Failover MDT
   OR
Server A: MGS  Failover MDT
Server B: MDT  Failover MGS

i.e.
Can I separate the MDT and MGS partitions on to different machines without 
formatting or reinstalling Lustre?
===

2.
===
This storage is used by around 150 Workstations and 150 Compute (Render) Nodes.

Out of these 150 workstations, around 30 - 40 are MS Windows. The MS Windows 
clients access the storage through a 2-node Samba Gateway Cluster.

The Gateway Nodes are connected to the storage through a QDR Infiniband Network.

We were thinking of adding NFS Service to the Samba Gateway nodes, and 
reconfiguring the Linux clients to connect via this gateway.

This will bring down the direct Lustre Clients to just 2 nodes.
So, will having only 2 clients improve the failover-recovery time?
===

Is there anything else we can do to speed up recovery?

Regards,


Indivar Nair
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Lustre Server Sizing

2015-07-21 Thread Indivar Nair
Hi ...,

One of our customers has a 3 x 240 Disk SAN Storage Array and would like to
convert it to Lustre.

They have around 150 Workstations and around 200 Compute (Render) nodes.
The File Sizes they generally work with are -
1 to 1.5 million files (images) of 10-20MB in size.
And a few thousand files of 500-1000MB in size.

Almost 50% of the infra is on MS Windows or Apple MACs

I was thinking of the following configuration -
1 MDS
1 Failover MDS
3 OSS (failover to each other)
3 NFS+CIFS Gateway Servers
FDR Infiniband backend network (to connect the Gateways to Lustre)
Each Gateway Server will have 8 x 10GbE Frontend Network (connecting the
clients)

*Option A*
10+10 Disk RAID60 Array with 64KB Chunk Size i.e. 1MB Stripe Width
720 Disks / (10+10) = 36 Arrays.
12 OSTs per OSS
18 OSTs per OSS in case of Failover

*Option B*
10+10+10+10 Disk RAID60 Array with 128KB Chunk Size i.e. 4MB Stripe
Width
720 Disks / (10+10+10+10) = 18 Arrays
6 OSTs per OSS
9 OSTs per OSS in case of Failover
4MB RPC and I/O

*Questions*
1. Would it be better to let Lustre do most of the striping / file
distribution (as in Option A) OR would it be better to let the RAID
Controllers do it (as in Option B)

2. Will Option B allow us to use less CPU/RAM than Option A?
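(For question 1, the Lustre-side knob I have in mind is the file layout set with
lfs setstripe; a purely hypothetical illustration, with the path and values as
placeholders:)

    # Option A style: 1MB stripes spread across all OSTs by Lustre
    lfs setstripe -S 1M -c -1 /mnt/lustre/projects
    # Option B style: 4MB stripes, one OST per file, the RAID controller
    # does the spreading inside each OST
    lfs setstripe -S 4M -c 1 /mnt/lustre/projects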

Regards,


Indivar Nair
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] A quick question about reusing osts (lustre 2.5.3)

2015-07-21 Thread Kurt Strosahl
I was wondering because while looking at the man page for mkfs.lustre I saw the 
below option:

   --replace
  Used  to  initialize a target with the same --index as a 
previously used target if the old target was permanently lost for some reason 
(e.g. multiple disk failure or massive corruption).  This avoids
  having the target try to register as a new target with the MGS.
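So I am assuming reusing the index would look roughly like the following -- the
fsname, index, MGS NID and device are placeholders, not our real values:

    mkfs.lustre --ost --reformat --replace --fsname=lustre --index=23 \
        --mgsnode=172.16.0.10@tcp /dev/sdb

Is that the intended usage?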

w/r,
Kurt

- Original Message -
From: Ben Evans bev...@cray.com
To: Kurt Strosahl stros...@jlab.org, lustre-discuss@lists.lustre.org
Sent: Tuesday, July 21, 2015 10:17:47 AM
Subject: RE: A quick question about reusing osts (lustre 2.5.3)

I know you'd need to keep the config files, directory structures, etc.  How 
much of that info you need to keep around, I'm not 100% sure.

To get the MGS to accept it again, you may have to unmount and run 
tunefs.lustre --writeconf on all the targets.
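Roughly, and with the device paths below as placeholders only (check the manual
for the exact ordering before doing it live):

    # with clients and all targets unmounted:
    tunefs.lustre --writeconf /dev/mdt_mgs   # on the MDS/MGS
    tunefs.lustre --writeconf /dev/ostN      # on each OSS, for every OST
    mount -t lustre /dev/mdt_mgs /mnt/mdt    # remount MGS/MDT first, then the OSTs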

-Ben Evans

-Original Message-
From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf 
Of Kurt Strosahl
Sent: Tuesday, July 21, 2015 9:15 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] A quick question about reusing osts (lustre 2.5.3)

Hello,

   I had a quick question about recreating osts... If I drain all the files off 
an ost can I just reformat it and have it added back into lustre, in essence 
reusing the same index?  The server wouldn't change.  Or would I have to 
preserve its configuration files?

w/r,
Kurt
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ldiskfs ost size limit

2015-07-21 Thread Justin Miller
It is available in the Lustre Operations Manual, Table 1.1



https://build.hpdd.intel.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#idp162992


- Justin

On 7/21/15, 11:02 AM, lustre-discuss on behalf of Götz Waschk 
lustre-discuss-boun...@lists.lustre.org on behalf of goetz.was...@gmail.com 
wrote:

Thanks Ben,

is there a public document where I could have found this limit?

Regards, Götz

On Tue, Jul 21, 2015 at 4:33 PM, Ben Evans bev...@cray.com wrote:
 128 TB is the current limit

 You can force more than that, but it looks like you won't need to.

 -Ben Evans

 -Original Message-
 From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On 
 Behalf Of Götz Waschk
 Sent: Tuesday, July 21, 2015 10:18 AM
 To: lustre-discuss@lists.lustre.org
 Subject: [lustre-discuss] ldiskfs ost size limit

 Dear Lustre experts,

 I'm in the process of installing a new Lustre file system based on version 
 2.5. What is the size limit for an OST when using ldiskfs? Can I format a 60
 TB device with ldiskfs?

 Regards,
 Götz Waschk
 ___
 lustre-discuss mailing list
 lustre-discuss@lists.lustre.org
 http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



-- 
AL I:40: Do what thou wilt shall be the whole of the Law.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [HPDD-discuss] Lustre Server Sizing

2015-07-21 Thread Indivar Nair
Hi Scott,

Each of the 3 SAN Storages (240 disks each) has its own NAS Header (NAS
Appliance).
However, even with 240 10K RPM disks and RAID50, each is only providing around
1.2 - 1.4GB/s per NAS Header.

There is no clustered file system, and each NAS Header has its own
file-system.
It uses some custom mechanism to present the 3 file systems as single name
space.
But the directories have to be manually spread across for load-balancing.
As you can guess, this doesn't work most of the time.
Many times, most of the compute nodes access a single NAS Header,
overloading it.

The customer wants *at least* 9GB/s throughput from a single file-system.

But I think, if we architect the Lustre Storage correctly, with these many
disks, we should get at least 18GB/s throughput, if not more.

Regards,


Indivar Nair


On Tue, Jul 21, 2015 at 10:15 PM, Scott Nolin scott.no...@ssec.wisc.edu
wrote:

 An important question is what performance do they have now, and what do
 they expect if converting it to Lustre. Or, more basically, what are they
 looking for in general in making the change?

 The performance requirements may help drive your OSS numbers for example,
 or interconnect, and all kinds of stuff.

 Also I don't have a lot of experience with NFS/CIFS gateways, but that is
 perhaps its own topic and may need some close attention.

 Scott

 On 7/21/2015 10:57 AM, Indivar Nair wrote:

 Hi ...,

 One of our customers has a 3 x 240 Disk SAN Storage Array and would like
 to convert it to Lustre.

 They have around 150 Workstations and around 200 Compute (Render) nodes.
 The File Sizes they generally work with are -
 1 to 1.5 million files (images) of 10-20MB in size.
 And a few thousand files of 500-1000MB in size.

 Almost 50% of the infra is on MS Windows or Apple MACs

 I was thinking of the following configuration -
 1 MDS
 1 Failover MDS
 3 OSS (failover to each other)
 3 NFS+CIFS Gateway Servers
 FDR Infiniband backend network (to connect the Gateways to Lustre)
 Each Gateway Server will have 8 x 10GbE Frontend Network (connecting the
 clients)

 *Option A*
  10+10 Disk RAID60 Array with 64KB Chunk Size i.e. 1MB Stripe Width
  720 Disks / (10+10) = 36 Arrays.
  12 OSTs per OSS
  18 OSTs per OSS in case of Failover

 *Option B*
  10+10+10+10 Disk RAID60 Array with 128KB Chunk Size i.e. 4MB Stripe
 Width
  720 Disks / (10+10+10+10) = 18 Arrays
  6 OSTs per OSS
  9 OSTs per OSS in case of Failover
  4MB RPC and I/O

 *Questions*
 1. Would it be better to let Lustre do most of the striping / file
 distribution (as in Option A) OR would it be better to let the RAID
 Controllers do it (as in Option B)

 2. Will Option B allow us to use less CPU/RAM than Option A?

 Regards,


 Indivar Nair



 ___
 HPDD-discuss mailing list
 hpdd-disc...@lists.01.org
 https://lists.01.org/mailman/listinfo/hpdd-discuss




 ___
 lustre-discuss mailing list
 lustre-discuss@lists.lustre.org
 http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre Server Sizing

2015-07-21 Thread Patrick Farrell
Note the other email also seemed to suggest that multiple NFS exports of Lustre 
wouldn't work.  I don't think that's the case, as we have this sort of setup at 
a number of our customers without particular trouble.  In the abstract, I could 
see the possibility of some caching errors between different clients, but that 
would be only namespace stuff, not data.  And I think in practice that's ok.

But regardless, as Andreas said, for the Linux clients, Lustre directly will 
give much better results.

From: lustre-discuss [lustre-discuss-boun...@lists.lustre.org] on behalf of 
Dilger, Andreas [andreas.dil...@intel.com]
Sent: Tuesday, July 21, 2015 6:59 PM
To: Indivar Nair
Cc: hpdd-discuss; lustre-discuss
Subject: Re: [lustre-discuss] Lustre Server Sizing

Having only 3 OSS will limit the performance you can get, and having so many 
OSTs on each OSS will give sub-optimal performance. 4-6 OSTs/OSS is more 
reasonable.

It also isn't clear why you want RAID-60 instead of just RAID-10?

Finally, for Linux clients it is much better to use direct Lustre access 
instead of NFS as mentioned in another email.

Cheers, Andreas

On Jul 21, 2015, at 08:58, Indivar Nair 
indivar.n...@techterra.in wrote:

Hi ...,

One of our customers has a 3 x 240 Disk SAN Storage Array and would like to 
convert it to Lustre.

They have around 150 Workstations and around 200 Compute (Render) nodes.
The File Sizes they generally work with are -
1 to 1.5 million files (images) of 10-20MB in size.
And a few thousand files of 500-1000MB in size.

Almost 50% of the infra is on MS Windows or Apple MACs

I was thinking of the following configuration -
1 MDS
1 Failover MDS
3 OSS (failover to each other)
3 NFS+CIFS Gateway Servers
FDR Infiniband backend network (to connect the Gateways to Lustre)
Each Gateway Server will have 8 x 10GbE Frontend Network (connecting the 
clients)

Option A
10+10 Disk RAID60 Array with 64KB Chunk Size i.e. 1MB Stripe Width
720 Disks / (10+10) = 36 Arrays.
12 OSTs per OSS
18 OSTs per OSS in case of Failover

Option B
10+10+10+10 Disk RAID60 Array with 128KB Chunk Size i.e. 4MB Stripe Width
720 Disks / (10+10+10+10) = 18 Arrays
6 OSTs per OSS
9 OSTs per OSS in case of Failover
4MB RPC and I/O

Questions
1. Would it be better to let Lustre do most of the striping / file distribution 
(as in Option A) OR would it be better to let the RAID Controllers do it (as in 
Option B)

2. Will Option B allow us to use less CPU/RAM than Option A?

Regards,


Indivar Nair

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [HPDD-discuss] Lustre Server Sizing

2015-07-21 Thread Cowe, Malcolm J
I’ve seen CTDB + Samba deployed on several sites running Lustre. It’s stable in 
my experience, and straightforward to get installed and set up, although the 
process is time-consuming. The most significant hurdle is integrating with AD 
and maybe load balancing for the CTDB servers (RR DNS is the easiest and most 
common solution).

Performance is not nearly as good as for native Lustre client (apart from 
anything else, IIRC, SMB is a “chatty” protocol, esp with xattrs?). One 
downside of CTDB is that the Lustre client must be mounted with -o flock in order
for the recovery lock manager to work. Each individual connection to Samba from 
a Windows client is limited to the bandwidth and single thread performance of 
the CTDB node. Clients remain connected to a single CTDB node for the duration 
of their session, so there is a possibility of an imbalance in connections over 
time. Load balancing is strictly round-robin through DNS lookups, unless a more 
sophisticated load balancer is placed in front of the CTDB cluster.
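For what it's worth, the gateway-side pieces look roughly like this -- the MGS
NID, fsname, paths and export network are placeholders, and the exports line
assumes a plain kernel NFS server rather than Ganesha:

    # Lustre client mount on the CTDB/Samba/NFS gateway node
    mount -t lustre 10.0.0.1@o2ib:/lustre /mnt/lustre -o flock
    # /etc/exports entry; an explicit fsid is usually needed for Lustre
    /mnt/lustre   192.168.0.0/24(rw,async,no_root_squash,fsid=101)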

There are references to CTDB + NFS / Ganesha as well but I haven’t had an 
opportunity to try it out. Most of the demand for non-native client access to 
Lustre involves Windows machines.

Malcolm.


From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf 
Of Jeff Johnson
Sent: Wednesday, July 22, 2015 5:54 AM
To: Indivar Nair
Cc: lustre-discuss
Subject: Re: [lustre-discuss] [HPDD-discuss] Lustre Server Sizing

Indivar,

Since your CIFS or NFS gateways operate as Lustre clients there can be issues 
with running multiple NFS or CIFS gateway machines frontending the same Lustre 
filesystem. As Lustre clients there are no issues in terms of file locking but 
the NFS and CIFS caching and multi-client file access mechanics don't interface 
with Lustre's file locking mechanics. Perhaps that may have changed recently 
and a developer on the list may comment on developments there. So while you 
could provide client access through multiple NFS or CIFS gateway machines there 
would not be much in the way of file locking protection. There is a way to 
configure pCIFS with CTDB and get close to what you envision with Samba. I did 
that configuration once as a proof of concept (no valuable data). It is a 
*very* complex configuration and based on the state of software when I did it I 
wouldn't say it was a production grade environment.

As I said before, my understanding may be a year out of date and someone else 
could speak to the current state of things. Hopefully that would be a better 
story.

--Jeff



On Tue, Jul 21, 2015 at 10:26 AM, Indivar Nair 
indivar.n...@techterra.in wrote:
Hi Scott,

Each of the 3 SAN Storages (240 disks each) has its own NAS Header (NAS
Appliance).
However, even with 240 10K RPM disks and RAID50, each is only providing around
1.2 - 1.4GB/s per NAS Header.
There is no clustered file system, and each NAS Header has its own file-system.
It uses some custom mechanism to present the 3 file systems as single name 
space.
But the directories have to be manually spread across for load-balancing.
As you can guess, this doesn't work most of the time.
Many times, most of the compute nodes access a single NAS Header, overloading
it.

The customer wants *at least* 9GB/s throughput from a single file-system.

But I think, if we architect the Lustre Storage correctly, with these many 
disks, we should get at least 18GB/s throughput, if not more.

Regards,

Indivar Nair


On Tue, Jul 21, 2015 at 10:15 PM, Scott Nolin 
scott.no...@ssec.wisc.edu wrote:
An important question is what performance do they have now, and what do they 
expect if converting it to Lustre. Or, more basically, what are they looking
for in general in making the change?

The performance requirements may help drive your OSS numbers for example, or 
interconnect, and all kinds of stuff.

Also I don't have a lot of experience with NFS/CIFS gateways, but that is 
perhaps its own topic and may need some close attention.

Scott

On 7/21/2015 10:57 AM, Indivar Nair wrote:
Hi ...,

One of our customers has a 3 x 240 Disk SAN Storage Array and would like
to convert it to Lustre.

They have around 150 Workstations and around 200 Compute (Render) nodes.
The File Sizes they generally work with are -
1 to 1.5 million files (images) of 10-20MB in size.
And a few thousand files of 500-1000MB in size.

Almost 50% of the infra is on MS Windows or Apple MACs

I was thinking of the following configuration -
1 MDS
1 Failover MDS
3 OSS (failover to each other)
3 NFS+CIFS Gateway Servers
FDR Infiniband backend network (to connect the Gateways to Lustre)
Each Gateway Server will have 8 x 10GbE Frontend Network (connecting the
clients)

*Option A*
 10+10 Disk RAID60 Array with 64KB Chunk Size i.e. 1MB Stripe Width
 720 Disks / (10+10) = 36 Arrays.
 12 OSTs per OSS
 18 OSTs per OSS in case of Failover

*Option B*
 10+10+10+10 Disk RAID60 

Re: [lustre-discuss] [HPDD-discuss] Lustre Server Sizing

2015-07-21 Thread Jeff Johnson
Indivar,

Since your CIFS or NFS gateways operate as Lustre clients there can be
issues with running multiple NFS or CIFS gateway machines frontending the
same Lustre filesystem. As Lustre clients there are no issues in terms of
file locking but the NFS and CIFS caching and multi-client file access
mechanics don't interface with Lustre's file locking mechanics. Perhaps
that may have changed recently and a developer on the list may comment on
developments there. So while you could provide client access through
multiple NFS or CIFS gateway machines there would not be much in the way of
file locking protection. There is a way to configure pCIFS with CTDB and
get close to what you envision with Samba. I did that configuration once as
a proof of concept (no valuable data). It is a *very* complex configuration
and based on the state of software when I did it I wouldn't say it was a
production grade environment.

As I said before, my understanding may be a year out of date and someone
else could speak to the current state of things. Hopefully that would be a
better story.

--Jeff



On Tue, Jul 21, 2015 at 10:26 AM, Indivar Nair indivar.n...@techterra.in
wrote:

 Hi Scott,

 Each of the 3 SAN Storages (240 disks each) has its own NAS Header (NAS
 Appliance).
 However, even with 240 10K RPM disks and RAID50, each is only providing
 around 1.2 - 1.4GB/s per NAS Header.

 There is no clustered file system, and each NAS Header has its own
 file-system.
 It uses some custom mechanism to present the 3 file systems as single name
 space.
 But the directories have to be manually spread across for load-balancing.
 As you can guess, this doesn't work most of the time.
 Many times, most of the compute nodes access a single NAS Header,
 overloading it.

 The customer wants *at least* 9GB/s throughput from a single file-system.

 But I think, if we architect the Lustre Storage correctly, with these many
 disks, we should get at least 18GB/s throughput, if not more.

 Regards,


 Indivar Nair


 On Tue, Jul 21, 2015 at 10:15 PM, Scott Nolin scott.no...@ssec.wisc.edu
 wrote:

 An important question is what performance do they have now, and what do
 they expect if converting it to Lustre. Or, more basically, what are they
 looking for in general in making the change?

 The performance requirements may help drive your OSS numbers for example,
 or interconnect, and all kinds of stuff.

 Also I don't have a lot of experience with NFS/CIFS gateways, but that is
 perhaps its own topic and may need some close attention.

 Scott

 On 7/21/2015 10:57 AM, Indivar Nair wrote:

 Hi ...,

 One of our customers has a 3 x 240 Disk SAN Storage Array and would like
 to convert it to Lustre.

 They have around 150 Workstations and around 200 Compute (Render) nodes.
 The File Sizes they generally work with are -
 1 to 1.5 million files (images) of 10-20MB in size.
 And a few thousand files of 500-1000MB in size.

 Almost 50% of the infra is on MS Windows or Apple MACs

 I was thinking of the following configuration -
 1 MDS
 1 Failover MDS
 3 OSS (failover to each other)
 3 NFS+CIFS Gateway Servers
 FDR Infiniband backend network (to connect the Gateways to Lustre)
 Each Gateway Server will have 8 x 10GbE Frontend Network (connecting the
 clients)

 *Option A*
  10+10 Disk RAID60 Array with 64KB Chunk Size i.e. 1MB Stripe Width
  720 Disks / (10+10) = 36 Arrays.
  12 OSTs per OSS
  18 OSTs per OSS in case of Failover

 *Option B*
  10+10+10+10 Disk RAID60 Array with 128KB Chunk Size i.e. 4MB Stripe
 Width
  720 Disks / (10+10+10+10) = 18 Arrays
  6 OSTs per OSS
  9 OSTs per OSS in case of Failover
  4MB RPC and I/O

 *Questions*
 1. Would it be better to let Lustre do most of the striping / file
 distribution (as in Option A) OR would it be better to let the RAID
 Controllers do it (as in Option B)

  2. Will Option B allow us to use less CPU/RAM than Option A?

 Regards,


 Indivar Nair



 ___
 HPDD-discuss mailing list
 hpdd-disc...@lists.01.org
 https://lists.01.org/mailman/listinfo/hpdd-discuss




 ___
 lustre-discuss mailing list
 lustre-discuss@lists.lustre.org
 http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



 ___
 lustre-discuss mailing list
 lustre-discuss@lists.lustre.org
 http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org




-- 
--
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite D - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] How Lustre stores hyperslabs and chunks of HDF5?

2015-07-21 Thread Dilger, Andreas
Currently there is no direct connection between Lustre layout and HDF5 file 
layout.  The only option is RAID-0 striping across OST objects with a fixed 
stripe size. If HDF5 is aware of this stripe size and can take advantage of it, 
that is great.

There is a project that has started to implement Progressive File Layout (PFL) 
that allows different extents of a file to have different stripe counts and 
stripe sizes, which could be leveraged by libraries like HDF5 in the future.

See 
http://cdn.opensfs.org/wp-content/uploads/2015/04/Progressive-File-Layouts_Hammond.pdf
 and/or https://www.youtube.com/watch?v=5rm6Nlmqdp0 for more details on PFL 
prototype development.
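The layout of any given file can be inspected (or set, before the file is
written) with the lfs tool; a hypothetical example with placeholder paths and
sizes:

    lfs setstripe -S 8M -c 4 /mnt/lustre/output.h5   # stripe size / count for a new file
    lfs getstripe /mnt/lustre/output.h5              # shows the stripe size and which OST holds each object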

Cheers, Andreas

On Jul 19, 2015, at 20:37, 
prakrati.agra...@shell.com wrote:

Hi,

I wanted to understand how Lustre stores the chunks and hyperslabs in HDF5 
framework on the OSTs?
If I set the chunk size and each rank writes a hyperslab, does OST0 get chunk0,
OST1 get chunk1, and so on, or does OST0 get hyperslab0, OST1 get hyperslab1,
and so on?
Is there any way of finding that out?

Thanks and Regards,
Prakrati
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Speeding up recovery

2015-07-21 Thread Dilger, Andreas
I believe this is described in the Lustre Manual, but the basic process to 
split a combined MDS+MGS into a separate MGS is to format a new MGS device, 
then copy all the files from CONFIGS on the old combined MDT+MGT device into 
the new MGS. See the manual for full details.
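A rough sketch of that procedure, with device names and mount points as
placeholders (take a backup and consult the manual before trying it on a live
system):

    mkfs.lustre --mgs /dev/new_mgt                # format the new standalone MGT
    mount -t ldiskfs /dev/combined_mdt /mnt/old   # the existing MDT+MGT
    mount -t ldiskfs /dev/new_mgt /mnt/new
    cp -a /mnt/old/CONFIGS/* /mnt/new/CONFIGS/    # copy the configuration logs
    umount /mnt/old /mnt/new

The MDT and OSTs then need to be pointed at the new MGS NID, e.g. with
tunefs.lustre --mgsnode=... on each target.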

Cheers, Andreas

On Jul 21, 2015, at 01:27, Indivar Nair 
indivar.n...@techterra.in wrote:

Hi ...,

Currently, Failover and Recovery take a very long time in our setup;
almost 20 Minutes. We would like to make it as fast as possible.

I have two queries regarding this -

1.
===
The MGS and MDT are on the same host.

We do however have a passive stand-by server for the MGS/MDT server, which only 
mounts these partitions in case of a failure.

Current Setup
Server A: MGS+MDT
Server B: Failover MGS+MDT

I was wondering whether I can now move the MGS or MDT Partition to the standby 
server (so that imperative recovery works properly) -

New Setup
Server A: MDT  Failover MGS
Server B: MGS  Failover MDT
   OR
Server A: MGS  Failover MDT
Server B: MDT  Failover MGS

i.e.
Can I separate the MDT and MGS partitions on to different machines without 
formatting or reinstalling Lustre?
===

2.
===
This storage is used by around 150 Workstations and 150 Compute (Render) Nodes.

Out of these 150 workstations, around 30 - 40 are MS Windows. The MS Windows 
clients access the storage through a 2-node Samba Gateway Cluster.

The Gateway Nodes are connected to the storage through a QDR Infiniband Network.

We were thinking of adding NFS Service to the Samba Gateway nodes, and 
reconfiguring the Linux clients to connect via this gateway.

This will bring down the direct Lustre Clients to just 2 nodes.
So, will having only 2 clients improve the failover-recovery time?
===

Is there anything else we can do to speed up recovery?

Regards,


Indivar Nair
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre Server Sizing

2015-07-21 Thread Dilger, Andreas
Having only 3 OSS will limit the performance you can get, and having so many 
OSTs on each OSS will give sub-optimal performance. 4-6 OSTs/OSS is more 
reasonable.

It also isn't clear why you want RAID-60 instead of just RAID-10?

Finally, for Linux clients it is much better to use direct Lustre access 
instead of NFS as mentioned in another email.

Cheers, Andreas

On Jul 21, 2015, at 08:58, Indivar Nair 
indivar.n...@techterra.in wrote:

Hi ...,

One of our customers has a 3 x 240 Disk SAN Storage Array and would like to 
convert it to Lustre.

They have around 150 Workstations and around 200 Compute (Render) nodes.
The File Sizes they generally work with are -
1 to 1.5 million files (images) of 10-20MB in size.
And a few thousand files of 500-1000MB in size.

Almost 50% of the infra is on MS Windows or Apple MACs

I was thinking of the following configuration -
1 MDS
1 Failover MDS
3 OSS (failover to each other)
3 NFS+CIFS Gateway Servers
FDR Infiniband backend network (to connect the Gateways to Lustre)
Each Gateway Server will have 8 x 10GbE Frontend Network (connecting the 
clients)

Option A
10+10 Disk RAID60 Array with 64KB Chunk Size i.e. 1MB Stripe Width
720 Disks / (10+10) = 36 Arrays.
12 OSTs per OSS
18 OSTs per OSS in case of Failover

Option B
10+10+10+10 Disk RAID60 Array with 128KB Chunk Size i.e. 4MB Stripe Width
720 Disks / (10+10+10+10) = 18 Arrays
6 OSTs per OSS
9 OSTs per OSS in case of Failover
4MB RPC and I/O

Questions
1. Would it be better to let Lustre do most of the striping / file distribution 
(as in Option A) OR would it be better to let the RAID Controllers do it (as in 
Option B)

2. Will Option B allow us to use less CPU/RAM than Option A?

Regards,


Indivar Nair

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Speeding up recovery

2015-07-21 Thread Indivar Nair
Hi ...,

Currently, Failover and Recovery take a very long time in our setup;
almost 20 Minutes. We would like to make it as fast as possible.

I have two queries regarding this -

1.
===
The MGS and MDT are on the same host.

We do however have a passive stand-by server for the MGS/MDT server, which
only mounts these partitions in case of a failure.

*Current Setup*
Server A: MGS+MDT
Server B: Failover MGS+MDT

I was wondering whether I can now move the MGS or MDT Partition to the
standby server (so that imperative recovery works properly) -

*New Setup*
Server A: MDT  *Failover MGS*
Server B: *MGS*  Failover MDT

*OR*
Server A: *MGS*  Failover MDT
Server B: MDT  *Failover MGS*

i.e.

*Can I separate the MDT and MGS partitions on to different machines without
formatting or reinstalling Lustre?*
===

2.
===
This storage is used by around 150 Workstations and 150 Compute (Render)
Nodes.

Out of these 150 workstations, around 30 - 40 are MS Windows. The MS
Windows clients access the storage through a 2-node Samba Gateway Cluster.

The Gateway Nodes are connected to the storage through a QDR Infiniband
Network.

We were thinking of adding NFS Service to the Samba Gateway nodes, and
reconfiguring the Linux clients to connect via this gateway.

This will bring down the direct Lustre Clients to just 2 nodes.
*So, will having only 2 clients improve the failover-recovery time?*
===

Is there anything else we can do to speed up recovery?

Regards,


Indivar Nair
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] A quick question about reusing osts (lustre 2.5.3)

2015-07-21 Thread Kurt Strosahl
Hello,

   I had a quick question about recreating osts... If I drain all the files off 
an ost can I just reformat it and have it added back into lustre, in essence 
reusing the same index?  The server wouldn't change.  Or would I have to 
preserve its configuration files?
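For context, the way I was planning to drain it first -- the OST name, device
number and paths below are placeholders:

    lctl dl | grep OST0007                # on the MDS, find the device number
    lctl --device 13 deactivate           # stop new object allocations on that OST
    lfs find --ost lustre-OST0007_UUID /mnt/lustre | lfs_migrate -y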

w/r,
Kurt
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] A quick question about reusing osts (lustre 2.5.3)

2015-07-21 Thread Ben Evans
I know you'd need to keep the config files, directory structures, etc.  How 
much of that info you need to keep around, I'm not 100% sure.

To get the MGS to accept it again, you may have to unmount and run 
tunefs.lustre --writeconf on all the targets.

-Ben Evans

-Original Message-
From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf 
Of Kurt Strosahl
Sent: Tuesday, July 21, 2015 9:15 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] A quick question about reusing osts (lustre 2.5.3)

Hello,

   I had a quick question about recreating osts... If I drain all the files off 
an ost can I just reformat it and have it added back into lustre, in essence 
reusing the same index?  The server wouldn't change.  Or would I have to 
preserve its configuration files?

w/r,
Kurt
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] ldiskfs ost size limit

2015-07-21 Thread Götz Waschk
Dear Lustre experts,

I'm in the process of installing a new Lustre file system based on
version 2.5. What is the size limit for an OST when using ldiskfs? Can
I format a 60 TB device with ldiskfs?

Regards,
Götz Waschk
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ldiskfs ost size limit

2015-07-21 Thread Ben Evans
128 TB is the current limit

You can force more than that, but it looks like you won't need to.

-Ben Evans

-Original Message-
From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf 
Of Götz Waschk
Sent: Tuesday, July 21, 2015 10:18 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] ldiskfs ost size limit

Dear Lustre experts,

I'm in the process of installing a new Lustre file system based on version 2.5. 
What is the size limit for an OST when using ldiskfs? Can I format a 60 TB
device with ldiskfs?

Regards,
Götz Waschk
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org