[lustre-discuss] Is IML abandoned?

2022-10-06 Thread Scott Nolin via lustre-discuss
I see the whamcloud repo looks abandoned and I don't really see any forks or
other information out there. Does anyone know if this project has been abandoned?

https://github.com/whamcloud/integrated-manager-for-lustre

Scott
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre [2.8.0] and the Linux Automounter

2017-06-21 Thread Scott Nolin
We use them fairly extensively with 2.7 and 2.8 and find them useful and stable.
 
For cluster nodes it is perhaps not a big deal, but for our many non-cluster
clients it is useful. We have multiple filesystems, so sometimes many mounts.
The main benefit, I think, is not that idle filesystems get unmounted, but that
if there are ever issues and some filesystems are temporarily unavailable,
there is much, much less pain.
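
As a rough illustration of the kind of map we mean (names, NIDs, and the
timeout here are made up, and the exact map syntax may need adjusting for your
autofs version):

# /etc/auto.master
/lustre  /etc/auto.lustre  --timeout=600

# /etc/auto.lustre
data  -fstype=lustre,flock  mgs01@o2ib:/data

Accessing /lustre/data then triggers the mount, and autofs unmounts it again
after the idle timeout.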
 
Scott
 
From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf 
Of Chad DeWitt
Sent: Monday, June 19, 2017 8:52 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Lustre [2.8.0] and the Linux Automounter
 
Good morning, All.
 
We are considering using Lustre [2.8.0] with the Linux automounter.  Is anyone 
using this combination successfully?  Are there any caveats?  
 
(I did check JIRA, but only found two tickets concerning 1.x Lustre.)
 
Thank you in advance,
Chad



Chad DeWitt, CISSP | HPC Storage Administrator
UNC Charlotte | ITS – University Research Computing



smime.p7s
Description: S/MIME cryptographic signature
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lshowmount equivalent?

2015-12-14 Thread Scott Nolin


On 12/14/2015 12:43 AM, Dilger, Andreas wrote:
...

Is this a tool that you are using?  IIRC, there wasn't a particular reason
that it was removed, except that when we asked LLNL (the authors) they
said they were no longer using it, and we couldn't find anyone that was
using it so it was removed in commit b5a7260ae8f along with a bunch of
other old tools.


Thanks for the reply - indeed we were using it. We don't use it daily, 
but for certain tasks it is really convenient.




If there is a demand for lshowmount I don't think it would be hard to
reinstate.



If it makes more sense for it to be a separate tool outside the lustre 
code base, that'd be fine too I think.
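
In the meantime, the raw data lshowmount summarized still seems to be there
under the per-export proc entries on the servers. Something like this (the
parameter patterns are a guess from our systems; use mdt.* on an MDS and
obdfilter.* on an OSS) lists the client UUID for each export NID:

lctl get_param mdt.*.exports.*.uuid
lctl get_param obdfilter.*.exports.*.uuid

It's not as tidy as lshowmount's summary, but it answers the "who has this
mounted" question.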


Thanks,
Scott




smime.p7s
Description: S/MIME Cryptographic Signature
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] lshowmount equivalent?

2015-12-13 Thread Scott Nolin
We noticed that with recent versions of Lustre, lshowmount has disappeared. 
Annoyingly, it's still in the Lustre docs; this is at least noted in a 
ticket:


https://build.hpdd.intel.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#dbdoclet.50438219_64286

https://jira.hpdd.intel.com/browse/LUDOC-308

It's one of those tools that is very handy if you know about it - or was, 
while it was there...


Is there a similar new command that just isn't documented, or something?

Thanks,
Scott
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Large-scale UID/GID changes via lfs

2015-10-15 Thread Scott Nolin
We use the Robinhood policy engine to do this for GIDs, to provide a sort of 
fake group quota.


If you have to do this routinely, look into robinhood.
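
For a one-off change, a sketch along these lines may also help (user and group
names here are hypothetical; /storage is the mount point from Megan's mail).
lfs find does the enumeration through the MDS rather than stat-ing every
object on the OSTs, though the chown/chgrp calls themselves are still ordinary
metadata updates:

lfs find /storage --type f --user olduser | while read -r f; do
    chown newuser "$f"
done

The same pattern with --group and chgrp works for GIDs. (This simple loop
assumes no newlines in file names.)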

Regards,
Scott


On 10/14/2015 9:52 PM, Ms. Megan Larko wrote:

Hello,

I have been able to successfully use "lfs lsetfacl ." to set and
modify permissions on a Lustre file system quickly with a small system
because the lfs command is directed at the Lustre MDT.  It is similar, I
imagine, to using "lfs find..." to search a Lustre fs compared with a
*nix "find..." command, the latter of which must touch every stripe
located on any OST.

So, how do I change UID and/or GID over a Lustre file system?  Doing a
*nix find and chown seems to have the same detrimental performance.

 >lfs lgetfacl my.file
The above returns the file ACL info.  I can change permissions and add a
group or user access/perm, but I don't know how to change the "header"
information. (To see the difference in header information, one could try
"lfs lgetfacl --no-head my.file" which shows the ACL info without the
header.)

 >lfs lsetfacl -muser:newPerson:rwx my.file
The above adds user with those perms to the original user listed in the
header info.

This is using Lustre version 2.6.x (forgot minor number). on RHEL 6.5.

Suggestions genuinely appreciated.
Cheers,
megan



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org






smime.p7s
Description: S/MIME Cryptographic Signature
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lustre client server interoperability

2015-08-12 Thread Scott Nolin
I'd just add that we've generally been OK running a variety of 2.x 
servers with 2.whatever clients.


For our latest project I hear we're going to try 2.7 server and 2.8 
client. The client machines for us are much more likely to need OS 
versions pushed forward.


Regarding interim patches to Lustre, my feeling is the important thing 
is to simply know which patches are critical. I believe all the patches 
are still public from Intel (and what about other people providing 
Lustre patches?).


There has been some discussion about sharing information on people's 
patch sets on wiki.lustre.org, but I haven't seen anything come out of it.


Patrick, is Cray providing public maintenance releases? Or sharing 
information on important patches?


Scott


On 8/12/2015 7:32 AM, Patrick Farrell wrote:

Jon,

You've got the interop right.

Unfortunately, Intel is no longer doing public maintenance versions of Lustre, 
so 2.8 will not receive updates after release.

- Patrick

From: Jon Tegner [jon.teg...@foi.se]
Sent: Wednesday, August 12, 2015 1:16 AM
To: Patrick Farrell; Kurt Strosahl
Cc: lustre-discuss@lists.lustre.org; Jan Pettersson
Subject: SV: lustre client server interoperability

So if I understand correctly one has the following centos options:

1. Lustre-2.5.3 with CentOS-6 on both clients and servers.
2. Lustre-2.5.3, CentOS-6 on servers, and 2.7.0 and CentOS-7 on clients.
3. Wait a while and use Lustre-2.8.0/CentOS-7 on clients and servers.

At least on clients I would prefer to run CentOS-7, but if 2.7 (and 2.8 - will 
this version receive updates?) are less reliable that might not be a good idea?

Any thoughts on this would be greatly appreciated.

Thanks!

/jon


From: lustre-discuss lustre-discuss-boun...@lists.lustre.org on behalf of Patrick Farrell 
p...@cray.com
Sent: 11 August 2015 21:23
To: Kurt Strosahl
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] lustre client server interoperability

No - 2.5 is the last public stable client release from Intel.

On 8/11/15, 2:22 PM, Kurt Strosahl stros...@jlab.org wrote:


So is there a stable client for centos 7 that is backwards compatible
with 2.5.3?

w/r,
Kurt

- Original Message -
From: Patrick Farrell p...@cray.com
To: Kurt Strosahl stros...@jlab.org, lustre-discuss@lists.lustre.org
Sent: Monday, August 10, 2015 4:24:15 PM
Subject: RE: lustre client server interoperability

Kurt,

Yes.  It's worth noting that 2.7 is probably marginally less reliable
than 2.5, since it has had no updates/fixes since it was released.

From: lustre-discuss [lustre-discuss-boun...@lists.lustre.org] on behalf
of Kurt Strosahl [stros...@jlab.org]
Sent: Monday, August 10, 2015 2:25 PM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] lustre client server interoperability

Hello,

   Is the 2.7 lustre client compatible with lustre 2.5.3 servers?  I'm
running a 2.5.3 lustre system, but have been asked by a few people
about upgrading some of our clients to CentOS 7 (which appears to need a
2.7 or greater client).

w/r,
Kurt J. Strosahl
System Administrator
Scientific Computing Group, Thomas Jefferson National Accelerator Facility
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org






smime.p7s
Description: S/MIME Cryptographic Signature
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Size difference between du and quota

2015-05-20 Thread Scott Nolin
Another thing to think about: does he perhaps own files outside of his 
directory? The quota applies to the whole filesystem, but you are only doing 
du on the directory.


Even if he's not aware of it, things can happen like people using rsync 
and preserving ownership. The original owner's usage then goes up.
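
If it helps, something along these lines (using the username and mount point
from Phill's mail; the exact flags depend on your lfs version) counts
everything the user owns anywhere on the filesystem, not just under his
directory:

lfs find /storage --type f --user username | wc -l
lfs quota -u username /storage

If that count is much larger than the 581519 files found under his directory,
stray ownership is probably the explanation.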


Scott

On 5/20/2015 3:50 AM, Phill Harvey-Smith wrote:

Hi all,

One of my users is reporting a massive size difference between the
figures reported by du and quota.

doing a du -hs on his directory reports :
  du -hs .
529G.

doing a lfs quota -u username /storage reports
Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
/storage 621775192  64000 64001   -  601284  100 110   -

Though this user does have a lot of files :

find . -type f | wc -l
581519

So I suspect that it is the typical thing that quota is reporting used
blocks whilst du is reporting used bytes, which can of course be wildly
different due to filesystem overhead and wasted unused space at the end
of files where a block is allocated but only partially used.

Is this likely to be the case ?

I'm also not entirely sure what versions of lustre the client machines
and MDS / OSS servers are running, as I didn't initially set the system up.

Cheers.

Phill.

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org





smime.p7s
Description: S/MIME Cryptographic Signature
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] OpenSFS / EOFS Presentations index

2015-05-11 Thread Scott Nolin
It would be really convenient if all the presentations from the various LUG, 
LAD, and similar meetings were available on one page.


Ideally there would also be some kind of keywords for each presentation 
for easy searches, but even just having a comprehensive list of links 
would be valuable I think.


Scott
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] zfs -- mds/mdt -- ssd model / type recommendation

2015-05-05 Thread Scott Nolin
I just want to second what Rick said - it's create/remove, not stat, of 
files where there are performance penalties. We covered this issue for 
our workload just by using SSDs for our MDT, when normally we'd just 
use fast SAS drives.


A bigger deal for us was RAM on the server, and improvements with SPL 0.6.3+

Scott



It's Lustre on ZFS, especially for metadata operations that create,
modify, or remove inodes. Native ZFS metadata operations are much
faster than what Lustre on ZFS is currently providing. That said,
we've gone with a ZFS-based patchless MDS, since read operations have
always been more critical for us, and our performance is more than
adequate.

--Rick





smime.p7s
Description: S/MIME Cryptographic Signature
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] jobstats, SLURM_JOB_ID, array jobs and pain.

2015-04-30 Thread Scott Nolin
Has anyone been working with the Lustre jobstats feature and SLURM? We 
have been, and it's OK. But now that I'm working on systems that run a 
lot of array jobs and a fairly recent SLURM version, we've found some ugly 
stuff.


Array jobs do report their SLURM_JOB_ID as a variable, and it's unique 
for every job. But they use other IDs too that appear only for array jobs.


http://slurm.schedmd.com/job_array.html

However, that unique SLURM_JOB_ID, as far as I can tell, is only truly 
exposed in command line tools via 'scontrol', which is only valid while 
the job is running. If you want to look at older jobs with sacct, for 
example, things are troublesome.


Here's what my coworker and I have figured out:

- You submit a (non-array) job that gets jobid 100.
- The next job gets jobid 101.
- Then submit a 10-task array job. That gets jobid 102. The sub tasks 
get 9 more job ids. If nothing else is happening with the system, that 
means you use jobids 102 to 111.


If things were that orderly, you could cope with using SLURM_JOB_ID in 
lustre jobstats pretty easily. Use sacct and you see job 102_2 - you 
know that is jobid 103 in lustre jobstats.


But, if other jobs get submitted during set up (as of course they do), 
they can take jobid 103. So, you've got problems.


I think we may try to set a magic variable in the slurm prolog and use 
that for the jobstats_var, but who knows.
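
As a very rough sketch of that idea (untested, and the variable name, fsname,
and mapping are assumptions, not something we've deployed): SLURM's TaskProlog
can inject an environment variable into the job by printing export lines, e.g.

#!/bin/bash
# TaskProlog: lines of the form "export NAME=value" on stdout are added
# to the job environment by SLURM.
if [ -n "$SLURM_ARRAY_JOB_ID" ]; then
    # array task: make the ID unambiguous, e.g. 102_2
    echo "export LUSTRE_JOBID=${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}"
else
    echo "export LUSTRE_JOBID=${SLURM_JOB_ID}"
fi

and then jobstats would be pointed at that variable instead of SLURM_JOB_ID,
something like lctl conf_param <fsname>.sys.jobid_var=LUSTRE_JOBID on the MGS.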


Scott



smime.p7s
Description: S/MIME Cryptographic Signature
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] ZFS pages on lustre.wiki.org

2015-04-30 Thread Scott Nolin

Chris and others,

I have moved my ZFS notes to the lustre wiki and made a category -

http://wiki.lustre.org/Category:ZFS

I am hoping others will post their notes or useful tips, and have talked 
to a few people.


Can you make a top level link to the category?

Thank you,
Scott
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] OST partition sizes

2015-04-29 Thread Scott Nolin

Ok I looked up my notes.

I'm not really sure what you mean by record size. I assumed when I do a 
file per process the block size = file size. And that's what I see 
dropped on the filesystem.


I did -F -b <size>

With block sizes 1MB, 20MB, 100MB, 200MB, 500MB

2, 4, 8, 16 threads on 1 to 4 clients.

I assumed 2 threads on 1 client looks a lot like a client writing or 
reading 2 files. I didn't bother looking at 1 thread.


Later I just started doing 100MB tests since it's a very common file 
size for us. Plus I didn't see a real big difference once the size gets 
bigger than that.


Scott


On 4/29/2015 10:24 AM, Alexander I Kulyavtsev wrote:

What range of record sizes did you use for IOR? This is more important
than file size.
100MB is small, overall data size (# of files) shall be twice as memory.
I ran series of test for small record size for raidz2 10+2; will re-run
some tests after upgrading to 0.6.4.1 .

Single file performance differs substantially from file per process.

Alex.

On Apr 29, 2015, at 9:38 AM, Scott Nolin scott.no...@ssec.wisc.edu
mailto:scott.no...@ssec.wisc.edu wrote:


I used IOR, singlefile, 100MB files. That's the most important
workload for us. I tried several different file sizes, but 100MB
seemed a reasonable compromise for what I see the most. We rarely or
never do file striping.

I remember I did see a difference between 10+2 and 8+2. Especially at
smaller numbers of clients and threads, the 8+2 performance numbers
were more consistent, made a smoother curve. 10+2 with not a lot of
threads the performance was more variable.







smime.p7s
Description: S/MIME Cryptographic Signature
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] OST partition sizes

2015-04-29 Thread Scott Nolin
Ah, I used 256K xfersize for all my tests. 1MB would probably be a 
better test.
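
For reference, a file-per-process run along the lines described above might
look something like this (POSIX API; the path, task count, and iteration
count are made up):

mpirun -np 8 IOR -a POSIX -F -b 100m -t 1m -i 3 -o /mnt/lfs/iotest/ior-fpp

where -F is file-per-process, -b the per-task file size, and -t the transfer
(record) size.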


Scott

On 4/29/2015 11:38 AM, Alexander I Kulyavtsev wrote:

ior/bin/IOR.mpiio.mvapich2-2.0b -h

-t N  transferSize -- size of transfer in bytes (e.g.: 8, 4k, 2m, 1g)

IOR reports it in the log :

Command line used:
/home/aik/lustre/benchmark/git/ior/bin/IOR.mpiio.mvapich2-2.0b -v -a
MPIIO -i5 -g -e -w -r -b 16g -C -t 8k -o
/mnt/lfs/admin/iotest/ior/stripe_2/ior-testfile.ssf
...
Summary:

 api= MPIIO (version=3, subversion=0)
 test filename  =
/mnt/lfs/admin/iotest/ior/stripe_2/ior-testfile.ssf
 access = single-shared-file, independent
 pattern= segmented (1 segment)
 ordering in a file = sequential offsets
 ordering inter file=constant task offsets = 1
 clients= 32 (8 per node)
 repetitions= 5
 xfersize   = 8192 bytes
 blocksize  = 16 GiB
 aggregate filesize = 512 GiB

Here we have xfersize 8k, each client of 32 writes 16GB, so the
aggregate file size is 512GB.

I would expect records size to be ~1MB for our workloads.

Best regards, Alex.

On Apr 29, 2015, at 11:07 AM, Scott Nolin scott.no...@ssec.wisc.edu
mailto:scott.no...@ssec.wisc.edu wrote:


Ok I looked up my notes.

I'm not really sure what you mean by record size. I assumed when I do
a file per process the block size = file size. And that's what I see
dropped on the filesystem.

I did -F -b size

With block sizes 1MB, 20MB, 100MB, 200MB, 500MB

2, 4, 8, 16 threads on 1 to 4 clients.

I assumed 2 threads on 1 client looks a lot like a client writing or
reading 2 files. I didn't bother looking at 1 thread.

Later I just started doing 100MB tests since it's a very common file
size for us. Plus I didn't see real big difference once size gets
bigger than that.

Scott


On 4/29/2015 10:24 AM, Alexander I Kulyavtsev wrote:

What range of record sizes did you use for IOR? This is more important
than file size.
100MB is small, overall data size (# of files) shall be twice as memory.
I ran series of test for small record size for raidz2 10+2; will re-run
some tests after upgrading to 0.6.4.1 .

Single file performance differs substantially from file per process.

Alex.

On Apr 29, 2015, at 9:38 AM, Scott Nolin scott.no...@ssec.wisc.edu
mailto:scott.no...@ssec.wisc.edu
mailto:scott.no...@ssec.wisc.edu wrote:


I used IOR, singlefile, 100MB files. That's the most important
workload for us. I tried several different file sizes, but 100MB
seemed a reasonable compromise for what I see the most. We rarely or
never do file striping.

I remember I did see a difference between 10+2 and 8+2. Especially at
smaller numbers of clients and threads, the 8+2 performance numbers
were more consistent, made a smoother curve. 10+2 with not a lot of
threads the performance was more variable.





___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org mailto:lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org







smime.p7s
Description: S/MIME Cryptographic Signature
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] New community release model and 2.5.3 (and 2.x.0) patch lists?

2015-04-15 Thread Scott Nolin
Since Intel will not be making community releases for 2.5.4 or 2.x.0 
releases now, it seems the community will need to maintain some sort of 
patch list against these releases.  Especially stability, data 
corruption, and security patches.


I think this is important because if people are trying a Lustre community 
release, they need to be aware of any bugs that might exist and whether 
they're addressed. If things are unstable, Lustre will (re)gain a 
negative reputation as a file system you should not trust with real data.


I don't have any answers here, but would like to start a wider conversation.

Thanks,
Scott


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [Lustre-discuss] Interpreting stats files

2014-11-07 Thread Scott Nolin

Here's how I understand it:

First number = number of times (samples) the OST has handled a read or 
write.


Second number = the minimum read/write size

Third number = maximum read/write size

Fourth = sum of all the read/write requests in bytes, the quantity of 
data read/written.


Since this is in the exports area, all of that is per export, of course.

I am working on a wiki page for sharing information on stats details 
like this which hopefully will be available soon, it's almost ready. It 
will include links to tools like lltop, xltop, and also how to roll your 
own.
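
As a very rough sketch of the roll-your-own approach (paths are as in Roy's
mail below; the field positions assume the stats format shown there, where the
7th field of the read_bytes/write_bytes lines is the byte total), this ranks
exports by total bytes read:

for f in /proc/fs/lustre/obdfilter/*/exports/*/stats; do
    awk -v f="$f" '$1 == "read_bytes" {print $7, f}' "$f"
done | sort -n | tail

Swap in write_bytes to rank the biggest writers instead.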


Scott

On 11/7/2014 2:29 PM, Dragseth Roy Einar wrote:

Is there a description of the stats file formats anywhere?

I'm especially interested in the

/proc/fs/lustre/obdfilter/*/exports/*/stats

files.  For instance, what do the last three numbers on the read_bytes or 
write_bytes lines mean?

[root@oss1 ~]# cat 
/proc/fs/lustre/obdfilter/uit-OST0009/exports/192.168.255.161@o2ib/stats
snapshot_time 1415391739.442581 secs.usecs
read_bytes    365486 samples [bytes] 2720 1048576 363509372808
write_bytes   12183 samples [bytes] 7384 1048576 12762602232
preprw377677 samples [reqs]
commitrw  377669 samples [reqs]
ping  832 samples [reqs]

(I'm trying to create a simple tool to scan all OSSes and detect the worst IO 
hogs on our system)

Regards,
Roy.
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss






smime.p7s
Description: S/MIME Cryptographic Signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre and ZFS notes available

2014-08-14 Thread Scott Nolin

Hi Andrew,

Much of this information is notes and not in a finished format, so it's 
a problem of how much time we have.


The other issue is that contributing to the manual is somewhat cumbersome, as 
you have to submit patches - 
https://wiki.hpdd.intel.com/display/PUB/Making+changes+to+the+Lustre+Manual+source


The bar there is a bit higher - we have to be pretty confident the 
information that's added is correct, know whether it applies to just some 
versions of Lustre, and so on, as opposed to simply "here are our notes that 
work for us."


I will try to review what we have and if anything looks really incorrect 
or missing in the lustre manual we will attempt to issue a patch.


I think in general the lustre manual is correct, but not always 
sufficient. I think the process does make sure incorrect stuff doesn't 
go in at least, but makes it hard to add information.


Scott

On 8/14/2014 6:13 AM, Andrew Holway wrote:

Hi Scott,

Great job! Would you consider merging with the standard Lustre docs?

https://wiki.hpdd.intel.com/display/PUB/Documentation

Thanks,

Andrew


On 12 August 2014 18:58, Scott Nolin scott.no...@ssec.wisc.edu
mailto:scott.no...@ssec.wisc.edu wrote:

Hello,

At UW SSEC my group has been using Lustre for a few years, and
recently Lustre with ZFS as the back end file system. We have found
the Lustre community very open and helpful in sharing information.
Specifically information from various LUG and LAD meetings and the
mailing lists has been very helpful.

With this in mind we would like to share some of our internal
documentation and notes that may be useful to others. These are
working notes, so not a complete guide.

I want to be clear that the official Lustre documentation should be
considered the correct reference material in general. But this
information may be helpful for some -

http://www.ssec.wisc.edu/~__scottn/ http://www.ssec.wisc.edu/~scottn/

Topics that I think of particular interest may be lustre zfs install
notes and JBOD monitoring.

Scott Nolin
UW SSEC



___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org mailto:Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss







smime.p7s
Description: S/MIME Cryptographic Signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] Lustre and ZFS notes available

2014-08-12 Thread Scott Nolin

Hello,

At UW SSEC my group has been using Lustre for a few years, and recently 
Lustre with ZFS as the back end file system. We have found the Lustre 
community very open and helpful in sharing information. Specifically 
information from various LUG and LAD meetings and the mailing lists has 
been very helpful.


With this in mind we would like to share some of our internal 
documentation and notes that may be useful to others. These are working 
notes, so not a complete guide.


I want to be clear that the official Lustre documentation should be 
considered the correct reference material in general. But this 
information may be helpful for some -


http://www.ssec.wisc.edu/~scottn/

Topics that I think of particular interest may be lustre zfs install 
notes and JBOD monitoring.


Scott Nolin
UW SSEC




smime.p7s
Description: S/MIME Cryptographic Signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] number of inodes in zfs MDT

2014-06-16 Thread Scott Nolin
We ran some scrub performance tests, and even without tunables set it wasn't 
too bad, for our specific configuration. The main thing we did was verify it 
made sense to scrub all OSTs simultaneously.

Anyway, indeed scrub and resilver aren't about defrag.

Further, the MDS performance issues aren't about fragmentation.

A side note: it's probably ideal to stay below 80% full due to fragmentation 
for ldiskfs too, or performance degrades.

Sean, note I am dealing with specific issues for a very create-intensive 
workload, and this is on the MDS only, which is where we may change. The data 
integrity features of ZFS make it very attractive too. I fully expect things 
will improve with ZFS.

If you want a lot of certainty in your choices, you may want to consult 
various vendors of Lustre systems.

Scott 




On June 8, 2014 11:42:15 AM CDT, Dilger, Andreas andreas.dil...@intel.com 
wrote:
Scrub and resilver have nothing to do with defrag.

Scrub is scanning of all the data blocks in the pool to verify their
checksums and parity to detect silent data corruption, and rewrite the
bad blocks if necessary.

Resilver is reconstructing a failed disk onto a new disk using parity
or mirror copies of all the blocks on the failed disk. This is similar
to scrub.

Both scrub and resilver can be done online, though resilver of course
requires a spare disk to rebuild onto, which may not be possible to add
to a running system if your hardware does not support it.

Both of them do not improve the performance or layout of data on
disk. They do impact performance because they cause a lot of random IO
to the disks, though this impact can be limited by tunables on the
pool.

Cheers, Andreas

On Jun 8, 2014, at 4:21, Sean Brisbane
s.brisba...@physics.ox.ac.ukmailto:s.brisba...@physics.ox.ac.uk
wrote:

Hi Scott,

We are considering running zfs backed lustre and the factor of 10ish
performance hit you see worries me. I know zfs can splurge bits of
files all over the place by design. The oracle docs do recommend
scrubbing the volumes and keeping usage below 80% for maintenance and
performance reasons, I'm going to call it 'defrag' but I'm sure someone
who knows better will probably correct me as to why it is not the same.
So are these performance issues after scrubbing, and is it possible to
scrub online - I.e. some reasonable level of performance is maintained
while the scrub happens?
Resilvering is also recommended. Not sure if that is for performance
reasons.

http://docs.oracle.com/cd/E23824_01/html/821-1448/zfspools-4.html



Sent from my HTC Desire C on Three

- Reply message -
From: Scott Nolin
scott.no...@ssec.wisc.edumailto:scott.no...@ssec.wisc.edu
To: Anjana Kar k...@psc.edumailto:k...@psc.edu,
lustre-discuss@lists.lustre.orgmailto:lustre-discuss@lists.lustre.org
lustre-discuss@lists.lustre.orgmailto:lustre-discuss@lists.lustre.org
Subject: [Lustre-discuss] number of inodes in zfs MDT
Date: Fri, Jun 6, 2014 3:23 AM



Looking at some of our existing zfs filesystems, we have a couple with
zfs mdts

One has 103M inodes and uses 152G of MDT space, another 12M and 19G.
I’d plan for less than that I guess as Mr. Dilger suggests. It all
depends on your expected average file size and number of files for what
will work.

We have run into some unpleasant surprises with zfs for the MDT, I
believe mostly documented in bug reports, or at least hinted at.

A serious issue we have is performance of the zfs arc cache over time.
This is something we didn’t see in early testing, but with enough use
it grinds things to a crawl. I believe this may be addressed in the
newer version of ZFS, which we’re hopefully awaiting.

Another thing we’ve seen, which is mysterious to me is this it appears
hat as the MDT begins to fill up file create rates go down. We don’t
really have a strong handle on this (not enough for a bug report I
think), but we see this:


  1. The aforementioned 104M inode / 152GB MDT system has 4 SAS drives
raid10. On initial testing file creates were about 2500 to 3000 IOPs
per second. Follow-up testing in its current state (about half full..)
shows them at about 500 IOPs now, but with a few iterations of mdtest
those IOPs plummet quickly to unbearable levels (like 30…).

  2. We took a snapshot of the filesystem and sent it to the backup MDS,
this time with the MDT built on 4 SAS drives in a raid0 - really not
for performance so much as “extra headroom” if that makes any sense.
Testing this the IOPs started higher, at maybe 800 or 1000 (this is
from memory, I don’t have my data in front of me). That initial faster
speed could just be writing to 4 spindles I suppose, but surprising to
me, the performance degraded at a slower rate. It took much longer to
get painfully slow. It still got there. The performance didn’t degrade
at the same rate if that makes sense - the same number of writes on the
smaller/slower mdt degraded the performance more quickly.  My guess is
that had something to do with the total space available

Re: [Lustre-discuss] number of inodes in zfs MDT

2014-06-12 Thread Scott Nolin

Just a note, I see zfs-0.6.3 has just been announced:

https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-announce/Lj7xHtRVOM4

I also see it has been updated in the zfs/lustre repo.

The changelog notes the default arc_meta_limit has changed to 3/4 of
arc_c_max, along with a variety of other fixes, many focusing on performance.

So Anjana, this is probably worth testing, especially if you're considering
drastic measures.

We upgraded for our MDS, so this file create issue is harder for us to test
now (we literally started testing writes this afternoon, and it's not degraded
yet, so far at 20 million writes). Since your problem still happens fairly
quickly, I'm sure any information you have will be very helpful to add to
LU-2476. And if it helps, it may save you some pain.

We will likely install the upgrade but may not be able to test millions of
writes any time soon, as the filesystem is needed for production.


Regards,
Scott


On Thu, 12 Jun 2014 16:41:14 +
 Dilger, Andreas andreas.dil...@intel.com wrote:
It looks like you've already increased arc_meta_limit beyond the default,
which is c_max / 4. That was critical to performance in our testing.

There is also a patch from Brian that should help performance in your case:

http://review.whamcloud.com/10237

Cheers, Andreas

On Jun 11, 2014, at 12:53, Scott Nolin 
scott.no...@ssec.wisc.edumailto:scott.no...@ssec.wisc.edu 
wrote:


We tried a few arc tunables as noted here:

https://jira.hpdd.intel.com/browse/LU-2476

However, I didn't find any clear benefit in the long 
term. We were just trying a few things without a lot of 
insight.


Scott

On 6/9/2014 12:37 PM, Anjana Kar wrote:
Thanks for all the input.

Before we move away from zfs MDT, I was wondering if we 
can try setting zfs
tunables to test the performance. Basically what's a 
value we can use for
arc_meta_limit for our system? Are there are any others 
settings that can

be changed?

Generating small files on our current system, things 
started off at 500

files/sec,
then declined so it was about 1/20th of that after 2.45 
million files.


-Anjana

On 06/09/2014 10:27 AM, Scott Nolin wrote:
We ran some scrub performance tests, and even without 
tunables set it
wasn't too bad, for our specific configuration. The main 
thing we did
was verify it made sense to scrub all OSTs 
simultaneously.


Anyway, indeed scrub or resilver aren't about Defrag.

Further, the mds performance issues aren't about 
fragmentation.


A side note, it's probably ideal to stay below 80% due 
to

fragmentation for ldiskfs too or performance degrades.

Sean, note I am dealing with specific issues for a very 
create intense
workload, and this is on the mds only where we may 
change. The data
integrity features of Zfs make it very attractive too. I 
fully expect

things will improve too with Zfs.

If you want a lot of certainty in your choices, you may 
want to

consult various vendors if lustre systems.

Scott




On June 8, 2014 11:42:15 AM CDT, Dilger, Andreas
andreas.dil...@intel.commailto:andreas.dil...@intel.com 
wrote:


  Scrub and resilver have nothing to so with defrag.

  Scrub is scanning of all the data blocks in the pool 
to verify their checksums and parity to detect silent 
data corruption, and rewrite the bad blocks if necessary.


  Resilver is reconstructing a failed disk onto a new 
disk using parity or mirror copies of all the blocks on 
the failed disk. This is similar to scrub.


  Both scrub and resilver can be done online, though 
resilver of course requires a spare disk to rebuild onto, 
which may not be possible to add to a running system if 
your hardware does not support it.


  Both of them do not improve the performance or 
layout of data on disk. They do impact performance 
because they cause a lot if random IO to the disks, 
though this impact can be limited by tunables on the 
pool.


  Cheers, Andreas

  On Jun 8, 2014, at 4:21, Sean Brisbane 
s.brisba...@physics.ox.ac.ukmailto:s.brisba...@physics.ox.ac.ukmailto:s.brisba...@physics.ox.ac.uk 
wrote:


  Hi Scott,

  We are considering running zfs backed lustre and the 
factor of 10ish performance hit you see worries me. I 
know zfs can splurge bits of files all over the place by 
design. The oracle docs do recommend scrubbing the 
volumes and keeping usage below 80% for maintenance and 
performance reasons, I'm going to call it 'defrag' but 
I'm sure someone who knows better will probably correct 
me as to why it is not the same.
  So are these performance issues after scubbing and is 
it possible to scrub online - I.e. some reasonable level 
of performance is maintained while the scrub happens?
  Resilvering is also recommended. Not sure if that is 
for performance reasons.


  http://docs.oracle.com/cd/E23824_01/html/821-1448/zfspools-4.html



  Sent from my HTC Desire C on Three

  - Reply message -
  From: Scott Nolin 
scott.no...@ssec.wisc.edumailto:scott.no...@ssec.wisc.edumailto:scott.no...@ssec.wisc.edu
  To: Anjana Kar 
k

Re: [Lustre-discuss] number of inodes in zfs MDT

2014-06-11 Thread Scott Nolin

We tried a few arc tunables as noted here:

https://jira.hpdd.intel.com/browse/LU-2476

However, I didn't find any clear benefit in the long term. We were just 
trying a few things without a lot of insight.
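
For anyone who wants to experiment, these are ZFS module parameters, so the
mechanics are roughly like this (the values here are made up, not
recommendations):

# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=17179869184 zfs_arc_meta_limit=12884901888

or at runtime via /sys/module/zfs/parameters/, e.g.

echo 12884901888 > /sys/module/zfs/parameters/zfs_arc_meta_limit

The modprobe.d values only take effect when the module is loaded.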


Scott

On 6/9/2014 12:37 PM, Anjana Kar wrote:

Thanks for all the input.

Before we move away from zfs MDT, I was wondering if we can try setting zfs
tunables to test the performance. Basically what's a value we can use for
arc_meta_limit for our system? Are there are any others settings that can
be changed?

Generating small files on our current system, things started off at 500
files/sec,
then declined so it was about 1/20th of that after 2.45 million files.

-Anjana

On 06/09/2014 10:27 AM, Scott Nolin wrote:

We ran some scrub performance tests, and even without tunables set it
wasn't too bad, for our specific configuration. The main thing we did
was verify it made sense to scrub all OSTs simultaneously.

Anyway, indeed scrub or resilver aren't about Defrag.

Further, the mds performance issues aren't about fragmentation.

A side note, it's probably ideal to stay below 80% due to
fragmentation for ldiskfs too or performance degrades.

Sean, note I am dealing with specific issues for a very create intense
workload, and this is on the mds only where we may change. The data
integrity features of Zfs make it very attractive too. I fully expect
things will improve too with Zfs.

If you want a lot of certainty in your choices, you may want to
consult various vendors if lustre systems.

Scott




On June 8, 2014 11:42:15 AM CDT, Dilger, Andreas
andreas.dil...@intel.com wrote:

Scrub and resilver have nothing to so with defrag.

Scrub is scanning of all the data blocks in the pool to verify their 
checksums and parity to detect silent data corruption, and rewrite the bad 
blocks if necessary.

Resilver is reconstructing a failed disk onto a new disk using parity or 
mirror copies of all the blocks on the failed disk. This is similar to scrub.

Both scrub and resilver can be done online, though resilver of course 
requires a spare disk to rebuild onto, which may not be possible to add to a 
running system if your hardware does not support it.

Both of them do not improve the performance or layout of data on disk. 
They do impact performance because they cause a lot if random IO to the disks, though 
this impact can be limited by tunables on the pool.

Cheers, Andreas

On Jun 8, 2014, at 4:21, Sean Brisbane 
s.brisba...@physics.ox.ac.ukmailto:s.brisba...@physics.ox.ac.uk wrote:

Hi Scott,

We are considering running zfs backed lustre and the factor of 10ish 
performance hit you see worries me. I know zfs can splurge bits of files all 
over the place by design. The oracle docs do recommend scrubbing the volumes 
and keeping usage below 80% for maintenance and performance reasons, I'm going 
to call it 'defrag' but I'm sure someone who knows better will probably correct 
me as to why it is not the same.
So are these performance issues after scubbing and is it possible to scrub 
online - I.e. some reasonable level of performance is maintained while the 
scrub happens?
Resilvering is also recommended. Not sure if that is for performance 
reasons.

http://docs.oracle.com/cd/E23824_01/html/821-1448/zfspools-4.html



Sent from my HTC Desire C on Three

- Reply message -
From: Scott Nolin 
scott.no...@ssec.wisc.edumailto:scott.no...@ssec.wisc.edu
To: Anjana Kar k...@psc.edumailto:k...@psc.edu, 
lustre-discuss@lists.lustre.orgmailto:lustre-discuss@lists.lustre.org 
lustre-discuss@lists.lustre.orgmailto:lustre-discuss@lists.lustre.org
Subject: [Lustre-discuss] number of inodes in zfs MDT
Date: Fri, Jun 6, 2014 3:23 AM



Looking at some of our existing zfs filesystems, we have a couple with zfs 
mdts

One has 103M inodes and uses 152G of MDT space, another 12M and 19G. I’d 
plan for less than that I guess as Mr. Dilger suggests. It all depends on your 
expected average file size and number of files for what will work.

We have run into some unpleasant surprises with zfs for the MDT, I believe 
mostly documented in bug reports, or at least hinted at.

A serious issue we have is performance of the zfs arc cache over time. This 
is something we didn’t see in early testing, but with enough use it grinds 
things to a crawl. I believe this may be addressed in the newer version of ZFS, 
which we’re hopefully awaiting.

Another thing we’ve seen, which is mysterious to me is this it appears hat 
as the MDT begins to fill up file create rates go down. We don’t really have a 
strong handle on this (not enough for a bug report I think), but we see this:


   1.
The aforementioned 104M inode / 152GB MDT system has 4 SAS drives raid10. 
On initial testing file creates were about 2500 to 3000 IOPs per second. Follow 
up testing in it’s current state (about half full..) shows them at about 500 
IOPs now, but with a few

Re: [Lustre-discuss] number of inodes in zfs MDT

2014-06-05 Thread Scott Nolin
Looking at some of our existing zfs filesystems, we have a couple with zfs mdts 




One has 103M inodes and uses 152G of MDT space, another 12M and 19G. I’d plan 
for less than that I guess as Mr. Dilger suggests. It all depends on your 
expected average file size and number of files for what will work.


We have run into some unpleasant surprises with zfs for the MDT, I believe 
mostly documented in bug reports, or at least hinted at.


A serious issue we have is performance of the zfs arc cache over time. This is 
something we didn’t see in early testing, but with enough use it grinds things 
to a crawl. I believe this may be addressed in the newer version of ZFS, which 
we’re hopefully awaiting.


Another thing we’ve seen, which is mysterious to me, is that as the MDT begins 
to fill up, file create rates go down. We don’t really have a strong handle on 
this (not enough for a bug report I think), but we see this:


The aforementioned 104M inode / 152GB MDT system has 4 SAS drives raid10. On 
initial testing file creates were about 2500 to 3000 IOPs per second. Follow-up 
testing in its current state (about half full..) shows them at about 500 IOPs 
now, but with a few iterations of mdtest those IOPs plummet quickly to 
unbearable levels (like 30…).


We took a snapshot of the filesystem and sent it to the backup MDS, this time 
with the MDT built on 4 SAS drives in a raid0 - really not for performance so 
much as “extra headroom” if that makes any sense. Testing this the IOPs started 
higher, at maybe 800 or 1000 (this is from memory, I don’t have my data in 
front of me). That initial faster speed could just be writing to 4 spindles I 
suppose, but surprising to me, the performance degraded at a slower rate. It 
took much longer to get painfully slow. It still got there. The performance 
didn’t degrade at the same rate if that makes sense - the same number of writes 
on the smaller/slower mdt degraded the performance more quickly.  My guess is 
that had something to do with the total space available. Who knows. I believe 
restarting lustre (and certainly rebooting) ‘resets the clock’ on the file 
create performance degradation.



For that problem we’re just going to try adding 4 SSDs, but it’s an ugly 
problem. We are also once again hopeful the new ZFS version addresses it.


And finally, we’ve got a real concern with snapshot backups of the MDT that my 
colleague posted about - the problem we see manifests in essentially a 
read-only recovered file system, so it’s a concern but not quite terrifying.


All in all, the next lustre file system we bring up (in a couple weeks) we are 
very strongly considering going with ldiskfs for the MDT this time.


Scott














From: Anjana Kar
Sent: ‎Tuesday‎, ‎June‎ ‎3‎, ‎2014 ‎7‎:‎38‎ ‎PM
To: lustre-discuss@lists.lustre.org





Is there a way to set the number of inodes for a zfs MDT?

I've tried using the --mkfsoptions=-N value mentioned in the lustre 2.0 
manual, but it fails to accept it. We are mirroring 2 80GB SSDs for the MDT, 
but the number of inodes is getting set to 7 million, which is not enough for 
a 100TB filesystem.

Thanks in advance.

-Anjana Kar
  Pittsburgh Supercomputing Center
  k...@psc.edu
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Building Lustre 2.4.3 Client on EL5?

2014-05-07 Thread Scott Nolin

We have had success with rhel5 2.1.x clients and 2.4.0-1 servers.

On a test system with 2.5 servers we've also just rolled (rhel6) clients 
back to 2.1.6 due to what looks like a client bug and it seems to work too.


Scott

On 5/7/2014 10:47 AM, Jones, Peter A wrote:

Mike

I would think either a 1.8.x-wc1 or 2.1.x release should allow you to have 
RHEL5 clients with 2.4.3 servers

Peter

On 5/7/14, 8:35 AM, Mike Hanby mha...@uab.edumailto:mha...@uab.edu wrote:

If it's not possible, what is the recommended client version to use on EL5 
clients with 2.4.3 backend Lustre servers?

Thanks,

Mike




From: Mike Hanby mha...@uab.edu
Sent: Wednesday, May 07, 2014 10:33AM
To: lustre-discuss@lists.lustre.orgmailto:lustre-discuss@lists.lustre.org 
lustre-discuss@lists.lustre.org
Subject: [Lustre-discuss] Building Lustre 2.4.3 Client on EL5?


Howdy,

Is it possible to build the Lustre 2.4.3 client to run on RHEL/CentOS
5.10 x86_64? Or is any version past 2.3 in capable of building and
functioning on the latest EL5?

Thanks,

Mike
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.orgmailto:Lustre-discuss@lists.lustre.orghttp://lists.lustre.org/mailman/listinfo/lustre-discuss

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss






smime.p7s
Description: S/MIME Cryptographic Signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] [zfs-discuss] Problems getting Lustre started with ZFS

2013-10-24 Thread Scott Nolin

You can do this with the --index option to mkfs, for example:

mkfs.lustre --fsname=(name) --ost --backfstype=zfs --index=0 
--mgsnode=(etc)


--index=0 makes OST0000, --index=1 makes OST0001, and so on.

Scott

On 10/24/2013 2:35 AM, Andrew Holway wrote:

You need to use unique index numbers for each OST, i.e. OST0000,
OST0001, etc.


I cannot see how to control this? I am creating new OST's but they are
all getting the same index number.

Could this be a problem with the mgs?

Thanks,

Andrew



Ned

To unsubscribe from this group and stop receiving emails from it, send an email 
to zfs-discuss+unsubscr...@zfsonlinux.org.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss






smime.p7s
Description: S/MIME Cryptographic Signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] ldiskfs for MDT and zfs for OSTs?

2013-10-08 Thread Scott Nolin
I would check to make sure your ldev.conf file is set up with the 
lustre-ost0 and host name properly.
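
If it's missing or mismatched, an /etc/ldev.conf entry matching the commands
shown below would be roughly like this (the hostname, label, and dataset are
guesses based on Anjana's mkfs command; the format is
"local_host foreign_host label device"):

cajal - cajalfs-OST0000 zfs:lustre-ost0/ost0

(use whatever pool/dataset you actually formatted). The init script wants the
fsname-OSTxxxx style label or the hostname, not the pool name, so the start
command would then be something like 'service lustre start cajalfs-OST0000'.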


Scott

On 10/8/2013 10:40 AM, Anjana Kar wrote:

The git checkout was on Sep. 20. Was the patch before or after?

The zpool create command successfully creates a raidz2 pool, and mkfs.lustre
does not complain, but

[root@cajal kar]# zpool list
NAME  SIZE  ALLOC   FREECAP  DEDUP  HEALTH  ALTROOT
lustre-ost0  36.2T  2.24M  36.2T 0%  1.00x  ONLINE  -

[root@cajal kar]# /usr/sbin/mkfs.lustre --fsname=cajalfs --ost
--backfstype=zfs --index=0 --mgsnode=10.10.101.171@o2ib lustre-ost0

[root@cajal kar]# /sbin/service lustre start lustre-ost0
lustre-ost0 is not a valid lustre label on this node

I think we'll be splitting up the MDS and OSTs on 2 nodes as some of you
said
there could be other issues down the road, but thanks for all the good
suggestions.

-Anjana

On 10/07/2013 07:24 PM, Ned Bass wrote:

I'm guessing your git checkout doesn't include this commit:

* 010a78e Revert LU-3682 tunefs: prevent tunefs running on a mounted device

It looks like the LU-3682 patch introduced a bug that could cause your issue,
so its reverted in the latest master.

Ned

On Mon, Oct 07, 2013 at 04:54:13PM -0400, Anjana Kar wrote:

On 10/07/2013 04:27 PM, Ned Bass wrote:

On Mon, Oct 07, 2013 at 02:23:32PM -0400, Anjana Kar wrote:

Here is the exact command used to create a raidz2 pool with 8+2 drives,
followed by the error messages:

mkfs.lustre --fsname=cajalfs --reformat --ost --backfstype=zfs
--index=0 --mgsnode=10.10.101.171@o2ib lustre-ost0/ost0 raidz2
/dev/sda /dev/sdc /dev/sde /dev/sdg /dev/sdi /dev/sdk /dev/sdm
/dev/sdo /dev/sdq /dev/sds

mkfs.lustre FATAL: Invalid filesystem name /dev/sds

It seems that either the version of mkfs.lustre you are using has a
parsing bug, or there was some sort of syntax error in the actual
command entered.  If you are certain your command line is free from
errors, please post the version of lustre you are using, or report the
bug in the Lustre issue tracker.

Thanks,
Ned

For building this server, I followed steps from the walk-thru-build*
for Centos 6.4,
and added --with-spl and --with-zfs when configuring lustre..
*https://wiki.hpdd.intel.com/pages/viewpage.action?pageId=8126821

spl and zfs modules were installed from source for the lustre 2.4 kernel
2.6.32.358.18.1.el6_lustre2.4

Device sds appears to be valid, but I will try issuing the command
using by-path
names..

-Anjana


___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss






smime.p7s
Description: S/MIME Cryptographic Signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] ldiskfs for MDT and zfs for OSTs?

2013-10-07 Thread Scott Nolin



Ned


Here is the exact command used to create a raidz2 pool with 8+2 drives,
followed by the error messages:

mkfs.lustre --fsname=cajalfs --reformat --ost --backfstype=zfs --index=0
--mgsnode=10.10.101.171@o2ib lustre-ost0/ost0 raidz2 /dev/sda /dev/sdc
/dev/sde /dev/sdg /dev/sdi /dev/sdk /dev/sdm /dev/sdo /dev/sdq /dev/sds

mkfs.lustre FATAL: Invalid filesystem name /dev/sds

mkfs.lustre FATAL: unable to prepare backend (22)
mkfs.lustre: exiting with 22 (Invalid argument)

dmesg shows
ZFS: Loaded module v0.6.2-1, ZFS pool version 5000, ZFS filesystem version 5

Any suggestions on creating the pool separately?


Just make sure you can see /dev/sds in your system - if not, that's your 
problem.


I would also suggest considering building this without using these top-level 
/dev names. It is very easy for these to change accidentally. If 
you're just testing it's fine, but over time it will be a problem.


See
http://zfsonlinux.org/faq.html#WhatDevNamesShouldIUseWhenCreatingMyPool

I like the vdev_id.conf with meaningful (to our sysadmins) aliases to 
device 'by-path'.
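
A rough example of what we mean (the alias names and PCI paths here are
invented):

# /etc/zfs/vdev_id.conf
alias e0d00  /dev/disk/by-path/pci-0000:05:00.0-sas-phy0-lun-0
alias e0d01  /dev/disk/by-path/pci-0000:05:00.0-sas-phy1-lun-0

After a 'udevadm trigger' the disks show up under /dev/disk/by-vdev/ with
those names, which can then be used in zpool create and survive device
re-enumeration.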


Scott



smime.p7s
Description: S/MIME Cryptographic Signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre buffer cache causes large system overhead.

2013-08-23 Thread Scott Nolin

You might also try increasing the vfs_cache_pressure.

This will reclaim inode and dentry caches faster. Maybe that's the 
problem, not page caches.
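
If you want to experiment, it's an ordinary sysctl; the default is 100 and
higher values reclaim dentries/inodes more aggressively (the 200 here is just
an illustration):

sysctl -w vm.vfs_cache_pressure=200

slabtop is handy for watching whether the dentry and inode slabs actually
shrink.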


To be clear - I have no deep insight into Lustre's use of the client 
cache, but you said you have lots of small files, which, if Lustre uses 
the cache system like other filesystems, means it may be inodes/dentries. 
Filling up the page cache with files like you did in your other tests 
wouldn't have the same effect. Just my guess here.


We had some experience years ago with the opposite sort of problem. We 
have a big FTP server, and we want to *keep* inode/dentry data in the 
Linux cache, as there are often stupid numbers of files in directories. 
Files were always flowing through the server, so the page cache would 
force out the inode cache. I was surprised to find that with Linux there's 
no ability to set a fixed inode cache size - the best you can do is make 
a suggestion with the cache pressure tunable.


Scott

On 8/23/2013 6:29 AM, Dragseth Roy Einar wrote:

I tried to change swapiness from 0 to 95 but it did not have any impact on the
system overhead.

r.


On Thursday 22. August 2013 15.38.37 Dragseth Roy Einar wrote:

No, I cannot detect any swap activity on the system.

r.

On Thursday 22. August 2013 09.21.33 you wrote:

Is this slowdown due to increased swap activity?  If yes, then try
lowering the swappiness value.  This will sacrifice buffer cache space
to
lower swap activity.

Take a look at http://en.wikipedia.org/wiki/Swappiness.

Roger S.

On 08/22/2013 08:51 AM, Roy Dragseth wrote:

We have just discovered that a large buffer cache generated from
traversing a lustre file system will cause a significant system overhead
for applications with high memory demands.  We have seen a 50% slowdown
or worse for applications.  Even High Performance Linpack, which has no
file IO whatsoever, is affected.  The only remedy seems to be to empty the
buffer cache from memory by running echo 3 > /proc/sys/vm/drop_caches

Any hints on how to improve the situation is greatly appreciated.


System setup:
Client: Dual socket Sandy Bridge, with 32GB ram and infiniband
connection
to lustre server.  CentOS 6.4, with kernel 2.6.32-358.11.1.el6.x86_64
and
lustre v2.1.6 rpms downloaded from whamcloud download site.

Lustre: 1 MDS and 4 OSS running Lustre 2.1.3 (also from whamcloud site).
Each OSS has 12 OST, total 1.1 PB storage.

How to reproduce:

Traverse the lustre file system until the buffer cache is large enough.
In our case we run

   find . -print0 -type f | xargs -0 cat > /dev/null

on the client until the buffer cache reaches ~15-20GB.  (The lustre file
system has lots of small files so this takes up to an hour.)

Kill the find process and start a single node parallel application, we
use
HPL (high performance linpack).  We run on all 16 cores on the system
with 1GB ram per core (a normal run should complete in appr. 150
seconds.)  The system monitoring shows a 10-20% system cpu overhead and
the HPL run takes more than 200 secs.  After running echo 3 >
/proc/sys/vm/drop_caches the system performance goes back to normal with
a run time at 150 secs.

I've created an infographic from our ganglia graphs for the above
scenario.

https://dl.dropboxusercontent.com/u/23468442/misc/lustre_bc_overhead.png

Attached is an excerpt from perf top indicating that the kernel routine
taking the most time is _spin_lock_irqsave if that means anything to
anyone.


Things tested:

It does not seem to matter if we mount lustre over infiniband or
ethernet.

Filling the buffer cache with files from an NFS filesystem does not
degrade
performance.

Filling the buffer cache with one large file does not give degraded
performance. (tested with iozone)


Again, any hints on how to proceed is greatly appreciated.


Best regards,
Roy.



___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss





smime.p7s
Description: S/MIME Cryptographic Signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre buffer cache causes large system overhead.

2013-08-23 Thread Scott Nolin

I forgot to add 'slabtop' is a nice tool for watching this stuff.

Scott

On 8/23/2013 9:36 AM, Scott Nolin wrote:

You might also try increasing the vfs_cache_pressure.

This will reclaim inode and dentry caches faster. Maybe that's the
problem, not page caches.

To be clear - I have no deep insight into Lustre's use of the client
cache, but you said you has lots of small files, which if lustre uses
the cache system like other filesystems means it may be inodes/dentries.
Filling up the page cache with files like you did in your other tests
wouldn't have the same effect. Just my guess here.

We had some experience years ago with the opposite sort of problem. We
have a big ftp server, and we want to *keep* inode/dentry data in the
linux cache, as there are often stupid numbers of files in directories.
Files were always flowing through the server, so the page cache would
force out the inode cache. Was surprised to find with linux there's no
ability to set a fixed inode cache size - the best you can do is
suggest with the cache pressure tunable.

Scott

On 8/23/2013 6:29 AM, Dragseth Roy Einar wrote:

I tried to change swapiness from 0 to 95 but it did not have any
impact on the
system overhead.

r.






smime.p7s
Description: S/MIME Cryptographic Signature
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss