[lustre-discuss] Is IML abandoned?
I see the whamcloud repo is abandoned and I don't really see any forks or other information out there. Does anyone know if this is abandoned? https://github.com/whamcloud/integrated-manager-for-lustre

Scott
Re: [lustre-discuss] Lustre [2.8.0] and the Linux Automounter
We use them fairly extensively with 2.7 and 2.8 and find them useful and stable. For cluster nodes, perhaps not a big deal, but for our many non-cluster clients it is useful. We have multiple filesystems, so sometimes many mounts. The main benefit, I think, is not that they sometimes unmount, but that if there are ever issues and some filesystems are temporarily not available, there is much, much less pain.

Scott

From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of Chad DeWitt
Sent: Monday, June 19, 2017 8:52 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Lustre [2.8.0] and the Linux Automounter

Good morning, All. We are considering using Lustre [2.8.0] with the Linux automounter. Is anyone using this combination successfully? Are there any caveats? (I did check JIRA, but only found two tickets concerning 1.x Lustre.) Thank you in advance, Chad

Chad DeWitt, CISSP | HPC Storage Administrator
UNC Charlotte | ITS – University Research Computing
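For anyone setting this up, an autofs map for Lustre clients can look roughly like the sketch below. The map names, MGS NIDs, and filesystem names are placeholders, and the right mount options depend on the site, so treat it as a starting point rather than a verified recipe:

# /etc/auto.master -- delegate the /lustre tree to an indirect map
/lustre  /etc/auto.lustre  --timeout=600

# /etc/auto.lustre -- one entry per filesystem; the key becomes /lustre/<key>.
# -fstype=lustre makes autofs call "mount -t lustre" with the location given.
scratch  -fstype=lustre,flock  mgs01@o2ib:mgs02@o2ib:/scratch
data     -fstype=lustre        mgs01@o2ib:mgs02@o2ib:/data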
Re: [lustre-discuss] lshowmount equivalent?
On 12/14/2015 12:43 AM, Dilger, Andreas wrote:
... Is this a tool that you are using? IIRC, there wasn't a particular reason that it was removed, except that when we asked LLNL (the authors) they said they were no longer using it, and we couldn't find anyone that was using it, so it was removed in commit b5a7260ae8f along with a bunch of other old tools.

Thanks for the reply, indeed we were using it. We don't use it daily, but when doing some things it is really convenient.

If there is a demand for lshowmount I don't think it would be hard to reinstate. If it makes more sense for it to be a separate tool outside the lustre code base, that'd be fine too I think.

Thanks,
Scott
[lustre-discuss] lshowmount equivalent?
We noticed that with recent versions of lustre, lshowmount has disappeared. Annoyingly, it's still in the lustre docs; this is at least noted with a ticket:

https://build.hpdd.intel.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#dbdoclet.50438219_64286
https://jira.hpdd.intel.com/browse/LUDOC-308

It's one of those tools that is very handy if you know about it, or was while it was there... Is there a new command that is similar that isn't documented or something?

Thanks,
Scott
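Not a documented replacement, but as a rough stopgap the connected client NIDs show up as export directories under /proc on the servers (paths as seen on 2.x; adjust if your layout differs):

# On the MDS -- which client NIDs currently have the MDT connected:
ls /proc/fs/lustre/mdt/*/exports/

# On an OSS -- the same per OST:
ls /proc/fs/lustre/obdfilter/*/exports/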
Re: [lustre-discuss] Large-scale UID/GID changes via lfs
We use the robinhood policy engine to do this for GIDs to provide a fake group quota sort of thing. If you have to do this routinely, look into robinhood.

Regards,
Scott

On 10/14/2015 9:52 PM, Ms. Megan Larko wrote:
Hello, I have been able to successfully use "lfs lsetfacl ..." to set and modify permissions on a Lustre file system quickly with a small system, because the lfs is directed at the Lustre MDT. It is similar, I imagine, to using "lfs find..." to search a Lustre fs compared with a *nix "find..." command, the latter of which must touch every stripe located on any OST. So, how do I change UID and/or GID over a Lustre file system? Doing a *nix find and chown seems to have the same detrimental performance.

>lfs lgetfacl my.file

The above returns the file ACL info. I can change permissions and add a group or user access/perm, but I don't know how to change the "header" information. (To see the difference in header information, one could try "lfs lgetfacl --no-head my.file" which shows the ACL info without the header.)

>lfs lsetfacl -m user:newPerson:rwx my.file

The above adds a user with those perms to the original user listed in the header info. This is using Lustre version 2.6.x (forgot minor number) on RHEL 6.5. Suggestions genuinely appreciated.

Cheers,
megan
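If robinhood is more machinery than you want for a one-off change, a hedged sketch of the brute-force route is below. The uid/gid numbers and mount point are placeholders, and the matching is done with lfs find (via the MDS) so only the chown/chgrp itself walks the files; check 'lfs help find' for option support in your version before trusting it:

# Remap one owner across the filesystem (old uid 1001 -> new uid 2001):
lfs find /mnt/lustre --uid 1001 --print0 | xargs -0 chown 2001

# Same idea for a group change (old gid 500 -> new gid 600):
lfs find /mnt/lustre --gid 500 --print0 | xargs -0 chgrp 600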
Re: [lustre-discuss] lustre client server interoperability
I'd just add that we've been generally OK running a variety of 2.X servers vs 2.whatever clients. For our latest project I hear we're going to try 2.7 servers and 2.8 clients. The client machines for us are much more likely to need OS versions pushed forward.

Regarding interim patches to Lustre, my feeling is the important thing is to simply know what patches are critical. I believe all the patches are still public from Intel (and how about other people providing lustre patches?). There has been some discussion about sharing information on people's patch sets on wiki.lustre.org, but I haven't seen anything come out. Patrick, is Cray providing public maintenance releases? Or sharing information on important patches?

Scott

On 8/12/2015 7:32 AM, Patrick Farrell wrote:
Jon, You've got the interop right. Unfortunately, Intel is no longer doing public maintenance versions of Lustre, so 2.8 will not receive updates after release. - Patrick

From: Jon Tegner [jon.teg...@foi.se]
Sent: Wednesday, August 12, 2015 1:16 AM
To: Patrick Farrell; Kurt Strosahl
Cc: lustre-discuss@lists.lustre.org; Jan Pettersson
Subject: Re: lustre client server interoperability

So if I understand correctly, one has the following CentOS options:
1. Lustre-2.5.3 with CentOS-6 on both clients and servers.
2. Lustre-2.5.3, CentOS-6 on servers, and 2.7.0 and CentOS-7 on clients.
3. Wait a while and use Lustre-2.8.0/CentOS-7 on clients and servers.
At least on clients I would prefer to run CentOS-7, but if 2.7 (and 2.8 - will this version receive updates?) are less reliable, that might not be a good idea? Any thoughts on this would be greatly appreciated. Thanks! /jon

From: lustre-discuss lustre-discuss-boun...@lists.lustre.org on behalf of Patrick Farrell p...@cray.com
Sent: 11 August 2015 21:23
To: Kurt Strosahl
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] lustre client server interoperability

No - 2.5 is the last public stable client release from Intel.

On 8/11/15, 2:22 PM, Kurt Strosahl stros...@jlab.org wrote:
So is there a stable client for centos 7 that is backwards compatible with 2.5.3? w/r, Kurt

- Original Message -
From: Patrick Farrell p...@cray.com
To: Kurt Strosahl stros...@jlab.org, lustre-discuss@lists.lustre.org
Sent: Monday, August 10, 2015 4:24:15 PM
Subject: RE: lustre client server interoperability

Kurt, Yes. It's worth noting that 2.7 is probably marginally less reliable than 2.5, since it has had no updates/fixes since it was released.

From: lustre-discuss [lustre-discuss-boun...@lists.lustre.org] on behalf of Kurt Strosahl [stros...@jlab.org]
Sent: Monday, August 10, 2015 2:25 PM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] lustre client server interoperability

Hello, Is the 2.7 lustre client compatible with lustre 2.5.3 servers? I'm running a 2.5.3 lustre system, but have been asked by a few people about upgrading some of our clients to CentOS 7 (which appears to need a 2.7 or greater client).

w/r,
Kurt J. Strosahl
System Administrator
Scientific Computing Group, Thomas Jefferson National Accelerator Facility
Re: [lustre-discuss] Size difference between du and quota
Another thing to think about: does he perhaps own files outside of his directory? The quota is on the volume but you are only doing du on the directory. Even if he's not aware of it, things can happen like people using rsync and preserving ownership. The original owner's usage then goes up.

Scott

On 5/20/2015 3:50 AM, Phill Harvey-Smith wrote:
Hi all, One of my users is reporting a massive size difference between the figures reported by du and quota.

Doing a du -hs on his directory reports:
du -hs .
529G .

Doing a lfs quota -u username /storage reports:
Filesystem  kbytes     quota  limit  grace  files   quota  limit  grace
/storage    621775192  64000  64001  -      601284  100    110    -

Though this user does have a lot of files:
find . -type f | wc -l
581519

So I suspect that it is the typical thing that quota is reporting used blocks whilst du is reporting used bytes, which can of course be wildly different due to filesystem overhead and wasted unused space at the end of files where a block is allocated but only partially used. Is this likely to be the case? I'm also not entirely sure what versions of lustre the client machines and MDS / OSS servers are running, as I didn't initially set the system up.

Cheers.
Phill.
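A hedged sketch of checking both theories (the username and mount point are placeholders):

# What quota accounts to the user across the whole filesystem:
lfs quota -u username /storage

# Files owned by that uid anywhere on the mount, not just under his directory
# (lfs find does the ownership match via the MDS):
lfs find /storage --uid $(id -u username) | head -20

# Blocks actually used vs apparent bytes for the directory itself:
du -sh /storage/username
du -sh --apparent-size /storage/username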
[lustre-discuss] OpenSFS / EOFS Presentations index
It would be really convenient if all the presentations from the various LUG, LAD, and similar meetings were available on one page. Ideally there would also be some kind of keywords for each presentation for easy searching, but even just having a comprehensive list of links would be valuable I think.

Scott
Re: [lustre-discuss] zfs -- mds/mdt -- ssd model / type recommendation
I just want to second what Rick said - it's create/remove, not stat of files, where there are performance penalties. We covered this issue for our workload just by using SSDs for our mdt, when normally we'd just use fast SAS drives. A bigger deal for us was RAM on the server, and improvements with SPL 0.6.3+.

Scott

It's Lustre on ZFS, especially for metadata operations that create, modify, or remove inodes. Native ZFS metadata operations are much faster than what Lustre on ZFS is currently providing. That said, we've gone with a ZFS-based patchless MDS, since read operations have always been more critical for us, and our performance is more than adequate.
--Rick
[lustre-discuss] jobstats, SLURM_JOB_ID, array jobs and pain.
Has anyone been working with the lustre jobstats feature and SLURM? We have been, and it's OK. But now that I'm working on systems that run a lot of array jobs and a fairly recent slurm version, we found some ugly stuff.

Array jobs do report their SLURM_JOBID as a variable, and it's unique for every job. But they use other IDs too that appear only for array jobs. http://slurm.schedmd.com/job_array.html

However, that unique SLURM_JOBID as far as I can tell is only truly exposed in command line tools via 'scontrol' - which is only valid while the job is running. If you want to look at older jobs with sacct, for example, things are troublesome.

Here's what my coworker and I have figured out:
- You submit a (non-array) job that gets jobid 100.
- The next job gets jobid 101.
- Then submit a 10 task array job. That gets jobid 102. The sub tasks get 9 more job ids. If nothing else is happening with the system, that means you use jobid 102 to 112.

If things were that orderly, you could cope with using SLURM_JOB_ID in lustre jobstats pretty easily. Use sacct and you see job 102_2 - you know that is jobid 103 in lustre jobstats. But, if other jobs get submitted during set up (as of course they do), they can take jobid 103. So, you've got problems.

I think we may try to set a magic variable in the slurm prolog and use that for the jobstats variable, but who knows.

Scott
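For anyone hunting for the relevant knobs: the server-side setting is the jobstats identifier variable (jobid_var), and the prolog idea can be sketched roughly as below. The variable name MY_LUSTRE_JOBID, the fsname "myfs", and the use of the task prolog are illustrative assumptions, not something from the original post; check your Lustre and Slurm versions before relying on it.

# On the MGS: have jobstats read a site-defined environment variable
# instead of SLURM_JOB_ID.
lctl conf_param myfs.sys.jobid_var=MY_LUSTRE_JOBID

# In the Slurm task prolog: emit an "export ..." line so every task carries an
# ID that already encodes the array relationship (plain job ID otherwise).
if [ -n "$SLURM_ARRAY_JOB_ID" ]; then
    echo "export MY_LUSTRE_JOBID=${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}"
else
    echo "export MY_LUSTRE_JOBID=${SLURM_JOB_ID}"
fi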
[lustre-discuss] ZFS pages on wiki.lustre.org
Chris and others, I have moved my ZFS notes to the lustre wiki and made a category - http://wiki.lustre.org/Category:ZFS I am hoping others will post their notes or useful tips, and have talked to a few people. Can you make a top level link to the category? Thank you, Scott
Re: [lustre-discuss] OST partition sizes
Ok, I looked up my notes. I'm not really sure what you mean by record size. I assumed when I do a file per process, the block size = file size. And that's what I see dropped on the filesystem.

I did -F -b <size>

with block sizes 1MB, 20MB, 100MB, 200MB, 500MB and 2, 4, 8, 16 threads on 1 to 4 clients. I assumed 2 threads on 1 client looks a lot like a client writing or reading 2 files. I didn't bother looking at 1 thread.

Later I just started doing 100MB tests since it's a very common file size for us. Plus I didn't see a real big difference once size gets bigger than that.

Scott

On 4/29/2015 10:24 AM, Alexander I Kulyavtsev wrote:
What range of record sizes did you use for IOR? This is more important than file size. 100MB is small; overall data size (# of files) shall be twice the memory. I ran a series of tests for small record size for raidz2 10+2; will re-run some tests after upgrading to 0.6.4.1. Single file performance differs substantially from file per process.
Alex.

On Apr 29, 2015, at 9:38 AM, Scott Nolin scott.no...@ssec.wisc.edu wrote:
I used IOR, singlefile, 100MB files. That's the most important workload for us. I tried several different file sizes, but 100MB seemed a reasonable compromise for what I see the most. We rarely or never do file striping. I remember I did see a difference between 10+2 and 8+2. Especially at smaller numbers of clients and threads, the 8+2 performance numbers were more consistent, made a smoother curve. 10+2 with not a lot of threads the performance was more variable.
Re: [lustre-discuss] OST partition sizes
Ah, I used 256K xfersize for all my tests. 1MB would probably be a better test.

Scott

On 4/29/2015 11:38 AM, Alexander I Kulyavtsev wrote:
ior/bin/IOR.mpiio.mvapich2-2.0b -h
-t N transferSize -- size of transfer in bytes (e.g.: 8, 4k, 2m, 1g)

IOR reports it in the log:
Command line used: /home/aik/lustre/benchmark/git/ior/bin/IOR.mpiio.mvapich2-2.0b -v -a MPIIO -i5 -g -e -w -r -b 16g -C -t 8k -o /mnt/lfs/admin/iotest/ior/stripe_2/ior-testfile.ssf
...
Summary:
api                 = MPIIO (version=3, subversion=0)
test filename       = /mnt/lfs/admin/iotest/ior/stripe_2/ior-testfile.ssf
access              = single-shared-file, independent
pattern             = segmented (1 segment)
ordering in a file  = sequential offsets
ordering inter file = constant task offsets = 1
clients             = 32 (8 per node)
repetitions         = 5
xfersize            = 8192 bytes
blocksize           = 16 GiB
aggregate filesize  = 512 GiB

Here we have xfersize 8k, each client of 32 writes 16GB, so the aggregate file size is 512GB. I would expect record size to be ~1MB for our workloads.

Best regards,
Alex.
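In case it helps anyone repeat these runs, here is a sketch of the file-per-process case being described; the launcher, rank count, and output path are placeholders, and the flags follow the IOR usage quoted above (-F file per process, -b per-task file size, -t transfer/record size):

# File-per-process write+read, 100 MB files, 1 MB transfers, 3 iterations.
mpirun -np 8 ./IOR -a POSIX -F -w -r -i 3 -b 100m -t 1m -o /mnt/lustre/iortest/testfile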
[lustre-discuss] New community release model and 2.5.3 (and 2.x.0) patch lists?
Since Intel will not be making community releases for 2.5.4 or 2.x.0 releases now, it seems the community will need to maintain some sort of patch list against these releases - especially stability, data corruption, and security patches.

I think this is important because if people are trying a Lustre community release they need to be aware of any bugs that might exist, and whether they're addressed. If things are unstable, lustre will (re)gain a negative reputation as a file system you should not trust with real data.

I don't have any answers here, but would like to start a wider conversation.

Thanks,
Scott
Re: [Lustre-discuss] Interpreting stats files
Here's how I understand it:

First number = number of times (samples) the OST has handled a read or write.
Second number = the minimum read/write size.
Third number = the maximum read/write size.
Fourth = sum of all the read/write requests in bytes, the quantity of data read/written.

Since this is in the exports area, it's all per export of course.

I am working on a wiki page for sharing information on stats details like this which hopefully will be available soon, it's almost ready. It will include links to tools like lltop, xltop, and also how to roll your own.

Scott

On 11/7/2014 2:29 PM, Dragseth Roy Einar wrote:
Is there a description of the stats file formats anywhere? I'm especially interested in the /proc/fs/lustre/obdfilter/*/exports/*/stats files. For instance, what do the last three numbers on the read_bytes or write_bytes lines mean?

[root@oss1 ~]# cat /proc/fs/lustre/obdfilter/uit-OST0009/exports/192.168.255.161@o2ib/stats
snapshot_time    1415391739.442581 secs.usecs
read_bytes       365486 samples [bytes] 2720 1048576 363509372808
write_bytes      12183 samples [bytes] 7384 1048576 12762602232
preprw           377677 samples [reqs]
commitrw         377669 samples [reqs]
ping             832 samples [reqs]

(I'm trying to create a simple tool to scan all OSSes and detect the worst IO hogs on our system.)

Regards,
Roy.
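If anyone wants to roll their own before that wiki page appears, a minimal sketch that ranks exports by total bytes read (run on an OSS; write_bytes works the same way, and the last field of the line is the byte sum):

# One line per (OST, client NID) pair, largest readers first.
for d in /proc/fs/lustre/obdfilter/*/exports/*/; do
    [ -f "$d/stats" ] || continue
    nid=$(basename "$d")
    bytes=$(awk '/^read_bytes/ {s += $NF} END {print s+0}' "$d/stats")
    echo "$bytes $nid"
done | sort -rn | head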
Re: [Lustre-discuss] Lustre and ZFS notes available
Hi Andrew,

Much of this information is notes and not in a finished format, so it's a problem of how much time we have. The other issue is contributing to the manual is somewhat cumbersome as you have to submit patches - https://wiki.hpdd.intel.com/display/PUB/Making+changes+to+the+Lustre+Manual+source

The bar there is a bit higher - we have to be pretty confident the information that's added is correct, know if it's just for some versions of lustre, and so on. As opposed to simply "here are our notes that work for us."

I will try to review what we have and if anything looks really incorrect or missing in the lustre manual we will attempt to issue a patch. I think in general the lustre manual is correct, but not always sufficient. I think the process does make sure incorrect stuff doesn't go in at least, but makes it hard to add information.

Scott

On 8/14/2014 6:13 AM, Andrew Holway wrote:
Hi Scott, Great job! Would you consider merging with the standard Lustre docs? https://wiki.hpdd.intel.com/display/PUB/Documentation

Thanks,
Andrew

On 12 August 2014 18:58, Scott Nolin scott.no...@ssec.wisc.edu wrote:
Hello, At UW SSEC my group has been using Lustre for a few years, and recently Lustre with ZFS as the back end file system. We have found the Lustre community very open and helpful in sharing information. Specifically information from various LUG and LAD meetings and the mailing lists has been very helpful. With this in mind we would like to share some of our internal documentation and notes that may be useful to others. These are working notes, so not a complete guide. I want to be clear that the official Lustre documentation should be considered the correct reference material in general. But this information may be helpful for some - http://www.ssec.wisc.edu/~scottn/ Topics that I think of particular interest may be lustre zfs install notes and JBOD monitoring.

Scott Nolin
UW SSEC
[Lustre-discuss] Lustre and ZFS notes available
Hello,

At UW SSEC my group has been using Lustre for a few years, and recently Lustre with ZFS as the back end file system. We have found the Lustre community very open and helpful in sharing information. Specifically, information from various LUG and LAD meetings and the mailing lists has been very helpful.

With this in mind we would like to share some of our internal documentation and notes that may be useful to others. These are working notes, so not a complete guide. I want to be clear that the official Lustre documentation should be considered the correct reference material in general. But this information may be helpful for some - http://www.ssec.wisc.edu/~scottn/

Topics that I think of particular interest may be lustre zfs install notes and JBOD monitoring.

Scott Nolin
UW SSEC
Re: [Lustre-discuss] number of inodes in zfs MDT
We ran some scrub performance tests, and even without tunables set it wasn't too bad, for our specific configuration. The main thing we did was verify it made sense to scrub all OSTs simultaneously.

Anyway, indeed scrub or resilver aren't about defrag. Further, the mds performance issues aren't about fragmentation. A side note: it's probably ideal to stay below 80% due to fragmentation for ldiskfs too, or performance degrades.

Sean, note I am dealing with specific issues for a very create-intense workload, and this is on the mds only, where we may change. The data integrity features of ZFS make it very attractive too. I fully expect things will improve too with ZFS. If you want a lot of certainty in your choices, you may want to consult various vendors of lustre systems.

Scott

On June 8, 2014 11:42:15 AM CDT, Dilger, Andreas andreas.dil...@intel.com wrote:
Scrub and resilver have nothing to do with defrag. Scrub is scanning of all the data blocks in the pool to verify their checksums and parity to detect silent data corruption, and rewrite the bad blocks if necessary. Resilver is reconstructing a failed disk onto a new disk using parity or mirror copies of all the blocks on the failed disk. This is similar to scrub. Both scrub and resilver can be done online, though resilver of course requires a spare disk to rebuild onto, which may not be possible to add to a running system if your hardware does not support it. Both of them do not improve the performance or layout of data on disk. They do impact performance because they cause a lot of random IO to the disks, though this impact can be limited by tunables on the pool.

Cheers, Andreas

On Jun 8, 2014, at 4:21, Sean Brisbane s.brisba...@physics.ox.ac.uk wrote:
Hi Scott, We are considering running zfs backed lustre and the factor of 10ish performance hit you see worries me. I know zfs can splurge bits of files all over the place by design. The oracle docs do recommend scrubbing the volumes and keeping usage below 80% for maintenance and performance reasons. I'm going to call it 'defrag' but I'm sure someone who knows better will probably correct me as to why it is not the same. So are these performance issues after scrubbing, and is it possible to scrub online - i.e. some reasonable level of performance is maintained while the scrub happens? Resilvering is also recommended. Not sure if that is for performance reasons. http://docs.oracle.com/cd/E23824_01/html/821-1448/zfspools-4.html

Sent from my HTC Desire C on Three
Re: [Lustre-discuss] number of inodes in zfs MDT
Just a note, I see zfs-0.6.3 has just been announced: https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-announce/Lj7xHtRVOM4

I also see it is upgraded in the zfs/lustre repo. The changelog notes the default arc_meta_limit as changed to 3/4 of arc_c_max, and a variety of other fixes, many focusing on performance.

So Anjana, this is probably worth testing, especially if you're considering drastic measures. We upgraded our MDS, so this file create issue is harder for us to test now (literally started testing writes this afternoon, and it's not degraded yet, so far at 20 million writes). Since your problem still happens fairly quickly I'm sure any information you have will be very helpful to add to LU-2476. And if it helps, it may save you some pain.

We will likely install the upgrade but may not be able to test millions of writes any time soon, as the filesystem is needed for production.

Regards,
Scott

On Thu, 12 Jun 2014 16:41:14, Dilger, Andreas andreas.dil...@intel.com wrote:
It looks like you've already increased arc_meta_limit beyond the default, which is c_max / 4. That was critical to performance in our testing. There is also a patch from Brian that should help performance in your case: http://review.whamcloud.com/10237

Cheers,
Andreas
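For concreteness, the tunable being discussed on ZFS on Linux is the zfs_arc_meta_limit module parameter (in bytes). The 12 GiB and 16 GiB values below are purely illustrative, not a recommendation from this thread:

# Check the current limit and raise it at runtime (lost on module reload):
cat /sys/module/zfs/parameters/zfs_arc_meta_limit
echo 12884901888 > /sys/module/zfs/parameters/zfs_arc_meta_limit   # 12 GiB

# Make it persistent across reboots, in /etc/modprobe.d/zfs.conf:
options zfs zfs_arc_max=17179869184 zfs_arc_meta_limit=12884901888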
Re: [Lustre-discuss] number of inodes in zfs MDT
We tried a few arc tunables as noted here: https://jira.hpdd.intel.com/browse/LU-2476

However, I didn't find any clear benefit in the long term. We were just trying a few things without a lot of insight.

Scott

On 6/9/2014 12:37 PM, Anjana Kar wrote:
Thanks for all the input. Before we move away from zfs MDT, I was wondering if we can try setting zfs tunables to test the performance. Basically, what's a value we can use for arc_meta_limit for our system? Are there any other settings that can be changed? Generating small files on our current system, things started off at 500 files/sec, then declined so it was about 1/20th of that after 2.45 million files.

-Anjana
Re: [Lustre-discuss] number of inodes in zfs MDT
Looking at some of our existing zfs filesystems, we have a couple with zfs mdts. One has 103M inodes and uses 152G of MDT space, another 12M and 19G. I'd plan for less than that I guess, as Mr. Dilger suggests. It all depends on your expected average file size and number of files for what will work.

We have run into some unpleasant surprises with zfs for the MDT, I believe mostly documented in bug reports, or at least hinted at. A serious issue we have is performance of the zfs arc cache over time. This is something we didn't see in early testing, but with enough use it grinds things to a crawl. I believe this may be addressed in the newer version of ZFS, which we're hopefully awaiting.

Another thing we've seen, which is mysterious to me, is that it appears that as the MDT begins to fill up, file create rates go down. We don't really have a strong handle on this (not enough for a bug report I think), but we see this:

1. The aforementioned 104M inode / 152GB MDT system has 4 SAS drives raid10. On initial testing file creates were about 2500 to 3000 IOPs per second. Follow up testing in its current state (about half full..) shows them at about 500 IOPs now, but with a few iterations of mdtest those IOPs plummet quickly to unbearable levels (like 30…).

2. We took a snapshot of the filesystem and sent it to the backup MDS, this time with the MDT built on 4 SAS drives in a raid0 - really not for performance so much as "extra headroom" if that makes any sense. Testing this, the IOPs started higher, at maybe 800 or 1000 (this is from memory, I don't have my data in front of me). That initial faster speed could just be writing to 4 spindles I suppose, but surprising to me, the performance degraded at a slower rate. It took much longer to get painfully slow. It still got there. The performance didn't degrade at the same rate if that makes sense - the same number of writes on the smaller/slower mdt degraded the performance more quickly. My guess is that had something to do with the total space available. Who knows.

I believe restarting lustre (and certainly rebooting) 'resets the clock' on the file create performance degradation. For that problem we're just going to try adding 4 SSDs, but it's an ugly problem. Also we are once again hopeful the new zfs version addresses it.

And finally, we've got a real concern with snapshot backups of the MDT that my colleague posted about - the problem we see manifests in essentially a read-only recovered file system, so it's a concern and not quite terrifying.

All in all, for the next lustre file system we bring up (in a couple weeks) we are very strongly considering going with ldiskfs for the MDT this time.

Scott

From: Anjana Kar
Sent: Tuesday, June 3, 2014 7:38 PM
To: lustre-discuss@lists.lustre.org

Is there a way to set the number of inodes for zfs MDT? I've tried using the --mkfsoptions="-N value" mentioned in the lustre 2.0 manual, but it fails to accept it. We are mirroring 2 80GB SSDs for the MDT, but the number of inodes is getting set to 7 million, which is not enough for a 100TB filesystem.

Thanks in advance.
-Anjana Kar
Pittsburgh Supercomputing Center
k...@psc.edu
Re: [Lustre-discuss] Building Lustre 2.4.3 Client on EL5?
We have had success with rhel5 2.1.x clients and 2.4.0-1 servers. On a test system with 2.5 servers we've also just rolled (rhel6) clients back to 2.1.6, due to what looks like a client bug, and it seems to work too.

Scott

On 5/7/2014 10:47 AM, Jones, Peter A wrote:
Mike, I would think either a 1.8.x-wc1 or 2.1.x release should allow you to have RHEL5 clients with 2.4.3 servers.
Peter

On 5/7/14, 8:35 AM, Mike Hanby mha...@uab.edu wrote:
If it's not possible, what is the recommended client version to use on EL5 clients with 2.4.3 backend Lustre servers?
Thanks, Mike

From: Mike Hanby mha...@uab.edu
Sent: Wednesday, May 07, 2014 10:33AM
To: lustre-discuss@lists.lustre.org
Subject: [Lustre-discuss] Building Lustre 2.4.3 Client on EL5?

Howdy, Is it possible to build the Lustre 2.4.3 client to run on RHEL/CentOS 5.10 x86_64? Or is any version past 2.3 incapable of building and functioning on the latest EL5?

Thanks,
Mike
Re: [Lustre-discuss] [zfs-discuss] Problems getting Lustre started with ZFS
You can do this with the --index option to mkfs, for example:

mkfs.lustre --fsname=(name) --ost --backfstype=zfs --index=0 --mgsnode=(etc)

makes OST0000, --index=1 makes OST0001, and so on.

Scott

On 10/24/2013 2:35 AM, Andrew Holway wrote:
You need to use unique index numbers for each OST, i.e. OST0, OST1, etc.

I cannot see how to control this? I am creating new OST's but they are all getting the same index number. Could this be a problem with the mgs?

Thanks,
Andrew

Ned
Re: [Lustre-discuss] ldiskfs for MDT and zfs for OSTs?
I would check to make sure your ldev.conf file is set up with the lustre-ost0 and host name properly.

Scott

On 10/8/2013 10:40 AM, Anjana Kar wrote:
The git checkout was on Sep. 20. Was the patch before or after?

The zpool create command successfully creates a raidz2 pool, and mkfs.lustre does not complain, but

[root@cajal kar]# zpool list
NAME         SIZE   ALLOC  FREE   CAP  DEDUP  HEALTH  ALTROOT
lustre-ost0  36.2T  2.24M  36.2T  0%   1.00x  ONLINE  -

[root@cajal kar]# /usr/sbin/mkfs.lustre --fsname=cajalfs --ost --backfstype=zfs --index=0 --mgsnode=10.10.101.171@o2ib lustre-ost0
[root@cajal kar]# /sbin/service lustre start lustre-ost0
lustre-ost0 is not a valid lustre label on this node

I think we'll be splitting up the MDS and OSTs on 2 nodes as some of you said there could be other issues down the road, but thanks for all the good suggestions.

-Anjana

On 10/07/2013 07:24 PM, Ned Bass wrote:
I'm guessing your git checkout doesn't include this commit:
* 010a78e Revert LU-3682 tunefs: prevent tunefs running on a mounted device
It looks like the LU-3682 patch introduced a bug that could cause your issue, so it's reverted in the latest master.
Ned

On Mon, Oct 07, 2013 at 04:54:13PM -0400, Anjana Kar wrote:
On 10/07/2013 04:27 PM, Ned Bass wrote:
On Mon, Oct 07, 2013 at 02:23:32PM -0400, Anjana Kar wrote:
Here is the exact command used to create a raidz2 pool with 8+2 drives, followed by the error messages:

mkfs.lustre --fsname=cajalfs --reformat --ost --backfstype=zfs --index=0 --mgsnode=10.10.101.171@o2ib lustre-ost0/ost0 raidz2 /dev/sda /dev/sdc /dev/sde /dev/sdg /dev/sdi /dev/sdk /dev/sdm /dev/sdo /dev/sdq /dev/sds

mkfs.lustre FATAL: Invalid filesystem name /dev/sds

It seems that either the version of mkfs.lustre you are using has a parsing bug, or there was some sort of syntax error in the actual command entered. If you are certain your command line is free from errors, please post the version of lustre you are using, or report the bug in the Lustre issue tracker.
Thanks, Ned

For building this server, I followed steps from the walk-thru-build* for Centos 6.4, and added --with-spl and --with-zfs when configuring lustre.
*https://wiki.hpdd.intel.com/pages/viewpage.action?pageId=8126821

spl and zfs modules were installed from source for the lustre 2.4 kernel 2.6.32.358.18.1.el6_lustre2.4

Device sds appears to be valid, but I will try issuing the command using by-path names.

-Anjana
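For reference, /etc/ldev.conf wants one line per target: local hostname, failover partner (or '-'), the Lustre label, and the device, with ZFS datasets prefixed by zfs:. The init script is then given the label, not the pool name. A sketch, assuming a host named cajal-oss1 (a placeholder) and the lustre-ost0/ost0 dataset used earlier in the thread:

# /etc/ldev.conf   (local  foreign  label  device)
cajal-oss1  -  cajalfs-OST0000  zfs:lustre-ost0/ost0

# then:
service lustre start cajalfs-OST0000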
Re: [Lustre-discuss] ldiskfs for MDT and zfs for OSTs?
Ned,

Here is the exact command used to create a raidz2 pool with 8+2 drives, followed by the error messages:

mkfs.lustre --fsname=cajalfs --reformat --ost --backfstype=zfs --index=0 --mgsnode=10.10.101.171@o2ib lustre-ost0/ost0 raidz2 /dev/sda /dev/sdc /dev/sde /dev/sdg /dev/sdi /dev/sdk /dev/sdm /dev/sdo /dev/sdq /dev/sds

mkfs.lustre FATAL: Invalid filesystem name /dev/sds
mkfs.lustre FATAL: unable to prepare backend (22)
mkfs.lustre: exiting with 22 (Invalid argument)

dmesg shows
ZFS: Loaded module v0.6.2-1, ZFS pool version 5000, ZFS filesystem version 5

Any suggestions on creating the pool separately?

Just make sure you can see /dev/sds in your system - if not, that's your problem.

I would also suggest considering building this without using these top level dev names. It is very easy for these to change accidentally. If you're just testing it's fine, but over time it will be a problem. See http://zfsonlinux.org/faq.html#WhatDevNamesShouldIUseWhenCreatingMyPool

I like the vdev_id.conf with meaningful (to our sysadmins) aliases to device 'by-path'.

Scott
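A minimal sketch of the vdev_id.conf approach mentioned above; the alias names and by-path strings are made up for illustration, so substitute the real ones from ls -l /dev/disk/by-path on your system:

# /etc/zfs/vdev_id.conf -- map admin-friendly names onto stable by-path links
alias enc0-slot0  /dev/disk/by-path/pci-0000:03:00.0-sas-phy0-lun-0
alias enc0-slot1  /dev/disk/by-path/pci-0000:03:00.0-sas-phy1-lun-0

# After "udevadm trigger" the aliases appear under /dev/disk/by-vdev/ and can
# be used in place of /dev/sdX names:
zpool create lustre-ost0 raidz2 enc0-slot0 enc0-slot1 ...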
Re: [Lustre-discuss] Lustre buffer cache causes large system overhead.
You might also try increasing vfs_cache_pressure. This will reclaim inode and dentry caches faster. Maybe that's the problem, not page caches. To be clear - I have no deep insight into Lustre's use of the client cache, but you said you have lots of small files, which, if lustre uses the cache system like other filesystems, means it may be inodes/dentries. Filling up the page cache with files like you did in your other tests wouldn't have the same effect. Just my guess here.

We had some experience years ago with the opposite sort of problem. We have a big ftp server, and we want to *keep* inode/dentry data in the linux cache, as there are often stupid numbers of files in directories. Files were always flowing through the server, so the page cache would force out the inode cache. I was surprised to find that with linux there's no ability to set a fixed inode cache size - the best you can do is suggest with the cache pressure tunable.

Scott

On 8/23/2013 6:29 AM, Dragseth Roy Einar wrote:
I tried to change swappiness from 0 to 95 but it did not have any impact on the system overhead.
r.

On Thursday 22. August 2013 15.38.37 Dragseth Roy Einar wrote:
No, I cannot detect any swap activity on the system.
r.

On Thursday 22. August 2013 09.21.33 you wrote:
Is this slowdown due to increased swap activity? If yes, then try lowering the swappiness value. This will sacrifice buffer cache space to lower swap activity. Take a look at http://en.wikipedia.org/wiki/Swappiness.
Roger S.

On 08/22/2013 08:51 AM, Roy Dragseth wrote:
We have just discovered that a large buffer cache generated from traversing a lustre file system will cause a significant system overhead for applications with high memory demands. We have seen a 50% slowdown or worse for applications. Even High Performance Linpack, which has no file IO whatsoever, is affected. The only remedy seems to be to empty the buffer cache from memory by running "echo 3 > /proc/sys/vm/drop_caches".

Any hints on how to improve the situation is greatly appreciated.

System setup:
Client: Dual socket Sandy Bridge, with 32GB ram and infiniband connection to lustre server. CentOS 6.4, with kernel 2.6.32-358.11.1.el6.x86_64 and lustre v2.1.6 rpms downloaded from whamcloud download site.
Lustre: 1 MDS and 4 OSS running Lustre 2.1.3 (also from whamcloud site). Each OSS has 12 OSTs, total 1.1 PB storage.

How to reproduce:
Traverse the lustre file system until the buffer cache is large enough. In our case we run
find . -print0 -type f | xargs -0 cat > /dev/null
on the client until the buffer cache reaches ~15-20GB. (The lustre file system has lots of small files so this takes up to an hour.)

Kill the find process and start a single node parallel application, we use HPL (high performance linpack). We run on all 16 cores on the system with 1GB ram per core (a normal run should complete in appr. 150 seconds). The system monitoring shows a 10-20% system cpu overhead and the HPL run takes more than 200 secs. After running "echo 3 > /proc/sys/vm/drop_caches" the system performance goes back to normal with a run time at 150 secs.

I've created an infographic from our ganglia graphs for the above scenario.
https://dl.dropboxusercontent.com/u/23468442/misc/lustre_bc_overhead.png

Attached is an excerpt from perf top indicating that the kernel routine taking the most time is _spin_lock_irqsave, if that means anything to anyone.

Things tested:
It does not seem to matter if we mount lustre over infiniband or ethernet.
Filling the buffer cache with files from an NFS filesystem does not degrade performance.
Filling the buffer cache with one large file does not give degraded performance. (tested with iozone)

Again, any hints on how to proceed is greatly appreciated.

Best regards,
Roy.
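For what it's worth, the tunable Scott suggests above is a sysctl; a quick sketch of raising it and watching the dentry/Lustre inode slabs while reproducing the problem (the value 200 is only an example, and slab names can vary by kernel and Lustre version):

# Default is 100; higher values make the kernel reclaim dentry/inode caches sooner.
sysctl vm.vfs_cache_pressure=200
echo "vm.vfs_cache_pressure = 200" >> /etc/sysctl.conf   # persist across reboots

# Watch the relevant slabs grow while traversing the filesystem:
grep -E 'dentry|lustre_inode_cache' /proc/slabinfo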
Re: [Lustre-discuss] Lustre buffer cache causes large system overhead.
I forgot to add 'slabtop' is a nice tool for watching this stuff.

Scott
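Typical invocation, for anyone who hasn't used it (procps slabtop; the 'c' sort key orders by cache size):

# One-shot dump of the largest slab caches:
slabtop -o -s c | head -20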