Re: [lustre-discuss] Is there a ceiling of lustre filesystems a client can mount
On Jul 15, 2020, at 12:29 AM, ??? wrote:
> Is there a ceiling for a Lustre filesystem that can be mounted in a cluster?

It is very high, as Andreas said.

> If so, what's the number?

The following contains specific limits:
https://build.whamcloud.com/job/lustre-manual//lastSuccessfulBuild/artifact/lustre_manual.xhtml#idm140436304680016

You'll notice that you must assume some aspects of configuration, such as the size and number of your OSTs. I see OSTs in the range of 75-400TB (and OST counts between 58 and 187).

> If not, how much is proper?

Lustre is designed to scale, so a config with a small number of OSTs on very few OSSes doesn't make that much sense. OSTs are pretty much expected to be decent-sized RAIDs. There are tradeoffs among cost-efficient disk sizes (maybe 16T today), RAID overhead (usually N+2), and bandwidth (HBA and OSS network).

> Does mounting multiple filesystems affect the stability of each filesystem or cause other problems?

My experience is that the main factor in reliability is device count, rather than how the devices are organized. For instance, if you have more OSSes, you may get linearly nicer performance, but you also increase your chance of having components crash or fail. The main reason for separate filesystems is usually that the MDS (maybe MDT) can be a bottleneck. But you can scale MDSes instead.

regards, mark hahn.
Re: [lustre-discuss] kernel panics with Lustre 2.7
> cep4-fs-MDT-osd: Overflow in tracking declares for index, rb = 4

there is a bug fixed in 2.8 (LU-4045) that may address this (at least the "Overflow...index" message appears there).
Re: [lustre-discuss] multiple filesystems in MGS vs folder-based ACL? pros/cons
> FYI, Project Quotas exist beginning with Lustre 2.10.0.

but not yet for ZFS configs, right? sorry, I don't remember whether the OP mentioned which underlying filesystem they were using...
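For reference, once the feature is available, project quotas are driven with lfs; the project id, paths and limit below are made-up examples (and at the time of this thread the feature was ldiskfs-only):

  lfs project -p 1001 -s -r /lustre/fs1/projects/groupA   # tag a tree with project id 1001, inheritable
  lfs setquota -p 1001 -B 10T /lustre/fs1                 # block hard limit for that project
  lfs quota -p 1001 /lustre/fs1                           # report usage against the project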
Re: [lustre-discuss] multiple filesystems in MGS vs folder-based ACL? pros/cons
> we have different research groups. i am thinking to have one filesystem and, beneath it, using ACLs, have project folders.

well, the first approach should be to use the normal Unix mechanism: owners and groups. ACLs are usually treated as a way to make exceptions, since owner/group will capture most of the correct sharing relations. after all, there's little harm in seeing lots of names in your /project mount. unless someone botches the permissions, only the right people can traverse the trees.

> Just curious, what are the pros/cons of having multiple filesystems vs a single filesystem with folders?

scalability of management. it's not obviously scalable to manage many separate filesystems, but very easy to manage thousands of groups on a single filesystem.

> any advice?

unless you have to prevent users from even seeing the existence of other users, just use a single filesystem. Lustre's current/traditional owner-based quota accounting is a bit of a drag, but eventually there will be project quotas...

regards, mark hahn
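As a concrete sketch of the owner/group-first approach, with an ACL only for the odd exception (group and user names here are hypothetical):

  mkdir /project/astro
  chgrp astro /project/astro
  chmod 2770 /project/astro                 # setgid: new files inherit the group
  setfacl -m u:visitor1:rx /project/astro   # one-off exception for an external collaborator
  getfacl /project/astro                    # verify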
Re: [lustre-discuss] Lustre as /home directory
> My question is, is it advisable to have /home in Lustre, since users' data will be small files (less than 5MB)?

certainly it works, but it is not very pleasant for metadata-intensive activity, such as compiling.
Re: [lustre-discuss] Are there any performance hits with the https://access.redhat.com/security/vulnerabilities/speculativeexecution?
> Also, to what extent would a Lustre system that is essentially a filer be at risk? It's not running user code and you're not browsing from it...

to be vulnerable, attack code must run on the system.
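If you want to see what a given server reports, kernels with the fixes backported expose a sysfs directory for it (absent on older, unpatched kernels):

  grep . /sys/devices/system/cpu/vulnerabilities/*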
Re: [lustre-discuss] ZFS-OST layout, number of OSTs
> It's also worth noting that if you have small OSTs it's much easier to bump into a full-OST situation. And specifically, if you singly stripe a file, the file size is limited by the size of the OST.

is there enough real-life experience to know whether progressive file layouts will mitigate this issue?

thanks,
Mark Hahn | SHARCnet Sysadmin | h...@sharcnet.ca | http://www.sharcnet.ca
McMaster RHPCS | h...@mcmaster.ca | 905 525 9140 x24687
Compute/Calcul Canada | http://www.computecanada.ca
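For context, a progressive file layout (Lustre 2.10+) widens the stripe count as a file grows, so only genuinely large files spread across many OSTs; the extent boundaries and stripe counts below are arbitrary examples, not a recommendation:

  lfs setstripe -E 256M -c 1 -E 16G -c 4 -E -1 -c -1 /lustre/fs1/dir
  lfs getstripe -d /lustre/fs1/dir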
Re: [lustre-discuss] Linux users are not able to access lustre folders
> No directory /home/luser6
> Logging in with home="/".

perhaps selinux or the wacky new systemd "ProtectHome" stuff?
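Two quick things to check on the affected node (assuming the login comes in via sshd; substitute whichever service actually handles it):

  getenforce                                   # is selinux enforcing?
  systemctl show sshd.service -p ProtectHome   # systemd sandboxing of /home for that service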
Re: [lustre-discuss] Lustre [2.8.0] and the Linux Automounter
> We are considering using Lustre [2.8.0] with the Linux automounter. Is anyone using this combination successfully? Are there any caveats?

fwiw, we use static Lustre mounts and then autofs doing bind mounts if we want to make the exposed namespace tidier. Lustre client mounts are inherently heavier-weight than simple local (or NFS) mounts; I'd definitely try to avoid doing them frequently. Admittedly, we haven't tried Lustre automounts for years, but IMO automounts only make sense when the FS is infrequently used. (For us, it would be very strange if Lustre weren't in use - that would be an idle node, basically.)

regards,
Mark Hahn | SHARCnet Sysadmin | h...@sharcnet.ca | http://www.sharcnet.ca
McMaster RHPCS | h...@mcmaster.ca | 905 525 9140 x24687
Compute/Calcul Canada | http://www.computecanada.ca
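The static-mount-plus-bind-mount arrangement looks roughly like this in autofs (map names and paths are illustrative; the Lustre filesystem itself stays statically mounted at /lustre/fs1 via fstab):

  # /etc/auto.master
  /project  /etc/auto.project

  # /etc/auto.project
  groupA  -fstype=bind  :/lustre/fs1/projects/groupA
  groupB  -fstype=bind  :/lustre/fs1/projects/groupB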
Re: [lustre-discuss] RobinHood fail on Lustre IEEL v2.5
> 2017/06/08 16:19:38 [15616/21] FS_Scan | openat failed on 23/pci-:00:1a.0-usb-0:1.6.1:1.2-event-mouse: Too many levels of symbolic links

RBH has strayed into /sys, which is loaded with traps like this...

> Currently the filesystem is in production. Can this be the main reason why it crashes?

no, I think RBH is being told to scan from / (perhaps by default) rather than being directed to the Lustre mount point.

regards, mark hahn.
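In other words, check the General block of the RobinHood config and make sure fs_path points at the Lustre mount; the path below is just an example and the exact layout depends on your RBH version's template:

  General {
      fs_path = "/mnt/lustre";   # scan the Lustre mount point, not /
      fs_type = lustre;
  }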
Re: [lustre-discuss] What happens when a file grows
> What happens when a file is created without striping and it gets larger than the available space on the OST that it is on?

ENOSPC. but I think it's a bit misleading to say "without striping" - leaving the filesystem default set at 1 is still a striping choice...

regards, mark hahn.
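To see and change that choice (paths are examples):

  lfs getstripe -d /lustre/fs1/dir         # show the directory's default layout
  lfs setstripe -c 4 /lustre/fs1/dir       # new files created in dir get 4 stripes
  lfs setstripe -c -1 /lustre/fs1/bigfile  # create a new file striped over all OSTs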
Re: [lustre-discuss] Lustre on ZFS poor direct I/O performance
> anyway if I force direct I/O, for example using oflag=direct in dd, the write performance drops as low as 8MB/sec with a 1MB block size, and each write has about 120ms of latency.

but that's quite a small block size. do you approach buffered performance if you write significantly bigger blocks (8-32M)? presumably you're already striping across OSTs?
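Something along these lines would show whether the block size is the limiter (test path and sizes are arbitrary examples):

  dd if=/dev/zero of=/lustre/fs1/ddtest bs=1M  count=1024 oflag=direct
  dd if=/dev/zero of=/lustre/fs1/ddtest bs=16M count=64   oflag=direct
  lfs getstripe /lustre/fs1/ddtest   # a singly-striped file limits direct I/O to one OST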
Re: [lustre-discuss] Advice Lustre & ZFS.
> Isn't hw raid bounds and leagues faster than any software raid that you're ever going to be putting in?

certainly not. the choice of hw vs sw raid is largely down to taste: do you want an opaque/proprietary but more hands-off admin workflow, or do you want direct control of what's happening (failovers, etc)? sw raid has been able to efficiently saturate any reasonable set of disks for many years. after all, device-memory IO already goes through the processor, memory bandwidth is "leagues" greater than IO, and Moore has given us an excess of very fast cores that can each do ~10 GB/s of raid6 calculations...

regards, mark hahn.
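You can see the kernel's own measurement of those raid6/xor rates, since it benchmarks the routines when the raid modules load:

  dmesg | grep -iE 'raid6|xor'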
Re: [lustre-discuss] MDS crashing: unable to handle kernel paging request at 00000000deadbeef (iam_container_init+0x18/0x70)
> We had to use lustre-2.5.3.90 on the MDS servers because of a memory leak. https://jira.hpdd.intel.com/browse/LU-5726
> Mark, if you don't have the patch for LU-5726, then you should definitely try to get that one. If nothing else, reading through the bug report might be useful. It details some of the MDS OOM problems I had and mentions setting vm.zone_reclaim_mode=0. It also has Robin Humble's suggestion of setting "options libcfs cpu_npartitions=1" (which is something that I started doing as well).

thanks, we'll be trying the LU-5726 patch and the cpu_npartitions setting. it's quite a long thread - do I understand correctly that a periodic vm.drop_caches=1 can postpone the issue? I can replicate the warning signs mentioned in the thread (growth of Inactive(file) to dizzying heights when doing a lot of unlinks).

It seems odd that if this is purely a memory-balance problem, it manifests as a 0xdeadbeef panic rather than OOM. While I understand that the oom-killer path itself needs memory to operate, does this also imply that some allocation in the kernel or filesystem is not checking a return value?

thanks, mark hahn.
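For the record, the knobs discussed in that thread would be applied roughly like this (the values are the ones mentioned there, not recommendations):

  sysctl -w vm.zone_reclaim_mode=0
  echo 1 > /proc/sys/vm/drop_caches                                       # periodically, e.g. from cron
  echo "options libcfs cpu_npartitions=1" > /etc/modprobe.d/libcfs.conf   # takes effect on module reload
  grep 'Inactive(file)' /proc/meminfo                                     # the warning sign noted above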
Re: [lustre-discuss] MDS crashing: unable to handle kernel paging request at 00000000deadbeef (iam_container_init+0x18/0x70)
> Our problem seems to correlate with an unintentional creation of a tree of >500M files. Some of the crashes we've had since then appeared to be related to vm.zone_reclaim_mode=1. We also enabled quotas right after the 500M-file thing, and were thinking that inconsistent quota records might cause this sort of crash.
> Have you set vm.zone_reclaim_mode=0 yet? I had an issue with this on my file system a while back when it was set to 1.

all our existing Lustre MDSes run happily with vm.zone_reclaim_mode=0, and making this one consistent appears to have resolved a problem (in which one family of lustre kernel threads would appear to spin, with "perf top" showing nearly all time spent in spinlock_irq, iirc).

might your system have had a *lot* of memory? ours tend to be fairly modest (32-64G, dual-socket intel).

thanks,
Mark Hahn | SHARCnet Sysadmin | h...@sharcnet.ca | http://www.sharcnet.ca
McMaster RHPCS | h...@mcmaster.ca | 905 525 9140 x24687
Compute/Calcul Canada | http://www.computecanada.ca
Re: [lustre-discuss] MDS crashing: unable to handle kernel paging request at 00000000deadbeef (iam_container_init+0x18/0x70)
> Giving the rest of the back trace of the crash would help the developers looking at it. It's a lot easier to tell what code is involved with the whole trace.

thanks. I'm sure that's the case, but these oopsen are truncated. well, one was slightly longer:

BUG: unable to handle kernel paging request at 00000000deadbeef
IP: [] iam_container_init+0x18/0x70 [osd_ldiskfs]
PGD 0
Oops: 0002 [#1] SMP
last sysfs file: /sys/devices/system/cpu/online
CPU 14
Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc mlx4_en ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 iTCO_wdt iTCO_vendor_support serio_raw raid10 i2c_i801 lpc_ich mfd_core ipmi_devintf mlx4_core sg acpi_pad igb dca i2c_algo_bit i2c_core ptp pps_core shpchp ext4 jbd2 mbcache raid1 sr_mod cdrom sd_mod crc_t10dif isci libsas mpt2sas scsi_transport_sas raid_class ahci wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Pid: 7768, comm: mdt00_039 Not tainted 2.6.32-431.23.3.el6_lustre.x86_64 #1 Supermicro SYS-2027R-WRF/X9DRW

by way of straw-grasping, I'll mention two other very frequent messages we're seeing on the MDS in question:

Lustre: 17673:0:(mdt_xattr.c:465:mdt_reint_setxattr()) covework-MDT: client miss to set OBD_MD_FLCTIME when setxattr system.posix_acl_access: [object [0x200031f84:0x1cad0:0x0]] [valid 68719476736]

(which seems to be https://jira.hpdd.intel.com/browse/LU-532 and a consequence of some of our very old clients, but not MDS-crash-able.)

LustreError: 22970:0:(tgt_lastrcvd.c:813:tgt_last_rcvd_update()) covework-MDT: trying to overwrite bigger transno: on-disk: 197587694105, new: 197587694104, replay: 0. see LU-617.

perplexing, because the MDS is 2.5.3 and https://jira.hpdd.intel.com/browse/LU-617 shows it fixed circa 2.2.0/2.1.2 (and our problem isn't with recovery, afaict).

thanks! regards,
Mark Hahn | SHARCnet Sysadmin | h...@sharcnet.ca | http://www.sharcnet.ca
McMaster RHPCS | h...@mcmaster.ca | 905 525 9140 x24687
Compute/Calcul Canada | http://www.computecanada.ca
[lustre-discuss] MDS crashing: unable to handle kernel paging request at 00000000deadbeef (iam_container_init+0x18/0x70)
One of our MDSs is crashing with the following:

BUG: unable to handle kernel paging request at 00000000deadbeef
IP: [] iam_container_init+0x18/0x70 [osd_ldiskfs]
PGD 0
Oops: 0002 [#1] SMP

The MDS is running 2.5.3-RC1--PRISTINE-2.6.32-431.23.3.el6_lustre.x86_64 with about 2k clients ranging from 1.8.8 to 2.6.0.

I'd appreciate any comments on where to point fingers: google doesn't provide anything suggestive about iam_container_init.

Our problem seems to correlate with an unintentional creation of a tree of >500M files. Some of the crashes we've had since then appeared to be related to vm.zone_reclaim_mode=1. We also enabled quotas right after the 500M-file thing, and were thinking that inconsistent quota records might cause this sort of crash.

But 0xdeadbeef is usually added as a canary for allocation issues; is it used this way in Lustre?

thanks,
Mark Hahn | SHARCnet Sysadmin | h...@sharcnet.ca | http://www.sharcnet.ca
McMaster RHPCS | h...@mcmaster.ca | 905 525 9140 x24687
Compute/Calcul Canada | http://www.computecanada.ca
Re: [Lustre-discuss] recovery from multiple disks failure on the same md
> I'd also recommend to start periodic scrubbing: we do this once per month with low priority (~5MB/s) with little impact on the users.

yes. and if you think a rebuild might overstress marginal disks, throttling via the dev.raid.speed_limit_max sysctl can help.
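On an md array, that looks something like this (md0 is an example; the limit is in KB/s):

  echo check > /sys/block/md0/md/sync_action   # start a scrub
  sysctl -w dev.raid.speed_limit_max=5000      # throttle scrub/rebuild to ~5MB/s
  cat /proc/mdstat                             # watch progress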
Re: [Lustre-discuss] Failover / reliability using SAS direct-attached storage
> It seems an external fibre or SAS raid is needed,

to be precise, a redundant-path SAN is needed. you could do it with commodity disks and Gb, or you can spend almost unlimited amounts on gold-plated disks, FC switches, etc. the range of costs is really quite remarkable, I guess O(100x). compare this to cars, where even VERY nice production cars are only a few times more expensive than the most cost-effective ones.

> as the idea of losing the file system if one node goes down doesn't seem good, even if temporary.

how often do you expect nodes to fail, and why?

regards, mark hahn.
Re: [Lustre-discuss] software raid
choosing sw vs hw raid depends entirely on your systems, pocketbook and taste. I think there are two edge cases which are pretty unambiguous:

- modern systems have obscene CPU power and memory bandwidth, at least compared to disks, and even compared to the embedded cpu in raid cards. this means that software raid is very fast and attractive for at least moderate numbers of disks. because disks are so incredibly cheap, it's almost a shame not to use the 6+ sata ports present on every motherboard, for instance.

- if you need to minimize CPU and memory-bandwidth overheads, or address very large numbers of disks, you want as much hardware assist as you can get, even though it's expensive and wimpy. having 100 15k rpm SAS disks as JBOD under SW raid would make little sense, since the disks, expanders, backplanes and controllers overwhelm the cost savings.

I think it boils down to your personal weighting of factors in TCO. "classic" best practice, for instance, emphasizes device reliability to maintain extreme uptime and minimize admin monkeywork. that's fine, but it's completely opposite to the less ideological, more market-reality-driven approach that recognizes disks cost $30/TB and dropping, and that with appropriate use of redundancy, mass-market hardware can still achieve however many nines you set your heart on.

it is convenient that a 2u node supporting 6-12 disks can be done with the free/builtin controller and SW raid, and delivers bandwidth that matches the relevant network interfaces (10G, IB). I like the fact that a single unit like that has no "extra" firmware to maintain, and no over-smart controllers to go bonkers. IPMI power control includes the disks. SMART works directly. and in a pinch, the content can be brought online via any old PC.

I've used MD since it was new in the kernel, and never had problems with it.

regards, mark hahn
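A minimal sketch of such a node's storage stack, assuming eight data disks on the onboard controller (device names, chunk size and filesystem choice are illustrative only):

  mdadm --create /dev/md0 --level=6 --raid-devices=8 --chunk=128 /dev/sd[b-i]
  cat /proc/mdstat                 # monitor array state and rebuilds
  smartctl -a /dev/sdb             # SMART still works directly on the member disks
  mkfs.ext4 /dev/md0               # or mkfs.lustre when building an OST backend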