Re: [lustre-discuss] Is there a ceiling on the number of Lustre filesystems a client can mount

2020-07-16 Thread Mark Hahn

On Jul 15, 2020, at 12:29 AM, ???  wrote:

Is there a ceiling on the number of Lustre filesystems that can be mounted in a cluster?


It is very high, as Andreas said.


If so, what's the number?


The following contains specific limits:

https://build.whamcloud.com/job/lustre-manual//lastSuccessfulBuild/artifact/lustre_manual.xhtml#idm140436304680016

You'll notice that the answer depends on some aspects of your configuration,
such as the size and number of your OSTs.  I see OSTs in the range of 75-400 TB
(and OST counts between 58 and 187).
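(as a rough worked example from those numbers: 187 OSTs of 400 TB each would be
about 75 PB in a single filesystem, which should still be comfortably within the
per-filesystem limits quoted there.)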


If not, how many would be reasonable?


Lustre is designed to scale, so a config with a small number of OSTs
on very few OSSes doesn't make that much sense.  OSTs are pretty much
expected to be decent-sized RAIDs.  There are tradeoffs among cost-efficient
disk sizes (maybe 16T today), RAID overhead (usually N+2), and bandwidth
(HBA and OSS network).
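
to make that concrete with assumed (not recommended) numbers: ten 16T drives
in an 8+2 RAID6 give roughly 8 x 16 = 128 TB of usable space per OST, so a
2 PB filesystem needs on the order of 16 such OSTs, spread across however
many OSSes the network bandwidth justifies.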


Can mounting multiple filesystems affect the stability of each filesystem
or cause other problems?


My experience is that the main factor in reliability is device count,
rather than how the devices are organized.  For instance, if you
have more OSSes, you may get linearly nicer performance, but 
you also increase your chance of having components crash or fail.
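
as a back-of-the-envelope illustration (numbers assumed, not measured): if each
server independently has a 2% chance of failing in a given month, ten servers
give a chance of 1 - 0.98^10 (about 18%) that at least one fails that month -
the fleet-wide incident rate grows roughly linearly with component count.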


The main reason for separate filesystems is usually that the MDS
(maybe MDT) can be a bottleneck.  But you can scale MDSes instead.

regards, mark hahn.


Re: [lustre-discuss] kernel panics with Lustre 2.7

2018-07-26 Thread Mark Hahn

cep4-fs-MDT-osd: Overflow in tracking declares for index, rb = 4


there is a bug fixed in 2.8 (LU-4045) that may address this
(at least the "Overflow...index" message appears there).


Re: [lustre-discuss] multiple filesystems in MGS vs folder-based ACL ? pros/cons

2018-06-28 Thread Mark Hahn

FYI, Project Quotas exist beginning with Lustre 2.10.0.


but not yet for ZFS configs, right?  sorry, I don't remember whether
the OP mentioned which underlying filesystem they were using...



Re: [lustre-discuss] multiple filesystems in MGS vs folder-based ACL ? pros/cons

2018-06-28 Thread Mark Hahn

We have different research groups; I am thinking of having one filesystem
and, beneath it, project folders controlled with ACLs.


well, the first approach should be to use the normal Unix mechanism:
owners and groups.  ACLs are usually treated as a way to make exceptions,
since owner/group will capture most of the correct sharing relations.

after all, there's little harm in seeing lots of names in your /project
mount.  unless someone botches the permissions, only the right people
can traverse the trees.
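
for what it's worth, the usual group-per-project setup is just a few commands
(group and path names below are made up for illustration):

    # one group per project, and a setgid project directory
    groupadd smithlab
    mkdir -p /project/smithlab
    chgrp smithlab /project/smithlab
    chmod 2770 /project/smithlab    # setgid: new files inherit the group

after that, per-user management is just Unix group membership.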


Just curious, what are the pros/cons of having multiple filesystems vs a single
filesystem with folders?


scalability of management.  it's not obviously scalable to manage many
separate filesystems, but very easy to manage thousands of groups
on a single filesystem.


any advice ?


unless you have to prevent users from even seeing the existence 
of other users, just use a single filesystem.


Lustre's current/traditional owner-based quota accounting 
is a bit of a drag, but eventually there will be project quotas...
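
for reference (paths, names, and IDs below are invented), per-user accounting
today, and the rough shape of the project-quota interface as I understand it:

    # current owner-based accounting
    lfs quota -u alice /mnt/lustre
    # project quotas (2.10+), roughly: tag a tree with a project id, then cap it
    lfs project -p 1001 -s -r /mnt/lustre/projects/smithlab
    lfs setquota -p 1001 -B 10T /mnt/lustre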


regards, mark hahn


Re: [lustre-discuss] Lustre as /home directory

2018-02-15 Thread Mark Hahn

My question is: is it advisable to have /home on Lustre, since users' data
will mostly be small files (less than 5MB)?


certainly it works, but is not very pleasant 
for metadata-intensive activity, such as compiling.



Re: [lustre-discuss] Are there any performance hits with the https://access.redhat.com/security/vulnerabilities/speculativeexecution?

2018-01-05 Thread Mark Hahn

Also to what extent would a Lustre system that is essentially a filer be at
risk? It's not running user code and you're not browsing from it...


to be vulnerable, attack code must run on the system.


Re: [lustre-discuss] ZFS-OST layout, number of OSTs

2017-10-24 Thread Mark Hahn

It's also worth noting that if you have small OSTs it's much easier to bump
into a full OST situation.  And specifically, if you singly stripe a file,
the file size is limited by the size of the OST.


is there enough real-life experience to know whether 
progressive file layout will mitigate this issue?
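
for anyone curious, the PFL interface in 2.10+ sets extent/stripe-count pairs
per directory or file; the boundaries below are just made-up examples:

    # small files stay on one OST; files spread over more OSTs as they grow
    lfs setstripe -E 1G -c 1 -E 64G -c 4 -E -1 -c -1 /mnt/lustre/somedir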


thanks,
Mark Hahn | SHARCnet Sysadmin | h...@sharcnet.ca | http://www.sharcnet.ca
  | McMaster RHPCS| h...@mcmaster.ca | 905 525 9140 x24687
  | Compute/Calcul Canada| http://www.computecanada.ca


Re: [lustre-discuss] Linux users are not able to access lustre folders

2017-10-20 Thread Mark Hahn

No directory /home/luser6
Logging in with home="/".


perhaps selinux or the wacky new systemd "ProtectHome" stuff?


Re: [lustre-discuss] Lustre [2.8.0] and the Linux Automounter

2017-06-19 Thread Mark Hahn

We are considering using Lustre [2.8.0] with the Linux automounter.  Is
anyone using this combination successfully?  Are there any caveats?


fwiw, we use static Lustre mounts and then autofs doing bind mounts
if we want to make the exposed namespace tidier.  Lustre client mounts
are inherently heavier-weight than simple local (or NFS) mounts;
I'd definitely try to avoid doing them frequently.
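
roughly (paths invented for illustration), that means one static client mount
in fstab plus an autofs map of bind mounts on top of it:

    # /etc/fstab: the single static Lustre client mount
    mgs1@tcp:/scratch   /lustre/scratch   lustre   defaults,_netdev  0 0

    # /etc/auto.master
    /project  /etc/auto.project

    # /etc/auto.project: bind mounts that tidy the exposed namespace
    smithlab  -fstype=bind  :/lustre/scratch/projects/smithlab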

Admittedly, we haven't tried Lustre automounts for years, but IMO automounts
only make sense when the FS is infrequently used.  For us, it would be very
strange if Lustre weren't in use - an idle node, basically.

regards, 
Mark Hahn | SHARCnet Sysadmin | h...@sharcnet.ca | http://www.sharcnet.ca

  | McMaster RHPCS| h...@mcmaster.ca | 905 525 9140 x24687
  | Compute/Calcul Canada| http://www.computecanada.ca


Re: [lustre-discuss] RobinHood fail on Lustre IEEL v2.5

2017-06-08 Thread Mark Hahn
2017/06/08 16:19:38 [15616/21] FS_Scan | openat failed on 
23/pci-:00:1a.0-usb-0:1.6.1:1.2-event-mouse: Too many levels of symbolic 
links


RBH has strayed into /sys, which is loaded with traps like this...

Currently the filesystem is in production.  Can this be the main reason why it
crashes?


no, I think RBH is being told to scan from / (perhaps by default)
rather than being directed to the Lustre mount point.
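
i.e. check that the config points at the Lustre mount rather than /; something
like this in the robinhood config file (path hypothetical):

    General {
        fs_path = "/mnt/lustre";
        fs_type = lustre;
    }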

regards, mark hahn.


Re: [lustre-discuss] What happens when a file grows

2017-03-30 Thread Mark Hahn

What happens when a file is created without striping and it gets larger
than the available space on the OST that it is on?


ENOSPC.  but I think it's a bit misleading to say "without striping" - 
leaving the filesystem default set at 1 is still a striping choice...
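
for completeness, the usual ways out (values made up; I believe lfs migrate
wants a reasonably recent client):

    # new files in this directory get striped across 4 OSTs, 1MB stripes
    lfs setstripe -c 4 -S 1M /mnt/lustre/bigdata
    # rewrite an existing single-stripe file onto more OSTs
    lfs migrate -c 8 /mnt/lustre/bigdata/huge.dat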


regards, mark hahn.


Re: [lustre-discuss] Lustre on ZFS poor direct I/O performance

2016-10-14 Thread Mark Hahn
anyway if I force direct I/O, for example using oflag=direct in dd, the write
performance drops as low as 8MB/sec

with 1MB block size, and each write has about 120ms latency.


but that's quite a small block size.  do you approach buffered performance
if you write significantly bigger blocks (8-32M)?  presumably you're already
striping across OSTs?
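
e.g. something along these lines (test path made up; each run writes 4GB):

    dd if=/dev/zero of=/mnt/lustre/ddtest bs=1M  count=4096 oflag=direct
    dd if=/dev/zero of=/mnt/lustre/ddtest bs=16M count=256  oflag=direct
    dd if=/dev/zero of=/mnt/lustre/ddtest bs=16M count=256  conv=fdatasync  # buffered, flushed at the end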


Re: [lustre-discuss] Advice Lustre & ZFS.

2016-09-29 Thread Mark Hahn

Isn't hw raid bounds and leagues faster than any software raid that you're
ever going to be putting in?


certainly not.  choice of hw/sw raid is largely down to taste:
do you want an opaque/proprietary but more hands-off admin workflow,
or do you want direct control of what's happening (failovers, etc)?

sw raid has been able to saturate any reasonable set of disks efficiently
for many years.  after all, device-to-memory IO already goes through the
processor, memory bandwidth is "leagues" greater than IO, and Moore
has given us an excess of very fast cores that can each do ~10 GB/s of
raid6 calculations...
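
you can see the kernel's own estimate of that on any md host; it benchmarks
the raid6 code at boot, e.g.:

    dmesg | grep -i raid6
    # typically prints lines like "raid6: using algorithm sse2x4 gen() N MB/s",
    # with the per-core figure depending on kernel and CPU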


regards, mark hahn.


Re: [lustre-discuss] MDS crashing: unable to handle kernel paging request at 00000000deadbeef (iam_container_init+0x18/0x70)

2016-04-13 Thread Mark Hahn

We had to use lustre-2.5.3.90 on the MDS servers because of memory leak.

https://jira.hpdd.intel.com/browse/LU-5726


Mark,

If you don't have the patch for LU-5726, then you should definitely try to get that 
one.  If nothing else, reading through the bug report might be useful.  It details 
some of the MDS OOM problems I had and mentions setting vm.zone_reclaim_mode=0.  It 
also has Robin Humble's suggestion of setting "options libcfs 
cpu_npartitions=1" (which is something that I started doing as well).


thanks, we'll be trying the LU-5726 patch and cpu_npartitions things.
it's quite a long thread - do I understand correctly that periodic
vm.drop_caches=1 can postpone the issue?  I can replicate the warning
signs mentioned in the thread (growth of Inactive(file) to dizzying 
heights when doing a lot of unlinks).
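
in case it helps to compare notes, the crude monitoring/mitigation we're
experimenting with (no claim that the interval or threshold is right):

    # watch the suspicious counter
    grep 'Inactive(file)' /proc/meminfo
    # periodically drop clean page cache (reclaimable, so relatively benign)
    echo 1 > /proc/sys/vm/drop_caches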


It seems odd that if this is purely a memory balance problem,
it manifests as a 0xdeadbeef panic rather than OOM.  While I understand
that the oom-killer path itself needs memory to operate, does this
also imply that some allocation in the kernel or filesystem is not
checking a return value?


thanks, mark hahn.


Re: [lustre-discuss] MDS crashing: unable to handle kernel paging request at 00000000deadbeef (iam_container_init+0x18/0x70)

2016-04-12 Thread Mark Hahn

Our problem seems to correlate with an unintentional creation of a tree of 
>500M files.  Some of the crashes we've had since then appeared
to be related to vm.zone_reclaim_mode=1.  We also enabled quotas right after 
the 500M file thing, and were thinking that inconsistent
quota records might cause this sort of crash.


Have you set vm.zone_reclaim_mode=0 yet?  I had an issue with this on my
file system a while back when it was set to 1.


all our existing Lustre MDSes run happily with vm.zone_reclaim_mode=0,
and making this one consistent appears to have resolved a problem
(in which one family of lustre kernel threads would appear to spin,
with "perf top" showing nearly all time spent in spinlock_irq, iirc).

might your system have had a *lot* of memory?  ours tend to be 
fairly modest (32-64G, dual-socket intel.)
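
for completeness, pinning that is just standard sysctl stuff:

    sysctl -w vm.zone_reclaim_mode=0
    echo 'vm.zone_reclaim_mode = 0' >> /etc/sysctl.conf   # persist across reboots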


thanks,
Mark Hahn | SHARCnet Sysadmin | h...@sharcnet.ca | http://www.sharcnet.ca
  | McMaster RHPCS| h...@mcmaster.ca | 905 525 9140 x24687
  | Compute/Calcul Canada| http://www.computecanada.ca


Re: [lustre-discuss] MDS crashing: unable to handle kernel paging request at 00000000deadbeef (iam_container_init+0x18/0x70)

2016-04-12 Thread Mark Hahn

Giving the rest of the backtrace of the crash would help the developers
looking at it.


It's a lot easier to tell what code is involved with the whole trace.


thanks.  I'm sure that's the case, but these oopsen are truncated.
well, one was slightly longer:

BUG: unable to handle kernel paging request at deadbeef
IP: [] iam_container_init+0x18/0x70 [osd_ldiskfs]
PGD 0 
Oops: 0002 [#1] SMP 
last sysfs file: /sys/devices/system/cpu/online
CPU 14 
Modules linked in: osp(U) mdd(U) lfsck(U) lod(U) mdt(U) mgs(U) mgc(U) fsfilt_ldiskfs(U) osd_ldiskfs(U) lquota(U) ldiskfs(U) lustre(U) lov(U) osc(U) mdc(U) fid(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) sha512_generic sha256_generic crc32c_intel libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc mlx4_en ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 iTCO_wdt iTCO_vendor_support serio_raw raid10 i2c_i801 lpc_ich mfd_core ipmi_devintf mlx4_core sg acpi_pad igb dca i2c_algo_bit i2c_core ptp pps_core shpchp ext4 jbd2 mbcache raid1 sr_mod cdrom sd_mod crc_t10dif isci libsas mpt2sas scsi_transport_sas raid_class ahci wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 7768, comm: mdt00_039 Not tainted 2.6.32-431.23.3.el6_lustre.x86_64 #1 
Supermicro SYS-2027R-WRF/X9DRW

by way of straw-grasping, I'll mention two other very frequent messages
we're seeing on the MDS in question:

Lustre: 17673:0:(mdt_xattr.c:465:mdt_reint_setxattr()) covework-MDT: client 
miss to set OBD_MD_FLCTIME when setxattr system.posix_acl_access: [object 
[0x200031f84:0x1cad0:0x0]] [valid 68719476736]

(which seems to be https://jira.hpdd.intel.com/browse/LU-532 and a 
consequence of some of our very old clients.  but not MDS-crash-able.)


LustreError: 22970:0:(tgt_lastrcvd.c:813:tgt_last_rcvd_update()) 
covework-MDT: trying to overwrite bigger transno:on-disk: 197587694105, 
new: 197587694104 replay: 0. see LU-617.

perplexing because the MDS is 2.5.3 and 
https://jira.hpdd.intel.com/browse/LU-617 shows fixed circa 2.2.0/2.1.2.

(and our problem isn't with recovery, afaict.)

thanks!

regards,
Mark Hahn | SHARCnet Sysadmin | h...@sharcnet.ca | http://www.sharcnet.ca
  | McMaster RHPCS| h...@mcmaster.ca | 905 525 9140 x24687
  | Compute/Calcul Canada| http://www.computecanada.ca


[lustre-discuss] MDS crashing: unable to handle kernel paging request at 00000000deadbeef (iam_container_init+0x18/0x70)

2016-04-12 Thread Mark Hahn

One of our MDSs is crashing with the following:

BUG: unable to handle kernel paging request at deadbeef
IP: [] iam_container_init+0x18/0x70 [osd_ldiskfs]
PGD 0
Oops: 0002 [#1] SMP

The MDS is running 2.5.3-RC1--PRISTINE-2.6.32-431.23.3.el6_lustre.x86_64
with about 2k clients ranging from 1.8.8 to 2.6.0

I'd appreciate any comments on where to point fingers: google doesn't
provide anything suggestive about iam_container_init.

Our problem seems to correlate with an unintentional creation of a tree 
of >500M files.  Some of the crashes we've had since then appeared
to be related to vm.zone_reclaim_mode=1.  We also enabled quotas 
right after the 500M file thing, and were thinking that inconsistent
quota records might cause this sort of crash.

But 0xdeadbeef is usually added as a canary for allocation issues;
is it used this way in Lustre?

thanks,
Mark Hahn | SHARCnet Sysadmin | h...@sharcnet.ca | http://www.sharcnet.ca
  | McMaster RHPCS| h...@mcmaster.ca | 905 525 9140 x24687
  | Compute/Calcul Canada| http://www.computecanada.ca


Re: [Lustre-discuss] recovery from multiple disk failures on the same md

2012-05-07 Thread Mark Hahn
> I'd also recommend starting periodic scrubbing: we do this once per month
> with low priority (~5 MB/s) with little impact on the users.

yes.  and if you think a rebuild might overstress marginal disks,
throttling via the dev.raid.speed_limit_max sysctl can help.
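
concretely, with md the monthly scrub and the throttle look roughly like this
(md0 is just an example device):

    # kick off a scrub/verify pass (many distros ship a cron job for this)
    echo check > /sys/block/md0/md/sync_action
    # keep resync/scrub from starving production I/O (values are KB/s)
    sysctl -w dev.raid.speed_limit_max=5000
    sysctl -w dev.raid.speed_limit_min=1000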


Re: [Lustre-discuss] Failover / reliability using SAS direct-attached storage

2011-07-23 Thread Mark Hahn
> It seems an external fibre
> or SAS raid is needed,

to be precise, a redundant-path SAN is needed.  you could do it with 
commodity disks and Gb, or you can spend almost unlimited amounts on 
gold-plated disks, FC switches, etc.

the range of costs is really quite remarkable, I guess O(100x). 
compare this to cars where even VERY nice production cars are only 
a few times more expensive than the most cost-effective ones.

> as the idea of losing the file system if one
> node goes down doesn't seem good, even if temporary.

how often do you expect nodes to fail, and why?

regards, mark hahn.


Re: [Lustre-discuss] software raid

2011-03-25 Thread Mark Hahn
choosing sw vs hw raid depends entirely on your systems, pocketbook, taste.
I think there are two edge cases which are pretty unambiguous:

- modern systems have obscene CPU power and memory bandwidth,
at least compared to disks.  and even compared to the embedded
cpu in raid cards.  this means that software raid is very fast
and attractive for at least moderate numbers of disks.  because
disks are so incredibly cheap, it's almost a shame not to use
the 6+ sata ports present on every motherboard, for instance.

- if you need to minimize CPU and memory-bandwidth overheads,
or address very large numbers of disks, you want as much hardware
assist as you can get even though it's expensive and wimpy.
having 100 15k rpm SAS disks as JBOD under SW raid would make
little sense, since the disks, expanders, backplanes and controllers
overwhelm the cost savings.
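
for the first case, the whole thing is basically a one-liner (device names
obviously illustrative):

    # six SATA disks on motherboard ports -> one RAID6 md device
    mdadm --create /dev/md0 --level=6 --raid-devices=6 /dev/sd[b-g]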

I think it boils down to your personal weighting of factors in TCO.

"classic" best practices, for instance, emphasizes device reliability
to maintain extreme uptime and minimize admin monkeywork.  that's fine,
but it's completely opposite to the less ideological, more market-reality
driven approach that recognizes disks cost $30/TB and dropping, and that 
with appropriate use of redundancy, mass-market hardware can still achive 
however many nines you set your heart on.

it is convenient that a 2u node supporting 6-12 disks can be built with 
the free/builtin controller and SW raid, and delivers bandwidth that matches 
relevant network interfaces (10G, IB).  I like the fact that a single unit
like that has no "extra" firmware to maintain, or over-smart controllers
to go bonkers.  IPMI power control includes the disks.  SMART works directly.
and in a pinch, the content can be brought online via any old PC.
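
those last two points amount to, roughly:

    # on any rescue box: find and assemble the array from its member disks
    mdadm --assemble --scan
    # query drive health directly, with no controller firmware in the way
    smartctl -H /dev/sdb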

I've used MD since it was new in the kernel, and never had problems with it.

regards, mark hahn