Re: [lustre-discuss] ZFS zpool/filesystem operations while mounted with '-t lustre'

2023-05-19 Thread Peter Grandi via lustre-discuss
>>> On Thu, 18 May 2023 15:13:24 +0100, Peter Grandi via lustre-discuss 
>>>  said:

>> You might want to take a look at this: 
>> https://www.opensfs.org/wp-content/uploads/2017/06/Wed06-CroweTom-lug17-ost_data_migration_using_ZFS.pdf
> I was indeed reading that but I was a bit hesitant because the 
> "zpool"/"zfs" operations are bracketed by 'service lustre stop 
> ...'/'service lustre start ...' commands which I hope to avoid.

So taking a snapshot of the MDT "filesystem" and then doing 'zfs
send' of it just works. I think that:

* It may be important that the snapshot be read-only

* The snapshotting should be preceded by 'barrier_freeze' and
  followed by 'barrier_thaw'.

* It is also useful to have the option '-s' on 'zfs receive' to
  make the 'zfs send' restartable.

* Probably the snapshot can be done *without* a 'barrier_freeze'
  first; once it is done, one can do 'barrier_freeze', take
  another snapshot, 'barrier_thaw', and then send that very short
  incremental, to minimize the time where the Lustre instance is
  frozen (sketched below).
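
For the record, a minimal sketch of that whole sequence (the backup
pool name 'bkup' and the snapshot names are placeholders of mine,
and the freeze timeout is arbitrary):

  # full send of a baseline snapshot, taken without freezing
  zfs snapshot temp01_mdt_000/temp01_mdt_000@base
  zfs send temp01_mdt_000/temp01_mdt_000@base \
    | zfs receive -s bkup/temp01_mdt_000
  # short freeze window just for the final incremental
  lctl barrier_freeze temp01 30
  zfs snapshot temp01_mdt_000/temp01_mdt_000@final
  lctl barrier_thaw temp01
  zfs send -i @base temp01_mdt_000/temp01_mdt_000@final \
    | zfs receive -s bkup/temp01_mdt_000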

I will find the time to request an update of the backup section
of the Lustre operations manual to document this.

With the ZFS backend improvements in recent years I guess that
ZFS rather than 'ldiskfs' should by now be the default storage
backend for MDTs too.


Re: [lustre-discuss] ZFS zpool/filesystem operations while mounted with '-t lustre'

2023-05-18 Thread Peter Grandi via lustre-discuss

You only need to stop lustre if you are planning to decommission the old 
hardware and switch over entirely to the new hardware.  In that case, stopping 
lustre is needed to ensure that no new content is created on the mdt during the 
final sync.


Thanks, that was my guess, as that presentation described a full 
migration rather than an incremental backup situation.



If you are just trying to keep an on-going backup of the mdt on another zpool, 
you can just keep doing incremental snapshots and don't have to stop lustre 
(with the knowledge that your backup of the mdt data will be slightly old).


Good guess, that's pretty much what I want to do. BTW, as a 
belt-and-braces approach I also mean to back up the same data with 
'tar' and 'getfattr' (just in case) to different media.
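
Roughly what I have in mind for that part (paths are placeholders, 
and it assumes the snapshot is mounted read-only somewhere, e.g. 
under '/mnt/mdt_snap'):

  cd /mnt/mdt_snap
  # save the extended attributes separately, just in case
  getfattr -R -d -m '.*' -e hex -P . > /backup/mdt_ea-$(date +%Y%m%d).bak
  # archive the files themselves, keeping xattrs and sparse regions
  tar czf /backup/mdt-$(date +%Y%m%d).tgz --xattrs --sparse .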


BTW I wonder whether I should bother to do the same for the MGT -- on 
one hand it is tiny, on the other my guess is that its state is pretty 
much irrelevant in case of restoring the MDT.



Re: [lustre-discuss] ZFS zpool/filesystem operations while mounted with '-t lustre'

2023-05-18 Thread Peter Grandi via lustre-discuss

You might want to take a look at this: 
https://www.opensfs.org/wp-content/uploads/2017/06/Wed06-CroweTom-lug17-ost_data_migration_using_ZFS.pdf


I was indeed reading that but I was a bit hesitant because the 
"zpool"/"zfs" operations are bracketed by 'service lustre stop 
...'/'service lustre start ...' commands which I hope to avoid.



[...] basically the same procedure to make incremental copies of mdt data when 
we were attempting to switch over to a new mds server.  You don't need to 
unmount lustre in order to do the incremental backups.


Thanks for the confirmation, it is nice to be a bit surer that my 
expectation was realistic.




[lustre-discuss] ZFS zpool/filesystem operations while mounted with '-t lustre'

2023-05-18 Thread Peter Grandi via lustre-discuss
I have a Lustre 2.15.2 instance "temp01" on ZFS 2.1.5 (on EL8), and I 
just want to backup the MDT of the instance (I am mirroring the data on 
two separate "pools" of servers).


The "zpool" is called "temp01_mdt_000" and so is the filesystem, so the 
'/etc/fstab' mount line is (I have set legacy ZFS mount-points):



temp01_mdt_000/temp01_mdt_000 /srv/temp01/temp01_mdt_000 lustre 
defaults,noatime,auto 0 0


As usual '/srv/temp01/temp01_mdt_000' is opaque (no access permissions) 
when mounted as 'lustre', but anyhow I would like to backup a snapshot 
of it, using 'zfs send'.


If I use 'zpool list' and 'zfs list' I see the relevant details even if 
it is mounted as 'lustre'. If I un-mount it and re-mount it as 'zfs' it 
looks ordinary, but I would rather not do that.


In theory, however it is mounted, I should be able to create a 
snapshot of it (suitably preceded by 'lctl barrier_freeze'), mount 
that snapshot as 'zfs' on some other mount-point, and then 'zfs send' it.
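
Something like this is what I have in mind (an untested sketch; the 
snapshot name, the scratch mount-point and the dump file are mine):

  lctl barrier_freeze temp01
  zfs snapshot temp01_mdt_000/temp01_mdt_000@backup
  lctl barrier_thaw temp01
  # a snapshot can be mounted read-only for inspection...
  mount -t zfs temp01_mdt_000/temp01_mdt_000@backup /mnt/inspect
  # ...and can be sent without being mounted at all
  zfs send temp01_mdt_000/temp01_mdt_000@backup > /backup/temp01_mdt_000.zsend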


Before doing it on a live Lustre instance (I can't easily afford to 
set up a test Lustre instance in the short run) I would like some 
confirmation, ideally from someone who does this routinely, that it 
is meant to work :-).




Re: [lustre-discuss] ZFS Support for Lustre

2023-02-24 Thread Laura Hild via lustre-discuss
Hi Nick-

I see that you are using zfs-dkms instead of kmod-zfs.  When using zfs-dkms, it 
is of particular relevance to your modprobe error whether DKMS actually built 
and installed modules into your running kernel's modules directory.  Does

  dkms status

say it has installed modules for that specific kernel version?  If not, you 
will have to investigate why it didn't.
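
If it doesn't, something along these lines usually sorts it out (a 
sketch, with your kernel version filled in):

  dkms status
  # rebuild and install the modules for the Lustre kernel if missing
  dkms install zfs/2.1.5 -k 4.18.0-425.3.1.el8_lustre.x86_64
  depmod -a 4.18.0-425.3.1.el8_lustre.x86_64
  modprobe zfs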

-Laura


[lustre-discuss] ZFS Support for Lustre

2023-02-24 Thread Nick dan via lustre-discuss
Hi

I have installed following packages of Lustre and ZFS

*LUSTRE*
[root@1u fs]# rpm -qa | grep lustre
kmod-lustre-tests-debuginfo-2.15.2-1.el8.x86_64
lustre-zfs-dkms-2.15.2-1.el8.noarch
kernel-debuginfo-common-x86_64-4.18.0-425.3.1.el8_lustre.x86_64
kernel-selftests-internal-4.18.0-425.3.1.el8_lustre.x86_64
kernel-core-4.18.0-425.3.1.el8_lustre.x86_64
kernel-devel-4.18.0-425.3.1.el8_lustre.x86_64
lustre-osd-zfs-mount-2.15.2-1.el8.x86_64
lustre-debuginfo-2.15.2-1.el8.x86_64
lustre-tests-debuginfo-2.15.2-1.el8.x86_64
kernel-modules-internal-4.18.0-425.3.1.el8_lustre.x86_64
kernel-debuginfo-4.18.0-425.3.1.el8_lustre.x86_64
lustre-iokit-2.15.2-1.el8.x86_64
kernel-ipaclones-internal-4.18.0-425.3.1.el8_lustre.x86_64
python3-perf-4.18.0-425.3.1.el8_lustre.x86_64
perf-4.18.0-425.3.1.el8_lustre.x86_64
kmod-lustre-osd-zfs-2.15.2-1.el8.x86_64
kmod-lustre-osd-zfs-debuginfo-2.15.2-1.el8.x86_64
kernel-headers-4.18.0-425.3.1.el8_lustre.x86_64
kernel-4.18.0-425.3.1.el8_lustre.x86_64
kmod-lustre-2.15.2-1.el8.x86_64
kernel-modules-extra-4.18.0-425.3.1.el8_lustre.x86_64
kmod-lustre-tests-2.15.2-1.el8.x86_64
lustre-debugsource-2.15.2-1.el8.x86_64
kernel-modules-4.18.0-425.3.1.el8_lustre.x86_64
kmod-lustre-debuginfo-2.15.2-1.el8.x86_64
perf-debuginfo-4.18.0-425.3.1.el8_lustre.x86_64
lustre-osd-ldiskfs-mount-2.15.2-1.el8.x86_64
lustre-osd-zfs-mount-debuginfo-2.15.2-1.el8.x86_64

*ZFS*
lustre-zfs-dkms-2.15.2-1.el8.noarch
libzfs5-devel-2.1.5-1.el8.x86_64
zfs-debuginfo-2.1.5-1.el8.x86_64
lustre-osd-zfs-mount-2.15.2-1.el8.x86_64
zfs-2.1.5-1.el8.x86_64
python3-pyzfs-2.1.5-1.el8.noarch
zfs-test-2.1.5-1.el8.x86_64
kmod-lustre-osd-zfs-2.15.2-1.el8.x86_64
kmod-lustre-osd-zfs-debuginfo-2.15.2-1.el8.x86_64
zfs-debugsource-2.1.5-1.el8.x86_64
libzfs5-debuginfo-2.1.5-1.el8.x86_64
libzfs5-2.1.5-1.el8.x86_64
zfs-dkms-2.1.5-1.el8.noarch
zfs-test-debuginfo-2.1.5-1.el8.x86_64
lustre-osd-zfs-mount-debuginfo-2.15.2-1.el8.x86_64

When I try to modprobe zfs, this is the error I am getting.
I have already rebooted the server.

[root@1u user]# modprobe zfs
modprobe: FATAL: Module zfs not found in directory
/lib/modules/4.18.0-425.3.1.el8_lustre.x86_64

FYI
[root@1u user]# cd
/lib/modules/4.18.0-425.3.1.el8_lustre.x86_64/extra/lustre-osd-zfs/fs
[root@1u fs]# ls
osd_zfs.ko

Can you help with the error?

Regards
Nick Dan


Re: [lustre-discuss] ZFS Support For Lustre

2023-02-18 Thread Hans Henrik Happe via lustre-discuss

Hi,

The repos, in general, only work for the kernel they were built for. 
This will be the kernel supported by the release. Look in the 
changelog. For 2.15.1:


https://wiki.lustre.org/Lustre_2.15.1_Changelog

To make newer kernels work you have to compile Lustre yourself, and that 
might not even work without patching, and hasn't gone through testing.


You are better off following the supported kernel on the servers. The 
client is usually more likely to compile/work on newer kernels, but 
patching might be needed.


Cheers,
Hans Henrik

On 14.02.2023 13.16, Nick dan via lustre-discuss wrote:

Hi

I am using Lustre Version 2.15.1 on RedHat 8.8
As mentioned in the link, 
https://wiki.whamcloud.com/display/PUB/Lustre+Support+Matrix , the ZFS 
Version required is 2.1.2.
However, when I am trying to install ZFS from 
https://downloads.whamcloud.com/public/lustre/lustre-2.15.1/el8.6/server/RPMS/x86_64/

I am getting the following error
[root@st01 user]# yum install 
https://downloads.whamcloud.com/public/lustre/lustre-2.15.1/el8.6/server/RPMS/x86_64/zfs-2.1.2-1.el8.x86_64.rpm

Updating Subscription Management repositories.
Last metadata expiration check: 2:01:30 ago on Tue 14 Feb 2023 
03:38:48 PM IST.

zfs-2.1.2-1.el8.x86_64.rpm                  248 kB/s | 649 kB     00:02
Error:
 Problem: conflicting requests
  - nothing provides zfs-kmod = 2.1.2 needed by zfs-2.1.2-1.el8.x86_64
(try to add '--skip-broken' to skip uninstallable packages or 
'--nobest' to use not only best candidate packages)


I have installed the other required packages like libzfs, libzpool, 
libnvpair, libutil.


I am not able to download kmod-zfs version 2.1.2, as the latest 
version getting downloaded is 2.1.9


Can you help with this or suggest another way to download all 
supported ZFS Packages?


Thanks,
Nick Dan



Re: [lustre-discuss] ZFS Support For Lustre

2023-02-15 Thread Laura Hild via lustre-discuss
Hi Nick-

I too have had little success finding pre-built kmod-zfs to pair with Lustre.  
To provide zfs-kmod = 2.1.2 I believe you can use either the zfs-dkms package 
(which also provides `zfs-kmod` despite not being `kmod-zfs`), or build your 
own kmod-zfs as I suggested in one of my previous messages 
(http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2023-January/018441.html).

-Laura



[lustre-discuss] ZFS Support For Lustre

2023-02-14 Thread Nick dan via lustre-discuss
Hi

I am using Lustre Version 2.15.1 on RedHat 8.8
As mentioned in the link,
https://wiki.whamcloud.com/display/PUB/Lustre+Support+Matrix , the ZFS
Version required is 2.1.2.
However, when I am trying to install ZFS from
https://downloads.whamcloud.com/public/lustre/lustre-2.15.1/el8.6/server/RPMS/x86_64/
I am getting the following error
[root@st01 user]# yum install
https://downloads.whamcloud.com/public/lustre/lustre-2.15.1/el8.6/server/RPMS/x86_64/zfs-2.1.2-1.el8.x86_64.rpm
Updating Subscription Management repositories.
Last metadata expiration check: 2:01:30 ago on Tue 14 Feb 2023 03:38:48 PM
IST.
zfs-2.1.2-1.el8.x86_64.rpm  248 kB/s | 649 kB 00:02
Error:
 Problem: conflicting requests
  - nothing provides zfs-kmod = 2.1.2 needed by zfs-2.1.2-1.el8.x86_64
(try to add '--skip-broken' to skip uninstallable packages or '--nobest' to
use not only best candidate packages)

I have installed the other required packages like libzfs, libzpool,
libnvpair, libutil.

I am not able to download kmod-zfs version 2.1.2, as the latest version
getting downloaded is 2.1.9

Can you help with this or suggest another way to download all supported ZFS
Packages?

Thanks,
Nick Dan


Re: [lustre-discuss] ZFS rpm not getting install.

2023-01-24 Thread Laura Hild via lustre-discuss
Those dependencies are provided by the kmod-zfs package, which is not included 
in the same repository.  It looks like the oldest kmod-zfs provided by the 
OpenZFS project for EL8.6 is 2.1.4, which might work, but the straightforward 
thing to do is probably just to build a kmod-zfs-2.1.2 yourself following the 
instructions at

https://openzfs.github.io/openzfs-docs/Developer%20Resources/Custom%20Packages.html#kabi-tracking-kmod
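
In outline, that procedure looks roughly like this (a sketch from 
memory -- the linked page has the authoritative dependency list):

  yum install -y gcc make autoconf automake libtool rpm-build \
    kernel-devel-$(uname -r) libtirpc-devel libblkid-devel libuuid-devel \
    libudev-devel openssl-devel zlib-devel libaio-devel libattr-devel \
    elfutils-libelf-devel python3-devel
  tar xzf zfs-2.1.2.tar.gz && cd zfs-2.1.2
  ./configure --with-spec=redhat     # kABI-tracking kmod packages
  make -j1 rpm-kmod rpm-utils

The resulting kmod-zfs and zfs RPMs should then satisfy the zfs-kmod 
dependency.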



From: lustre-discuss on behalf of Nick dan 
via lustre-discuss 
Sent: Tuesday, 24 January 2023 07:41
To: lustre-discuss@lists.lustre.org; lustre-discuss-requ...@lists.lustre.org; 
lustre-discuss-ow...@lists.lustre.org
Subject: [EXTERNAL] [lustre-discuss] ZFS rpm not getting install.

Hi,
I'm trying to install kmod-lustre-osd-zfs-2.15.1-1.el8.x86_64.rpm but it is 
giving some dependency error of ksym.
Error attached below.
Can you please help?


Re: [lustre-discuss] ZFS with Lustre

2023-01-03 Thread James Lam via lustre-discuss
https://wiki.lustre.org/Lustre_with_ZFS_Install

This covers most of the topics

Sent from Mail for Windows

From: Nick dan
Sent: Tuesday, January 3, 2023 7:51 PM
To: lustre-discuss@lists.lustre.org; lustre-discuss-requ...@lists.lustre.org; 
Degremont, Aurelien; Taner KARAGÖL; lustre-discuss-ow...@lists.lustre.org; 
i...@whamcloud.com; firatyilm...@gmail.com; James Lam
Subject: ZFS with Lustre

Hi

What packages are needed to be installed on server and client for using Lustre 
with ZFS?



[lustre-discuss] ZFS with Lustre

2023-01-03 Thread Nick dan via lustre-discuss
Hi

What packages are needed to be installed on server and client for using
Lustre with ZFS?


Re: [lustre-discuss] ZFS file error of MDT

2022-09-25 Thread Ian Yi-Feng Chang via lustre-discuss
Hi Laura,
Thank you for the feedback.
I'm wondering if I could remove the corrupted file from the MDT and clear the
file error. Without the file error, the Lustre storage might start
again. I understand some files would definitely be lost, but at least we
would have an opportunity to recover the other files.

Best,
Ian


On Sat, Sep 24, 2022 at 5:35 AM Laura Hild  wrote:

> Hi Ian-
>
> It looks to me like that hardware RAID array is giving ZFS data back that
> is not what ZFS thinks it wrote.  Since from ZFS’ perspective there is no
> redundancy in the pool, only what the RAID array returns, ZFS cannot
> reconstruct the file to its satisfaction, and rather than return data that
> ZFS thinks is corrupt, it is refusing to allow that file to be accessed at
> all.  Lustre, which relies on the lower layers for redundancy, expects the
> file to be accessible, and it’s not.
>
> -Laura
>
>
> 
> From: lustre-discuss on behalf of Ian
> Yi-Feng Chang via lustre-discuss 
> Sent: Wednesday, 21 September 2022 10:53
> To: Robert Anderson; lustre-discuss@lists.lustre.org
> Subject: [EXTERNAL] Re: [lustre-discuss] ZFS file error of MDT
>
> Thanks Robert for the feedback. Actually, I do not know about Lustre at
> all.
> I am also trying to contact the engineer who built the Lustre system for
> more information regarding the drive information.
> To my knowledge, the LustreMDT pool is a 4 SSD disk group (named
> /dev/mapper/SSD) with hardware RAID5.
>
> I can manually mount the LustreMDT/mdt0-work with the following steps:
>
> pcs cluster standby --all (Stop MDS and OSS)
> zpool import LustreMDT
> zfs set canmount=on LustreMDT/mdt0-work
> zfs mount LustreMDT/mdt0-work
>
> Then I ls the file /LustreMDT/mdt0-work/oi.3/0x20003:0x2:0x0 it
> returned I/O error, but other files look fine.
> [root@mds1 mdt0-work]# ls -ahlt
> "/LustreMDT/mdt0-work/oi.3/0x20003:0x2:0x0"
> ls: reading directory /LustreMDT/mdt0-work/oi.3/0x20003:0x2:0x0:
> Input/output error
> total 23M
> drwxr-xr-x 2 root root 2 Jan  1  1970 .
> drwxr-xr-x 0 root root 0 Jan  1  1970 ..
>
> Is this the drive failure situation you were referring to?
>
> Best,
> Ian
>
>
> On Wed, Sep 21, 2022 at 9:32 PM Robert Anderson <robe...@usnh.edu> wrote:
> I could be reading your zpool status output wrong, but it looks like you
> had 2 drives in that pool. Not mirrored, so no fault tolerance. Any drive
> failure would lose half of the pool data.
>
> Unless you can get that drive working you are missing half of your data
> and have no resilience to errors, nothing to recover from.
>
> However you proceed you should ensure that you have a mirrored zfs pool or
> more drives and raidz (I like raidz2).
>
>
> On September 20, 2022 11:57:09 PM Ian Yi-Feng Chang via lustre-discuss <
> lustre-discuss@lists.lustre.org>
> wrote:
>
>
>
> Dear All,
> I think this problem is more related to ZFS, but I would like to ask for
> help from experts in all fields.
> Our MDT cannot work properly after the IB switch was accidentally rebooted
> (power issue).
> Everything looks good except that the MDT cannot be started.
> Our MDT's ZFS didn't have a backup or snapshot.
> I would like to ask, can this problem be fixed, and how?
>
> Thanks for your help in advance.
>
> Best,
> Ian
>
> Lustre: Build Version: 2.10.4
> OS: CentOS Linux release 7.5.1804 (Core)
> uname -r: 3.10.0-862.el7.x86_64
>
>
> [root@mds1 etc]# pcs status
> Cluster name: mdsgroup01
> Stack: corosync
> Current DC: mds1 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with
> quorum
> Last updated: Wed Sep 21 11:46:25 2022
> Last change: Wed Sep 21 11:46:13 2022 by root via cibadmin on mds1
>
> 2 nodes configured
> 9 resources configured
>
> Online: [ mds1 mds2 ]
>
> Full list of resources:
>
>  Resource Group: group-MDS
>  zfs-LustreMDT  (ocf::heartbeat:ZFS):   Started mds1
>  MGT(ocf::lustre:Lustre):   Started mds1
>  MDT(ocf::lustre:Lustre):   Stopped
>  ipmi-fencingMDS1   (stonith:fence_ipmilan):Started mds2
>  ipmi-fencingMDS2   (stonith:fence_ipmilan):Started mds2
>  Clone Set: healthLUSTRE-clone [healthLUSTRE]
>  Started: [ mds1 mds2 ]
>  Clone Set: healthLNET-clone [healthLNET]
>  Started: [ mds1 mds2 ]
>
> Failed Actions:
> * MDT_start_0 on mds1 'unknown error' (1): call=44, status=complete,
> exitreason='',
> last-rc-change='Tue Sep 20 1

Re: [lustre-discuss] ZFS file error of MDT

2022-09-23 Thread Laura Hild via lustre-discuss
Hi Ian-
It looks to me like that hardware RAID array is giving ZFS data back that is 
not what ZFS thinks it wrote.  Since from ZFS’ perspective there is no 
redundancy in the pool, only what the RAID array returns, ZFS cannot 
reconstruct the file to its satisfaction, and rather than return data that ZFS 
thinks is corrupt, it is refusing to allow that file to be accessed at all.  
Lustre, which relies on the lower layers for redundancy, expects the file to be 
accessible, and it’s not.
-Laura


From: lustre-discuss on behalf of Ian 
Yi-Feng Chang via lustre-discuss 
Sent: Wednesday, 21 September 2022 10:53
To: Robert Anderson; lustre-discuss@lists.lustre.org
Subject: [EXTERNAL] Re: [lustre-discuss] ZFS file error of MDT

Thanks Robert for the feedback. Actually, I do not know about Lustre at all.
I am also trying to contact the engineer who built the Lustre system for more 
information regarding the drive information.
To my knowledge, the LustreMDT pool is a 4 SSD disk group (named 
/dev/mapper/SSD) with hardware RAID5.

I can manually mount the LustreMDT/mdt0-work with the following steps:

pcs cluster standby --all (Stop MDS and OSS)
zpool import LustreMDT
zfs set canmount=on LustreMDT/mdt0-work
zfs mount LustreMDT/mdt0-work

Then I ls the file /LustreMDT/mdt0-work/oi.3/0x20003:0x2:0x0 it returned 
I/O error, but other files look fine.
[root@mds1 mdt0-work]# ls -ahlt "/LustreMDT/mdt0-work/oi.3/0x20003:0x2:0x0"
ls: reading directory /LustreMDT/mdt0-work/oi.3/0x20003:0x2:0x0: 
Input/output error
total 23M
drwxr-xr-x 2 root root 2 Jan  1  1970 .
drwxr-xr-x 0 root root 0 Jan  1  1970 ..

Is this the drive failure situation you were referring to?

Best,
Ian


On Wed, Sep 21, 2022 at 9:32 PM Robert Anderson <robe...@usnh.edu> wrote:
I could be reading your zpool status output wrong, but it looks like you had 2 
drives in that pool. Not mirrored, so no fault tolerance. Any drive failure 
would lose half of the pool data.

Unless you can get that drive working you are missing half of your data and 
have no resilience to errors, nothing to recover from.

However you proceed you should ensure that you have a mirrored zfs pool or more 
drives and raidz (I like raidz2).


On September 20, 2022 11:57:09 PM Ian Yi-Feng Chang via lustre-discuss 
<lustre-discuss@lists.lustre.org> wrote:



Dear All,
I think this problem is more related to ZFS, but I would like to ask for help 
from experts in all fields.
Our MDT cannot work properly after the IB switch was accidentally rebooted 
(power issue).
Everything looks good except that the MDT cannot be started.
Our MDT's ZFS didn't have a backup or snapshot.
I would like to ask, can this problem be fixed, and how?

Thanks for your help in advance.

Best,
Ian

Lustre: Build Version: 2.10.4
OS: CentOS Linux release 7.5.1804 (Core)
uname -r: 3.10.0-862.el7.x86_64


[root@mds1 etc]# pcs status
Cluster name: mdsgroup01
Stack: corosync
Current DC: mds1 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Wed Sep 21 11:46:25 2022
Last change: Wed Sep 21 11:46:13 2022 by root via cibadmin on mds1

2 nodes configured
9 resources configured

Online: [ mds1 mds2 ]

Full list of resources:

 Resource Group: group-MDS
 zfs-LustreMDT  (ocf::heartbeat:ZFS):   Started mds1
 MGT(ocf::lustre:Lustre):   Started mds1
 MDT(ocf::lustre:Lustre):   Stopped
 ipmi-fencingMDS1   (stonith:fence_ipmilan):Started mds2
 ipmi-fencingMDS2   (stonith:fence_ipmilan):Started mds2
 Clone Set: healthLUSTRE-clone [healthLUSTRE]
 Started: [ mds1 mds2 ]
 Clone Set: healthLNET-clone [healthLNET]
 Started: [ mds1 mds2 ]

Failed Actions:
* MDT_start_0 on mds1 'unknown error' (1): call=44, status=complete, 
exitreason='',
last-rc-change='Tue Sep 20 15:01:51 2022', queued=0ms, exec=317ms
* MDT_start_0 on mds2 'unknown error' (1): call=48, status=complete, 
exitreason='',
last-rc-change='Tue Sep 20 14:38:18 2022', queued=0ms, exec=25168ms


Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled



After zpool scrub MDT, the zpool status -v of MDT pool reported:

  pool: LustreMDT
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0h35m with 1 errors on Wed Sep 21 09:38:24 2022
config:

NAMESTATE READ WRITE CKSUM
LustreMDT   ONLINE   0 0 2
  SSD   ONLINE   0 0 8

errors: Permanent errors have been detected in the following files:

LustreMDT/mdt0-work:/oi.3/0x20003:0x2:0x0

Re: [lustre-discuss] ZFS file error of MDT

2022-09-21 Thread Ian Yi-Feng Chang via lustre-discuss
Thanks Robert for the feedback. Actually, I do not know about Lustre at
all.
I am also trying to contact the engineer who built the Lustre system for
more information regarding the drive information.
To my knowledge, the LustreMDT pool is a 4 SSD disk group (named
/dev/mapper/SSD) with hardware RAID5.

I can manually mount the LustreMDT/mdt0-work with the following steps:

pcs cluster standby --all (Stop MDS and OSS)
zpool import LustreMDT
zfs set canmount=on LustreMDT/mdt0-work
zfs mount LustreMDT/mdt0-work

Then I ls the file /LustreMDT/mdt0-work/oi.3/0x20003:0x2:0x0 it
returned I/O error, but other files look fine.
[root@mds1 mdt0-work]# ls -ahlt
"/LustreMDT/mdt0-work/oi.3/0x20003:0x2:0x0"
ls: reading directory /LustreMDT/mdt0-work/oi.3/0x20003:0x2:0x0:
Input/output error
total 23M
drwxr-xr-x 2 root root 2 Jan  1  1970 .
drwxr-xr-x 0 root root 0 Jan  1  1970 ..

Is this the drive failure situation you were referring to?

Best,
Ian


On Wed, Sep 21, 2022 at 9:32 PM Robert Anderson  wrote:

> I could be reading your zpool status output wrong, but it looks like you
> had 2 drives in that pool. Not mirrored, so no fault tolerance. Any drive
> failure would lose half of the pool data.
>
> Unless you can get that drive working you are missing half of your data
> and have no resilience to errors, nothing to recover from.
>
> However you proceed you should ensure that you have a mirrored zfs pool or
> more drives and raidz (I like raidz2).
>
> On September 20, 2022 11:57:09 PM Ian Yi-Feng Chang via lustre-discuss <
> lustre-discuss@lists.lustre.org> wrote:
>
>>
>> Dear All,
>> I think this problem is more related to ZFS, but I would like to ask
>> for help from experts in all fields.
>> Our MDT cannot work properly after the IB
>> switch was accidentally rebooted (power issue).
>> Everything looks good except that the MDT cannot be started.
>> Our MDT's ZFS didn't have a backup or snapshot.
>> I would like to ask, can this problem be fixed, and how?
>>
>> Thanks for your help in advance.
>>
>> Best,
>> Ian
>>
>> Lustre: Build Version: 2.10.4
>> OS: CentOS Linux release 7.5.1804 (Core)
>> uname -r: 3.10.0-862.el7.x86_64
>>
>>
>> [root@mds1 etc]# pcs status
>> Cluster name: mdsgroup01
>> Stack: corosync
>> Current DC: mds1 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with
>> quorum
>> Last updated: Wed Sep 21 11:46:25 2022
>> Last change: Wed Sep 21 11:46:13 2022 by root via cibadmin on mds1
>>
>> 2 nodes configured
>> 9 resources configured
>>
>> Online: [ mds1 mds2 ]
>>
>> Full list of resources:
>>
>>  Resource Group: group-MDS
>>  zfs-LustreMDT  (ocf::heartbeat:ZFS):   Started mds1
>>  MGT(ocf::lustre:Lustre):   Started mds1
>>  MDT(ocf::lustre:Lustre):   Stopped
>>  ipmi-fencingMDS1   (stonith:fence_ipmilan):Started mds2
>>  ipmi-fencingMDS2   (stonith:fence_ipmilan):Started mds2
>>  Clone Set: healthLUSTRE-clone [healthLUSTRE]
>>  Started: [ mds1 mds2 ]
>>  Clone Set: healthLNET-clone [healthLNET]
>>  Started: [ mds1 mds2 ]
>>
>> Failed Actions:
>> * MDT_start_0 on mds1 'unknown error' (1): call=44, status=complete,
>> exitreason='',
>> last-rc-change='Tue Sep 20 15:01:51 2022', queued=0ms, exec=317ms
>> * MDT_start_0 on mds2 'unknown error' (1): call=48, status=complete,
>> exitreason='',
>> last-rc-change='Tue Sep 20 14:38:18 2022', queued=0ms, exec=25168ms
>>
>>
>> Daemon Status:
>>   corosync: active/enabled
>>   pacemaker: active/enabled
>>   pcsd: active/enabled
>>
>>
>>
>> After zpool scrub MDT, the zpool status -v of MDT pool reported:
>>
>>   pool: LustreMDT
>>  state: ONLINE
>> status: One or more devices has experienced an error resulting in data
>> corruption.  Applications may be affected.
>> action: Restore the file in question if possible.  Otherwise restore the
>> entire pool from backup.
>>see: http://zfsonlinux.org/msg/ZFS-8000-8A
>> 
>>   scan: scrub repaired 0B in 0h35m with 1 errors on Wed Sep 21 09:38:24
>> 2022
>> config:
>>
>> NAMESTATE READ WRITE CKSUM
>> LustreMDT   ONLINE   0 0 2
>>   SSD   ONLINE   0 0 8
>>
>> errors: Permanent errors have been detected in the following files:
>>
>> LustreMDT/mdt0-work:/oi.3/0x20003:0x2:0x0
>>
>>
>>
>> # dmesg -T
>> [Tue Sep 20 15:01:43 2022] Lustre: Lustre: Build Version: 2.10.4
>> [Tue Sep 20 

[lustre-discuss] ZFS file error of MDT

2022-09-20 Thread Ian Yi-Feng Chang via lustre-discuss
Dear All,
I think this problem is more related to ZFS, but I would like to ask
for help from experts in all fields.
Our MDT cannot work properly after the IB switch was accidentally rebooted
(power issue).
Everything looks good except that the MDT cannot be started.
Our MDT's ZFS didn't have a backup or snapshot.
I would like to ask, can this problem be fixed, and how?

Thanks for your help in advance.

Best,
Ian

Lustre: Build Version: 2.10.4
OS: CentOS Linux release 7.5.1804 (Core)
uname -r: 3.10.0-862.el7.x86_64


[root@mds1 etc]# pcs status
Cluster name: mdsgroup01
Stack: corosync
Current DC: mds1 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with
quorum
Last updated: Wed Sep 21 11:46:25 2022
Last change: Wed Sep 21 11:46:13 2022 by root via cibadmin on mds1

2 nodes configured
9 resources configured

Online: [ mds1 mds2 ]

Full list of resources:

 Resource Group: group-MDS
 zfs-LustreMDT  (ocf::heartbeat:ZFS):   Started mds1
 MGT(ocf::lustre:Lustre):   Started mds1
 MDT(ocf::lustre:Lustre):   Stopped
 ipmi-fencingMDS1   (stonith:fence_ipmilan):Started mds2
 ipmi-fencingMDS2   (stonith:fence_ipmilan):Started mds2
 Clone Set: healthLUSTRE-clone [healthLUSTRE]
 Started: [ mds1 mds2 ]
 Clone Set: healthLNET-clone [healthLNET]
 Started: [ mds1 mds2 ]

Failed Actions:
* MDT_start_0 on mds1 'unknown error' (1): call=44, status=complete,
exitreason='',
last-rc-change='Tue Sep 20 15:01:51 2022', queued=0ms, exec=317ms
* MDT_start_0 on mds2 'unknown error' (1): call=48, status=complete,
exitreason='',
last-rc-change='Tue Sep 20 14:38:18 2022', queued=0ms, exec=25168ms


Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled



After zpool scrub MDT, the zpool status -v of MDT pool reported:

  pool: LustreMDT
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 0h35m with 1 errors on Wed Sep 21 09:38:24 2022
config:

NAMESTATE READ WRITE CKSUM
LustreMDT   ONLINE   0 0 2
  SSD   ONLINE   0 0 8

errors: Permanent errors have been detected in the following files:

LustreMDT/mdt0-work:/oi.3/0x20003:0x2:0x0



# dmesg -T
[Tue Sep 20 15:01:43 2022] Lustre: Lustre: Build Version: 2.10.4
[Tue Sep 20 15:01:43 2022] LNet: Using FMR for registration
[Tue Sep 20 15:01:43 2022] LNet: Added LNI 172.29.32.21@o2ib [8/256/0/180]
[Tue Sep 20 15:01:50 2022] Lustre: MGS: Connection restored to
b5823059-e620-64ac-79f6-e5282f2fa442 (at 0@lo)
[Tue Sep 20 15:01:50 2022] LustreError: 3839:0:(llog.c:1296:llog_backup())
MGC172.29.32.21@o2ib: failed to open log work-MDT: rc = -5
[Tue Sep 20 15:01:50 2022] LustreError:
3839:0:(mgc_request.c:1897:mgc_llog_local_copy()) MGC172.29.32.21@o2ib:
failed to copy remote log work-MDT: rc = -5
[Tue Sep 20 15:01:50 2022] LustreError: 13a-8: Failed to get MGS log
work-MDT and no local copy.
[Tue Sep 20 15:01:50 2022] LustreError: 15c-8: MGC172.29.32.21@o2ib: The
configuration from log 'work-MDT' failed (-2). This may be the result
of communication errors between this node and the MGS, a bad configuration,
or other errors. See the syslog for more information.
[Tue Sep 20 15:01:50 2022] LustreError:
3839:0:(obd_mount_server.c:1386:server_start_targets()) failed to start
server work-MDT: -2
[Tue Sep 20 15:01:50 2022] LustreError:
3839:0:(obd_mount_server.c:1879:server_fill_super()) Unable to start
targets: -2
[Tue Sep 20 15:01:50 2022] LustreError:
3839:0:(obd_mount_server.c:1589:server_put_super()) no obd work-MDT
[Tue Sep 20 15:01:50 2022] Lustre: server umount work-MDT complete
[Tue Sep 20 15:01:50 2022] LustreError:
3839:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount  (-2)
[Tue Sep 20 15:01:56 2022] Lustre:
4112:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has
timed out for slow reply: [sent 1663657311/real 1663657311]
 req@8d6f0e728000 x1744471122247856/t0(0) o251->MGC172.29.32.21@o2ib
@0@lo:26/25 lens 224/224 e 0 to 1 dl 1663657317 ref 2 fl Rpc:XN/0/
rc 0/-1
[Tue Sep 20 15:01:56 2022] Lustre: server umount MGS complete
[Tue Sep 20 15:02:29 2022] Lustre: MGS: Connection restored to
b5823059-e620-64ac-79f6-e5282f2fa442 (at 0@lo)
[Tue Sep 20 15:02:54 2022] Lustre: MGS: Connection restored to
28ec81ea-0d51-d721-7be2-4f557da2546d (at 172.29.32.1@o2ib)


Re: [lustre-discuss] ZFS MDT Corruption

2022-09-18 Thread Scott Ruffner via lustre-discuss
On Fri, Sep 16, 2022 at 4:25 PM Christian Kuntz 
wrote:

> Oof! That's not a good situation to be in. Unfortunately, I've hit the
> dual import situation before as well, and as far as I know once you have
> two nodes import a pool at the same time you're more or less hosed.
>

Many hours later, I'm now coming to that conclusion.

When it happened to me, I tried using zdb to read all the recent TXGs to
> try to back track the pool to a previously working state, but unfortunately
> none of it worked, I think I tried 30 in all. You could try that route,
> maybe you'll be luckier than I.
>

I have tried using zdb to find a TXG to roll back to - I'm at that stage now.

Now might be the time to dust off any remote backups you have or reach out
> to ZFS recovery specialists. Additionally, _always_ enable `zpool set
> multihost=on <pool>` for any pool that can be imported by more than one
> node for this reason. You can ignore hostid checking safely with `zpool
> import -f`, but without multihost set to on you have no protection against
> simultaneous imports.
>

Sadly, there are no backups or snapshots - the system was intended as
ephemeral /scratch storage, so we just don't have that.

For rollback, look into the `-X` and `-T` pool import options. The man page
> for `zdb` should be able to answer most of your questions. Otherwise, a
> common actor in the ZFS recovery scene is https://www.ufsexplorer.com/ (or
> at least as far as I've seen).
>

I've tried a few, however, this is the MDT for a lustre filesystem, so I
can't really roll back very far without introducing corruption into the
Lustre system...so...yeah.

Thanks for responding. I'm talking to the ufs explorer people, it's worth a
single system copy of their Pro product to see if it performs a miracle.

Thanks!

Scott


[lustre-discuss] ZFS MDT Corruption

2022-09-16 Thread Christian Kuntz
Oof! That's not a good situation to be in. Unfortunately, I've hit the dual
import situation before as well, and as far as I know once you have two
nodes import a pool at the same time you're more or less hosed.

When it happened to me, I tried using zdb to read all the recent TXGs to
try to back track the pool to a previously working state, but unfortunately
none of it worked, I think I tried 30 in all. You could try that route,
maybe you'll be luckier than I.

Now might be the time to dust off any remote backups you have or reach out
to ZFS recovery specialists. Additionally, _always_ enable `zpool set
multihost=on <pool>` for any pool that can be imported by more than one
node for this reason. You can ignore hostid checking safely with `zpool
import -f`, but without multihost set to on you have no protection against
simultaneous imports.

For rollback, look into the `-X` and `-T` pool import options. The man page
for `zdb` should be able to answer most of your questions. Otherwise, a
common actor in the ZFS recovery scene is https://www.ufsexplorer.com/ (or
at least as far as I've seen).
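
For concreteness, the commands I mean (a sketch; <pool> and <txg> are
placeholders, the TXG being something you would dig out with zdb, and
the rewind imports are destructive last resorts whose exact spelling
varies a bit between ZFS versions):

  zpool set multihost=on <pool>     # MMP: refuse simultaneous imports
  zpool import -f <pool>            # ignore the hostid check only
  zpool import -F <pool>            # recovery mode: drop the last few TXGs
  zpool import -FX <pool>           # extreme rewind
  zpool import -FT <txg> <pool>     # roll back to a specific TXG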

Sorry for the bad news :(
Christian


[lustre-discuss] ZFS MDT Corruption

2022-09-15 Thread Scott Ruffner via lustre-discuss
Hi Everyone,

This is more of a ZFS than a Lustre question, but our Lustre cluster MDT HA
pair got into a split-brain condition with the ZFS zpool for the MDT. Upon
examining the situation, both nodes of the HA pair (corosync and pacemaker)
had the MDT zpool imported. A manual export from the node which was failing
over appeared initially to resolve the issue, but the 2nd node still failed
to mount the pool due to errors (despite having it imported).

Now corruption is reported on all the mirror VDEVs which make up the MDT
pool (the MGT pool on the same two nodes is fine).

If I have a node up without its hostid configured, the mirror VDEVs are
reported as healthy, but I'm unable to 'zpool import' the pool, even when
trying to override with -o multihost=no.

I actually suspect that the data is intact and not corrupted, but the "last
mounted" data is bad, and both systems believe the other still has it
mounted due to the metadata.

I'm stumped with getting the MDT pool re-imported on any node, but I may be
missing something.

Scott Ruffner


Re: [lustre-discuss] ZFS wobble

2022-04-28 Thread Simon Guilbault
Hi,

Start a ZFS scrub on your pool. This will ensure that all the content is
fine, since the short resilver when re-adding dead disks to a pool does not
check everything, only what changed on the pool while the disk was gone.

I sadly often see that kind of error on my personal NAS due to some bad
hardware, but ZFS is always able to fix everything even if it
detects "permanent errors", and those permanent errors disappear after the
scrub.
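
In other words (pool name borrowed from your zpool status output):

  zpool scrub cos6-ost7
  zpool status -v cos6-ost7   # watch progress and re-check the error list
  zpool clear cos6-ost7       # once clean, reset the error counters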

On Thu, Apr 28, 2022 at 4:10 AM Alastair Basden via lustre-discuss <
lustre-discuss@lists.lustre.org> wrote:

> Hi,
>
> We have OSDs on ZFS (0.7.9) / Lustre 2.12.6.
>
> Recently, one of our JBODs had a wobble, and the disks (as presented to
> the OS) disappeared for a few seconds (and then returned).
>
> This upset a few zpools which SUSPENDED.
>
> A zpool clear on these then started the resilvering process, and zpool
> status gave e.g.:
> errors: Permanent errors have been detected in the following files:
>
>  :<0x0>
>  :<0xb01>
>  :<0x15>
>  :<0x383>
>  cos6-ost7/ost7:/O/40400/d11/10617643
>  cos6-ost7/ost7:/O/40400/d21/583029
>
>
> However, once the resilvering had completed, these permanent errors had
> gone.
>
> The question is then, are these errors really permanent, or was zfs able
> to correct them?
>
> Lustre continues to remain fine (though obviously froze while the pools
> were suspended).
>
> Should we be worried that there might be some under-the-hood corruption
> that will present itself when we need to remount (e.g. after a reboot) the
> OST?  In particular the :<0x0> file worries me a bit!
>
> Thanks,
> Alastair.


[lustre-discuss] ZFS wobble

2022-04-28 Thread Alastair Basden via lustre-discuss

Hi,

We have OSDs on ZFS (0.7.9) / Lustre 2.12.6.

Recently, one of our JBODs had a wobble, and the disks (as presented to 
the OS) disappeared for a few seconds (and then returned).


This upset a few zpools which SUSPENDED.

A zpool clear on these then started the resilvering process, and zpool 
status gave e.g.:

errors: Permanent errors have been detected in the following files:

:<0x0>
:<0xb01>
:<0x15>
:<0x383>
cos6-ost7/ost7:/O/40400/d11/10617643
cos6-ost7/ost7:/O/40400/d21/583029


However, once the resilvering had completed, these permanent errors had 
gone.


The question is then, are these errors really permanent, or was zfs able 
to correct them?


Lustre continues to remain fine (though obviously froze while the pools 
were suspended).


Should we be worried that there might be some under-the-hood corruption 
that will present itself when we need to remount (e.g. after a reboot) the 
OST?  In particular the :<0x0> file worries me a bit!


Thanks,
Alastair.


[lustre-discuss] ZFS OST on-disk structure

2021-08-05 Thread Thomas Roth



Hi all,

is there somewhere an explanation of the internal structure/content of an OST, 
in particular a ZFS-based one?

Currently, we are decommissioning some old servers. Having emptied the OSTs
of all 'findable' files, I still see differences in what "df" tells me about
the OSTs. So I umounted and remounted the OSTs as local ZFS filesets. Indeed
there were some few OST objects that obviously had been lost from the MDTs.
Nothing is missed by the users, so that's fine.

However, while most of the OSTs are "empty" at about 80 MB, I have a few that
show 200 MB, but no orphaned objects inside.
Searching for such objects, I noticed that e.g. directories d0...d3x in /O/0/
seem to be about 3M in size, with no visible content, while the corresponding
directories on an 80MB-OST are ~600K.


So I am curious about these internals of OSTs. I have already learned that
the famous LAST_ID file does exist on ZFS but is not visible either - could
there be more hidden stuff like that here?
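
I suppose one could peek at such hidden objects with zdb on the unmounted
dataset (a sketch; the pool/dataset names are placeholders):

  zdb -dddd <ostpool>/<ostfs> | less   # dumps every object, visible or not

but maybe there is a better way.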


Best regards,
Thomas


Re: [lustre-discuss] ZFS and OST Space Difference

2021-04-06 Thread Saravanaraj Ayyampalayam via lustre-discuss
I think you are correct. ‘zpool list’ shows raw space; ‘zfs list’ shows the 
space after reservation for parity, etc. In a 10-disk raidz2, ~24% of the 
space is reserved for parity.
This website helps in calculating ZFS capacity. 
https://wintelguy.com/zfs-calc.pl 
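
Back-of-the-envelope for this case (my rough numbers, not measured):

  raw per pool   = 10 x 6.9 TiB  = 69 TiB    (what 'zpool list' shows)
  raidz2 parity  = 2/10 of raw   ~ 13.8 TiB
  usable         ~ 69 - 13.8     ~ 55 TiB
  slop reserve   ~ usable / 32   ~ 1.7 TiB   (spa_slop_shift = 5)

'df' reports 52 TiB, so only another ~1-2 TiB goes to ZFS/Lustre metadata 
overhead, which seems plausible.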

-Raj

> On Apr 6, 2021, at 4:56 PM, Laura Hild via lustre-discuss 
>  wrote:
> 
> > I am not sure about the discrepancy of 3T.  Maybe that is due to some ZFS 
> > and/or Lustre overhead?
> 
> Slop space?
> 
>
> https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#spa-slop-shift
>  
> 
> 
> -Laura
> 
> 
> From: lustre-discuss on behalf of Mohr, Rick via 
> lustre-discuss
> Sent: Tuesday, 6 April 2021 16:34
> To: Makia Minich; lustre-discuss@lists.lustre.org
> Subject: Re: [lustre-discuss] [EXTERNAL] ZFS and OST Space Difference
>  
> Makia,
> 
> The drive sizes are 7.6 TB which translates to about 6.9 TiB (which is the 
> unit that zpool uses for "T").  So the zpool sizes as just 10 x 6.9T = 69T 
> since zpool shows the total amount of disk space available to the pool.  The 
> usable space (which is what df is reporting) should be more like 0.8 x 69T = 
> 55T.  I am not sure about the discrepancy of 3T.  Maybe that is due to some 
> ZFS and/or Lustre overhead?
> 
> --Rick
> 
> On 4/6/21, 3:49 PM, "lustre-discuss on behalf of Makia Minich" 
>  ma...@systemfabricworks.com> wrote:
> 
> I believe this was discussed a while ago, but I was unable to find clear 
> answers, so I’ll re-ask in hopefully a slightly different way.
> On an OST, I have 30 drives, each at 7.6TB. I create 3 raidz2 zpools of 
> 10 devices (ashift=12):
> 
> [root@lustre47b ~]# zpool list
> NAMESIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAGCAP  
> DEDUPHEALTH  ALTROOT
> oss55-0  69.9T  37.3M  69.9T- - 0% 0%  1.00x
> ONLINE  -
> oss55-1  69.9T  37.3M  69.9T- - 0% 0%  1.00x
> ONLINE  -
> oss55-2  69.9T  37.4M  69.9T- - 0% 0%  1.00x
> ONLINE  -
> [root@lustre47b ~]#
> 
> 
> Running a mkfs.lustre against these (and the lustre mount) and I see:
> 
> [root@lustre47b ~]# df -h | grep ost
> oss55-0/ost165 52T   27M   52T   1% /lustre/ost165
> oss55-1/ost166 52T   27M   52T   1% /lustre/ost166
> oss55-2/ost167 52T   27M   52T   1% /lustre/ost167
> [root@lustre47b ~]#
> 
> 
> Basically, we’re seeing a pretty dramatic loss in capacity (156TB vs 
> 209.7TB, so a loss of about 50TB). Is there any insight on where this 
> capacity is disappearing to? Is there some mkfs.lustre or zpool option I 
> missed in creating this? Is something just reporting slightly off and that 
> space really is there?
> 
> Thanks.
> 
> —
> 
> 
> Makia Minich
> 
> Chief Architect
> 
> System Fabric Works
> "Fabric Computing that Works”
> 
> "Oh, I don't know. I think everything is just as it should be, y'know?”
> - Frank Fairfield
> 
> 
> 
> 
> 
> 
> 


Re: [lustre-discuss] ZFS and OST Space Difference

2021-04-06 Thread Laura Hild via lustre-discuss
> I am not sure about the discrepancy of 3T.  Maybe that is due to some ZFS 
> and/or Lustre overhead?

Slop space?

   
https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#spa-slop-shift

-Laura



From: lustre-discuss on behalf of Mohr, Rick 
via lustre-discuss 
Sent: Tuesday, 6 April 2021 16:34
To: Makia Minich; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] [EXTERNAL] ZFS and OST Space Difference

Makia,

The drive sizes are 7.6 TB which translates to about 6.9 TiB (which is the unit 
that zpool uses for "T").  So the zpool sizes are just 10 x 6.9T = 69T since 
zpool shows the total amount of disk space available to the pool.  The usable 
space (which is what df is reporting) should be more like 0.8 x 69T = 55T.  I 
am not sure about the discrepancy of 3T.  Maybe that is due to some ZFS and/or 
Lustre overhead?

--Rick

On 4/6/21, 3:49 PM, "lustre-discuss on behalf of Makia Minich" 
 wrote:

I believe this was discussed a while ago, but I was unable to find clear 
answers, so I’ll re-ask in hopefully a slightly different way.
On an OST, I have 30 drives, each at 7.6TB. I create 3 raidz2 zpools of 10 
devices (ashift=12):

[root@lustre47b ~]# zpool list
NAMESIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAGCAP  DEDUP  
  HEALTH  ALTROOT
oss55-0  69.9T  37.3M  69.9T- - 0% 0%  1.00x
ONLINE  -
oss55-1  69.9T  37.3M  69.9T- - 0% 0%  1.00x
ONLINE  -
oss55-2  69.9T  37.4M  69.9T- - 0% 0%  1.00x
ONLINE  -
[root@lustre47b ~]#


Running a mkfs.lustre against these (and the lustre mount) and I see:

[root@lustre47b ~]# df -h | grep ost
oss55-0/ost165 52T   27M   52T   1% /lustre/ost165
oss55-1/ost166 52T   27M   52T   1% /lustre/ost166
oss55-2/ost167 52T   27M   52T   1% /lustre/ost167
[root@lustre47b ~]#


Basically, we’re seeing a pretty dramatic loss in capacity (156TB vs 
209.7TB, so a loss of about 50TB). Is there any insight on where this capacity 
is disappearing to? Is there some mkfs.lustre or zpool option I missed in 
creating this? Is something just reporting slightly off and that space really 
is there?

Thanks.

—


Makia Minich

Chief Architect

System Fabric Works
"Fabric Computing that Works”

"Oh, I don't know. I think everything is just as it should be, y'know?”
- Frank Fairfield









[lustre-discuss] ZFS and OST Space Difference

2021-04-06 Thread Makia Minich
I believe this was discussed a while ago, but I was unable to find clear 
answers, so I’ll re-ask in hopefully a slightly different way.

On an OST, I have 30 drives, each at 7.6TB. I create 3 raidz2 zpools of 10 
devices (ashift=12):

[root@lustre47b ~]# zpool list
NAMESIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAGCAP  DEDUP
HEALTH  ALTROOT
oss55-0  69.9T  37.3M  69.9T- - 0% 0%  1.00xONLINE  
-
oss55-1  69.9T  37.3M  69.9T- - 0% 0%  1.00xONLINE  
-
oss55-2  69.9T  37.4M  69.9T- - 0% 0%  1.00xONLINE  
-
[root@lustre47b ~]#

Running a mkfs.lustre against these (and the lustre mount) and I see:

[root@lustre47b ~]# df -h | grep ost
oss55-0/ost165 52T   27M   52T   1% /lustre/ost165
oss55-1/ost166 52T   27M   52T   1% /lustre/ost166
oss55-2/ost167 52T   27M   52T   1% /lustre/ost167
[root@lustre47b ~]#

Basically, we’re seeing a pretty dramatic loss in capacity (156TB vs 209.7TB, 
so a loss of about 50TB). Is there any insight on where this capacity is 
disappearing to? Is there some mkfs.lustre or zpool option I missed in creating 
this? Is something just reporting slightly off and that space really is there?

Thanks.

—

Makia Minich
Chief Architect
System Fabric Works
"Fabric Computing that Works”

"Oh, I don't know. I think everything is just as it should be, y'know?”
- Frank Fairfield



Re: [lustre-discuss] zfs

2020-12-21 Thread Jeff Johnson
I was just popping a big bowl of popcorn for this... ;-D

On Mon, Dec 21, 2020 at 6:59 AM Peter Jones  wrote:

> Just in case anyone was wondering – the poster never did reach out to me
> so this does seem to be more of a case of phishing/trolling rather than
> someone being genuinely confused.
>
>
>
> From: Peter Jones 
> Date: Monday, December 14, 2020 at 5:03 AM
> To: Samantha Smith , "
> lustre-discuss@lists.lustre.org" 
> Subject: Re: [lustre-discuss] zfs
>
>
>
> Sam
>
>
>
> Welcome to the list. This is surprising for a number of reasons. Could you
> please reach out to me directly from your corporate account (rather than
> gmail) and I’ll be happy to work this through with you.
>
>
>
> Thanks
>
>
>
> Peter
>
>
>
> From: lustre-discuss  on
> behalf of Samantha Smith 
> Date: Sunday, December 13, 2020 at 5:35 PM
> To: "lustre-discuss@lists.lustre.org" 
> Subject: [lustre-discuss] zfs
>
>
>
> Our team received a demand letter from an Oracle attorney claiming patent
> violations on zfs used in our DDN lustre cluster.
>
>
>
> We called our DDN sales person who gave us a non-answer and has refused to
> call us back.
>
>
>
> How are other people dealing with this?
>
>
>
> sam
>
>


-- 
--
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite C - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage


Re: [lustre-discuss] zfs

2020-12-21 Thread Peter Jones
Just in case anyone was wondering – the poster never did reach out to me so 
this does seem to be more of a case of phishing/trolling rather than someone 
being genuinely confused.

From: Peter Jones 
Date: Monday, December 14, 2020 at 5:03 AM
To: Samantha Smith , 
"lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] zfs

Sam

Welcome to the list. This is surprising for a number of reasons. Could you 
please reach out to me directly from your corporate account (rather than gmail) 
and I’ll be happy to work this through with you.

Thanks

Peter

From: lustre-discuss  on behalf of 
Samantha Smith 
Date: Sunday, December 13, 2020 at 5:35 PM
To: "lustre-discuss@lists.lustre.org" 
Subject: [lustre-discuss] zfs

Our team received a demand letter from an Oracle attorney claiming patent 
violations on zfs used in our DDN lustre cluster.

We called our DDN sales person who gave us a non-answer and has refused to call 
us back.

How are other people dealing with this?

sam



Re: [lustre-discuss] zfs

2020-12-17 Thread Marc DiZoglio
Hi Sam,


We have no trace of any such activity or calls, and there has been no reply, 
which leads us to believe, as we know first hand, that this is nothing other 
than spam being sent out to our community.


Thanks,


Marc DiZoglio
mdizog...@raidinc.com
603-275-6672
www.raidinc.com



From: lustre-discuss  on behalf of 
Samantha Smith 
Sent: Sunday, December 13, 2020 8:34 PM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] zfs

Our team received a demand letter from an Oracle attorney claiming patent 
violations on zfs used in our DDN lustre cluster.

We called our DDN sales person who gave us a non-answer and has refused to call 
us back.

How are other people dealing with this?

sam



Re: [lustre-discuss] zfs

2020-12-14 Thread Peter Jones
Sam

Welcome to the list. This is surprising for a number of reasons. Could you 
please reach out to me directly from your corporate account (rather than gmail) 
and I’ll be happy to work this through with you.

Thanks

Peter

From: lustre-discuss  on behalf of 
Samantha Smith 
Date: Sunday, December 13, 2020 at 5:35 PM
To: "lustre-discuss@lists.lustre.org" 
Subject: [lustre-discuss] zfs

Our team received a demand letter from an Oracle attorney claiming patent 
violations on zfs used in our DDN lustre cluster.

We called our DDN sales person who gave us a non-answer and has refused to call 
us back.

How are other people dealing with this?

sam



Re: [lustre-discuss] zfs

2020-12-14 Thread Peter Kjellström
On Mon, 14 Dec 2020 10:17:07 +0100
Pascal Suter  wrote:

> are you sure this is a legitimate letter and not just some scammer?

+1 troll warning.

That email address has never posted here before, and searching wider
also gives me zero hits. Nicely vague and trolly too.

/Peter K

> One would expect that such a letter would cause an immediate
> shitstorm, and so far googling for "zfs oracle patent" only reveals
> some old news regarding the netapp vs oracle fight which ended in
> september this year [1].
> 
> [1]  https://cdrdv2.intel.com/v1/dl/getContent/630393
> 
> cheers
> 
> Pascal


Re: [lustre-discuss] zfs

2020-12-14 Thread Pascal Suter
Are you sure this is a legitimate letter and not just some scammer? One 
would expect such a letter to cause an immediate shitstorm, and so far 
googling for "zfs oracle patent" only reveals some old news regarding the 
NetApp vs Oracle fight, which ended in September this year 
[1].


[1]  https://cdrdv2.intel.com/v1/dl/getContent/630393

cheers

Pascal



[lustre-discuss] zfs

2020-12-13 Thread Samantha Smith
Our team received a demand letter from an Oracle attorney claiming patent
violations on zfs used in our DDN lustre cluster.

We called our DDN sales person who gave us a non-answer and has refused to
call us back.

How are other people dealing with this?

sam


Re: [lustre-discuss] ZFS w/Lustre problem

2020-11-19 Thread Steve Thompson

On Wed, 18 Nov 2020, Faaland, Olaf P. wrote:

You mentioned you may have a fix for zfs_send.c in ZFS.  Although Lustre 
tickles the bug, it's not likely that is the only way to tickle it.


Is there already a bug report for your issue at 
https://github.com/openzfs/zfs/issues?  If not, can you create one, even 
if your patch isn't successful?  That's the place to get your patch 
landed, and/or get help with the issue.


The issue is already reported as #8067. The patch mentioned in this 
article is still valid, but for a different file; the txg_wait_synced()
call is at line 2215 of dmu_send.c on ZFS 0.7.13. This patch does not fix 
the very slow performance of 'zfs recv', but it does fix the 'dataset 
does not exist' problem.


Steve
--

Steve Thompson E-mail:  smt AT vgersoft DOT com
Voyager Software LLC   Web: http://www DOT vgersoft DOT com
3901 N Charles St  VSW Support: support AT vgersoft DOT com
Baltimore MD 21218
  "186,282 miles per second: it's not just a good idea, it's the law"

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS w/Lustre problem

2020-11-18 Thread Faaland, Olaf P.
Hi Steve,

You mentioned you may have a fix for zfs_send.c in ZFS.   Although Lustre 
tickles the bug, it's not likely that is the only way to tickle it.

Is there already a bug report for your issue at 
https://github.com/openzfs/zfs/issues?  If not, can you create one, even if 
your patch isn't successful?  That's the place to get your patch landed, and/or 
get help with the issue.

thanks,
-Olaf


From: lustre-discuss  on behalf of 
Steve Thompson 
Sent: Tuesday, November 10, 2020 5:06 AM
To: Hans Henrik Happe
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] ZFS w/Lustre problem

On Mon, 9 Nov 2020, Hans Henrik Happe wrote:

> It sounds like this issue, but I'm not sure what your dnodesize is:
>
> https://github.com/openzfs/zfs/issues/8458
>
> ZFS 0.8.1+ on the receiving side should fix it. Then again ZFS 0.8 is
> not supported in Lustre 2.12, so it's a bit hard to restore, without
> copying the underlying devices.

Hans Henrik,

Many thanks for your input. I had in fact known about the dnodesize issue,
and tested a workaround. Unfortunately, it turned out not to be this.
Instead, I have tested a patch to zfs_send.c, which does appear to have
solved the issue. The zfs send/recv is still running, however; if it
completes successfully, I will post again with details of the patch.

Steve
--

Steve Thompson E-mail:  smt AT vgersoft DOT com
Voyager Software LLC   Web: http://www DOT vgersoft DOT com
3901 N Charles St  VSW Support: support AT vgersoft DOT com
Baltimore MD 21218
   "186,282 miles per second: it's not just a good idea, it's the law"

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS w/Lustre problem

2020-11-10 Thread Steve Thompson

On Mon, 9 Nov 2020, Hans Henrik Happe wrote:


It sounds like this issue, but I'm not sure what your dnodesize is:

https://github.com/openzfs/zfs/issues/8458

ZFS 0.8.1+ on the receiving side should fix it. Then again ZFS 0.8 is
not supported in Lustre 2.12, so it's a bit hard to restore, without
copying the underlying devices.


Hans Henrik,

Many thanks for your input. I had in fact known about the dnodesize issue, 
and tested a workaround. Unfortunately, it turned out not to be this. 
Instead, I have tested a patch to zfs_send.c, which does appear to have 
solved the issue. The zfs send/recv is still running, however; if it 
completes successfully, I will post again with details of the patch.


Steve
--

Steve Thompson E-mail:  smt AT vgersoft DOT com
Voyager Software LLC   Web: http://www DOT vgersoft DOT com
3901 N Charles St  VSW Support: support AT vgersoft DOT com
Baltimore MD 21218
  "186,282 miles per second: it's not just a good idea, it's the law"

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS w/Lustre problem

2020-11-09 Thread Hans Henrik Happe
It sounds like this issue, but I'm not sure what your dnodesize is:

https://github.com/openzfs/zfs/issues/8458

ZFS 0.8.1+ on the receiving side should fix it. Then again ZFS 0.8 is
not supported in Lustre 2.12, so it's a bit hard to restore, without
copying the underlying devices.

Cheers,
Hans Henrik

On 06.11.2020 21.23, Steve Thompson wrote:
> This may be a question for the ZFS list...
>
> I have Lustre 2.12.5 on Centos 7.8 with ZFS 0.7.13, 10GB network. I
> make snapshots of the Lustre filesystem with 'lctl snapshot_create'
> and at a later time transfer these snapshots to a backup system with
> zfs send/recv. This works well for everything but the MDT. For the
> MDT, I find that the zfs recv always fails when a little less than 1GB
> has been transferred (this being an incremental send/recv of snapshots
> taken a day apart):
>
> # zfs send -v -c -i fs0pool/mdt0@03-nov-2020 fs0pool/mdt0@04-nov-2020 | \
> zfs recv -F backups/fs0pool/mdt0
> 
> 12:11:18    946M   fs0pool/mdt0@04-nov-2020-01:00
> 12:11:19    946M   fs0pool/mdt0@04-nov-2020-01:00
> 12:11:20    946M   fs0pool/mdt0@04-nov-2020-01:00
> cannot receive incremental stream: dataset does not exist
>
> while if the data transfer is much smaller, the send/recv works. Since
> once I get a failure it is not possible to complete a send/recv for
> any subsequent day, I am doing a full snapshot send to a file; this
> always works and takes about 5-6 minutes for my MDT. When using zfs
> send/recv, the recv is always very very slow (several hours to get to
> the above failure point, even when using mbuffer). I am using custom
> zfs replication scripts, but it also fails using the zrep package.
>
> Does anyone know of a possible explanation? Is there any version of
> ZFS 0.8 that works with Lustre 2.12.5?
>
> Thanks,
> Steve

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] ZFS w/Lustre problem

2020-11-06 Thread Steve Thompson

This may be a question for the ZFS list...

I have Lustre 2.12.5 on Centos 7.8 with ZFS 0.7.13, 10GB network. I make 
snapshots of the Lustre filesystem with 'lctl snapshot_create' and at a 
later time transfer these snapshots to a backup system with zfs send/recv. 
This works well for everything but the MDT. For the MDT, I find that the 
zfs recv always fails when a little less than 1GB has been transferred 
(this being an incremental send/recv of snapshots taken a day apart):


# zfs send -v -c -i fs0pool/mdt0@03-nov-2020 fs0pool/mdt0@04-nov-2020 | \
zfs recv -F backups/fs0pool/mdt0

12:11:18    946M   fs0pool/mdt0@04-nov-2020-01:00
12:11:19    946M   fs0pool/mdt0@04-nov-2020-01:00
12:11:20    946M   fs0pool/mdt0@04-nov-2020-01:00
cannot receive incremental stream: dataset does not exist

while if the data transfer is much smaller, the send/recv works. Since 
once I get a failure it is not possible to complete a send/recv for any 
subsequent day, I am doing a full snapshot send to a file; this always 
works and takes about 5-6 minutes for my MDT. When using zfs send/recv, 
the recv is always very very slow (several hours to get to the above 
failure point, even when using mbuffer). I am using custom zfs replication 
scripts, but it also fails using the zrep package.
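
For reference, the send-to-file workaround described above looks roughly like 
this (file path hypothetical; pool and snapshot names as in the commands above):

# dump a full snapshot stream to a file instead of piping into zfs recv
zfs send -v -c fs0pool/mdt0@04-nov-2020 > /backup/mdt0-04-nov-2020.zstream
# a stream saved this way can later be restored with
zfs receive -F backups/fs0pool/mdt0 < /backup/mdt0-04-nov-2020.zstream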


Does anyone know of a possible explanation? Is there any version of ZFS 
0.8 that works with Lustre 2.12.5?


Thanks,
Steve
--

Steve Thompson E-mail:  smt AT vgersoft DOT com
Voyager Software LLC   Web: http://www DOT vgersoft DOT com
3901 N Charles St  VSW Support: support AT vgersoft DOT com
Baltimore MD 21218
  "186,282 miles per second: it's not just a good idea, it's the law"

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS atime is it required?

2020-11-02 Thread Kumar, Amit
Thank you Andreas!! That is helpful.

Regards,
Amit

From: Andreas Dilger 
Sent: Thursday, October 29, 2020 4:28 AM
To: Kumar, Amit 
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] ZFS atime is it required?

On Oct 23, 2020, at 14:03, Kumar, Amit 
<ahku...@mail.smu.edu> wrote:

Dear All,

Quick question, can I get away with setting "zfs set atime=off 
on_all_my_volumes_mgt_mdt_and_osts"? I ask this as it is noted to be a 
performance-boosting tip, with the assumption that the filesystem (Lustre) 
handles all access times?

You don't really need atime enabled on the OSTs, but I also don't think 
"atime=off" will make any difference.  That is a VFS/ZPL level option, and 
Lustre osd-zfs doesn't use any of the ZPL code, but rather handles atime 
internally.

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud





___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS atime is it required?

2020-10-29 Thread Andreas Dilger
On Oct 23, 2020, at 14:03, Kumar, Amit 
<ahku...@mail.smu.edu> wrote:

Dear All,

Quick question, can I get away with setting “zfs set atime=off 
on_all_my_volumes_mgt_mdt_and_osts”? I ask this as it is noted to be a 
performance-boosting tip, with the assumption that the filesystem (Lustre) 
handles all access times?

You don't really need atime enabled on the OSTs, but I also don't think 
"atime=off" will make any difference.  That is a VFS/ZPL level option, and 
Lustre osd-zfs doesn't use any of the ZPL code, but rather handles atime 
internally.

Cheers, Andreas
--
Andreas Dilger
Principal Lustre Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] ZFS atime is it required?

2020-10-23 Thread Kumar, Amit
Dear All,

Quick question, can I get away with setting "zfs set atime=off 
on_all_my_volumes_mgt_mdt_and_osts"? I ask this as it is noted to be a 
performance-boosting tip, with the assumption that the filesystem (Lustre) 
handles all access times?

Thank you,
Amit

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Zfs backend file system level backup

2020-01-30 Thread Yong, Fan
It can be done any time before unmounting the Lustre device, regardless of 
whether it is a newly formatted system or a live zfs-based system with data, as 
long as the version supports file-level backup.
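
For example (file system and target names hypothetical), it is just:

# any time before unmounting the target
lctl set_param osd-zfs.testfs-MDT0000.index_backup=1
umount /mnt/lustre/testfs-mdt0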

--
Cheers,
Nasf

-Original Message-
From: lustre-discuss  On Behalf Of 
BASDEN, ALASTAIR G.
Sent: Thursday, January 30, 2020 12:36 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Zfs backend file system level backup

Hi,
I wish to perform a zfs backend file system level backup.  The Lustre manual 
says to:
lctl set_param osd-zfs.${fsname}-${target}.index_backup=1

Does this have to be done right at the start before the file system has been 
used, or can it be done to a live file system with data already on it?

Thanks,
Alastair.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Zfs backend file system level backup

2020-01-29 Thread BASDEN, ALASTAIR G.
Hi,
I wish to perform a zfs backend file system level backup.  The Lustre 
manual says to:
lctl set_param osd-zfs.${fsname}-${target}.index_backup=1

Does this have to be done right at the start before the file system has 
been used, or can it be done to a live file system with data already on 
it?

Thanks,
Alastair.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS mdt

2020-01-27 Thread Hans Henrik Happe
Hi,

Obviously it depends on the number of inodes (files, dirs, etc.) you have.
However, if the zfs pool uses ashift=12, which will be the default for
most SSDs and large HDDs, it can have quite an overhead for MDTs. It
takes a lot of bandwidth on MDTs if you do a lot of metadata operations.

You can check it by running 'zdb |grep ashift'.

We are using ashift=9 on SSDs for this reason, even though the SSDs
would prefer it differently.
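
As a sketch (pool and device names hypothetical), checking the value and 
forcing 512-byte sectors at pool creation would look like:

# check the ashift of existing pools
zdb | grep ashift
# force ashift=9 when creating a new pool for the MDT
zpool create -o ashift=9 mdtpool mirror sda sdb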

Cheers,
Hans Henrik

On 27.01.2020 19.07, Nehring, Shane R [LAS] wrote:
> Hey all,
> 
> We've been running a lustre volume for a few years now and it's been
> working quite well. We've been using ZFS as the backend storage and
> while that's been working well I've noticed that the space usage is
> a little weird on the mdt:
> 
> NAMEPROPERTY   VALUE SOURCE
> store/work-mdt  used   4.91T -
> store/work-mdt  logicalused960G  -
> store/work-mdt  referenced 4.91T -
> store/work-mdt  logicalreferenced  960G  -
> store/work-mdt  compressratio  1.00x -
> 
> Just wondering if anyone else has noticed this kind of overhead.
> 
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> 
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] ZFS mdt

2020-01-27 Thread Nehring, Shane R [LAS]
Hey all,

We've been running a lustre volume for a few years now and it's been
working quite well. We've been using ZFS as the backend storage and
while that's been working well I've noticed that the space usage is
a little weird on the mdt:

NAMEPROPERTY   VALUE SOURCE
store/work-mdt  used   4.91T -
store/work-mdt  logicalused960G  -
store/work-mdt  referenced 4.91T -
store/work-mdt  logicalreferenced  960G  -
store/work-mdt  compressratio  1.00x -

Just wondering if anyone else has noticed this kind of overhead.


smime.p7s
Description: S/MIME cryptographic signature
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] zfs mounting

2019-07-26 Thread BASDEN, ALASTAIR G.
Hi all,

For future reference, it turned out to be the zfs version.

Fixed using:
rmmod zfs
yum remove libzfs2 libzfs2-devel libzpool2 libuutil1 libnvpair1 spl spl-dkms 
zfs zfs-dkms
yum install zfs-0.7.5  lustre-dkms lustre-osd-zfs-mount
Reboot
zpool import
mount -t lustre test-ost0/ost0 /mnt/lustre/test-ost0
...


Thanks for the responses.

Cheers,
Alastair.


On Thu, 25 Jul 2019, BASDEN, ALASTAIR G. wrote:

> Hi,
>
> I am trying to bring up a new zfs backend file system.
> CentOS 7.4, Lustre 2.10.3, zfs 0.7.12.
>
> I do the following:
> zpool create -O canmount=off -o cachefile=none test-ost0 raidz2 
>
> mkfs.lustre --fsname=test --ost --backfstype=zfs --index=0 --mgsnode=nid1 
> test-ost0/ost0
>
> This seems to work. However, I can't work out how to mount:
> mount -t lustre test-ost0/ost0 /mnt/lustre/test-ost0
> mount.lustre: mount test-ost0/ost0 at /mnt/lustre/test-ost0 failed: No such 
> device
> Are the lustre modules loaded?
> Check /etc/modprobe.conf and /proc/filesystems
>
> tail /var/log/messages:
> Jul 25 15:46:12 oss01 kernel: LustreError: 158-c: Can't load module 'osd-zfs'
> Jul 25 15:46:12 oss01 kernel: LustreError: 
> 216602:0:(genops.c:318:class_newdev()) OBD: unknown type: osd-zfs
> Jul 25 15:46:12 oss01 kernel: LustreError: 
> 216602:0:(obd_config.c:402:class_attach()) Cannot create device 
> test-OST-osd of type osd-zfs : -19
> Jul 25 15:46:12 oss01 kernel: LustreError: 
> 216602:0:(obd_mount.c:198:lustre_start_simple()) test-OST-osd attach 
> error -19
> Jul 25 15:46:12 oss01 kernel: LustreError: 
> 216602:0:(obd_mount_server.c:1832:server_fill_super()) Unable to start osd on 
> test-ost0/ost0: -19
> Jul 25 15:46:12 oss01 kernel: LustreError: 
> 216602:0:(obd_mount.c:1506:lustre_fill_super()) Unable to mount  (-19)
>
> Is that the correct mount command?  (I've tried a few others too).
>
> Is the problem a version incompatibility between lustre and zfs?
>
> Many thanks,
> Alastair.
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] zfs mounting

2019-07-25 Thread João Carlos Mendes Luís

On 25/07/2019 11:49, BASDEN, ALASTAIR G. wrote:

Hi,

I am trying to bring up a new zfs backend file system.
CentOS 7.4, Lustre 2.10.3, zfs 0.7.12.

I do the following:
zpool create -O canmount=off -o cachefile=none test-ost0 raidz2 

mkfs.lustre --fsname=test --ost --backfstype=zfs --index=0 --mgsnode=nid1 
test-ost0/ost0

This seems to work. However, I can't work out how to mount:
mount -t lustre test-ost0/ost0 /mnt/lustre/test-ost0
mount.lustre: mount test-ost0/ost0 at /mnt/lustre/test-ost0 failed: No such 
device
Are the lustre modules loaded?
Check /etc/modprobe.conf and /proc/filesystems

tail /var/log/messages:
Jul 25 15:46:12 oss01 kernel: LustreError: 158-c: Can't load module 'osd-zfs'



Did you install the ZFS version of the Lustre RPMs?
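
A quick sanity check, roughly (package names per the fix posted elsewhere in 
this thread):

rpm -qa | grep -E 'lustre-osd-zfs|lustre-dkms'
modprobe osd_zfs && lsmod | grep osd_zfs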



Jul 25 15:46:12 oss01 kernel: LustreError: 
216602:0:(genops.c:318:class_newdev()) OBD: unknown type: osd-zfs
Jul 25 15:46:12 oss01 kernel: LustreError: 
216602:0:(obd_config.c:402:class_attach()) Cannot create device 
test-OST-osd of type osd-zfs : -19
Jul 25 15:46:12 oss01 kernel: LustreError: 
216602:0:(obd_mount.c:198:lustre_start_simple()) test-OST-osd attach error 
-19
Jul 25 15:46:12 oss01 kernel: LustreError: 
216602:0:(obd_mount_server.c:1832:server_fill_super()) Unable to start osd on 
test-ost0/ost0: -19
Jul 25 15:46:12 oss01 kernel: LustreError: 
216602:0:(obd_mount.c:1506:lustre_fill_super()) Unable to mount  (-19)

Is that the correct mount command?  (I've tried a few others too).

Is the problem a version incompatibility between lustre and zfs?

Many thanks,
Alastair.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] zfs mounting

2019-07-25 Thread BASDEN, ALASTAIR G.
Hi,

I am trying to bring up a new zfs backend file system.
CentOS 7.4, Lustre 2.10.3, zfs 0.7.12.

I do the following:
zpool create -O canmount=off -o cachefile=none test-ost0 raidz2 

mkfs.lustre --fsname=test --ost --backfstype=zfs --index=0 --mgsnode=nid1 
test-ost0/ost0

This seems to work. However, I can't work out how to mount:
mount -t lustre test-ost0/ost0 /mnt/lustre/test-ost0
mount.lustre: mount test-ost0/ost0 at /mnt/lustre/test-ost0 failed: No such 
device
Are the lustre modules loaded?
Check /etc/modprobe.conf and /proc/filesystems

tail /var/log/messages:
Jul 25 15:46:12 oss01 kernel: LustreError: 158-c: Can't load module 'osd-zfs'
Jul 25 15:46:12 oss01 kernel: LustreError: 
216602:0:(genops.c:318:class_newdev()) OBD: unknown type: osd-zfs
Jul 25 15:46:12 oss01 kernel: LustreError: 
216602:0:(obd_config.c:402:class_attach()) Cannot create device 
test-OST-osd of type osd-zfs : -19
Jul 25 15:46:12 oss01 kernel: LustreError: 
216602:0:(obd_mount.c:198:lustre_start_simple()) test-OST-osd attach error 
-19
Jul 25 15:46:12 oss01 kernel: LustreError: 
216602:0:(obd_mount_server.c:1832:server_fill_super()) Unable to start osd on 
test-ost0/ost0: -19
Jul 25 15:46:12 oss01 kernel: LustreError: 
216602:0:(obd_mount.c:1506:lustre_fill_super()) Unable to mount  (-19)

Is that the correct mount command?  (I've tried a few others too).

Is the problem a version incompatibility between lustre and zfs?

Many thanks,
Alastair.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS tunings

2019-07-16 Thread Degremont, Aurelien
Nobody on this topic?
I'm pretty sure there are lots of people running Lustre on ZFS with various 
tunings applied. Don't be shy.


From: "Degremont, Aurelien" 
Date: Wednesday, 10 July 2019 at 10:35
To: "lustre-discuss@lists.lustre.org" 
Subject: ZFS tunings

Hi all,

I know that good default tunings for ZFS when used with Lustre are a big topic. 
There are several wiki pages and LUG/LAD slides which are a few years old, and 
it is difficult to know which ones are still relevant with recent Lustre 
and ZFS versions.

Does anybody have insight into which tunings are important? Especially for:

  *   zfs atime (does the DMU, when used by Lustre, also update atime?)
  *   redundant_metadata=most
  *   metaslab_debug_unload=1
  *   zfs_prefetch_disable=1
  *   zfs_vdev_scheduler=deadline

Lustre already sets ‘xattr=sa’, ‘dnodesize=auto’ and ‘recordsize=1M’ 
automatically since Lustre 2.11.

Thanks

Aurélien


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] ZFS tunings

2019-07-10 Thread Degremont, Aurelien
Hi all,

I know that good default tunings for ZFS when used with Lustre are a big topic. 
There are several wiki pages and LUG/LAD slides which are a few years old, and 
it is difficult to know which ones are still relevant with recent Lustre 
and ZFS versions.

Does anybody have insight into which tunings are important? Especially for:

  *   zfs atime (does the DMU, when used by Lustre, also update atime?)
  *   redundant_metadata=most
  *   metaslab_debug_unload=1
  *   zfs_prefetch_disable=1
  *   zfs_vdev_scheduler=deadline

Lustre already sets ‘xattr=sa’, ‘dnodesize=auto’ and ‘recordsize=1M’ 
automatically since Lustre 2.11.
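
A quick way to inspect what is currently in effect (dataset name hypothetical):

# dataset-level properties
zfs get atime,redundant_metadata,xattr,dnodesize,recordsize tank/ost0
# module-level tunables
grep -H . /sys/module/zfs/parameters/zfs_prefetch_disable \
  /sys/module/zfs/parameters/zfs_vdev_scheduler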

Thanks

Aurélien


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS and multipathing for OSTs

2019-04-26 Thread Harr, Cameron
We use a simple multipath config and then have our vdev_id.conf set up like the 
following:

multipath yes

# Intent of channel names:
#   First letter  {L,U} indicates lower or upper enclosure
#  PCI_ID  HBA PORT CHANNEL NAME
channel  05:00.0  1  L
channel  05:00.0  0  U
channel  06:00.0  1  L
channel  06:00.0  0  U


This results in the devices being available in /dev/disk/by-vdev/L{1..N} and 
U{1..N}. We then create the zpools using those L and U devices. When we fail a 
drive, we use zpool offline -f <[U|L]X> to fault the device (unless the ZED has 
already faulted it automatically). When the drive is "faulted" with that -f 
option, the ZED automatically resilvers the drive when replaced. We never 
manipulate multipath.
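
A sketch of that flow, with a hypothetical pool "tank" and slot name U3:

# fault the drive (unless the ZED already did)
zpool offline -f tank U3
# physically swap the disk; the ZED then resilvers automatically.
# without ZED/autoreplace, the manual equivalent would be:
zpool replace tank U3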


On 4/26/19 9:50 AM, Riccardo Veraldi wrote:
In my experience multipathd+ZFS works well, and it has usually worked well.
I just remove the broken disk when it happens, replace it, and the new 
multipathd device is added once the disk is replaced, and then I start 
resilvering.
Anyway, I found out this does not always work with some versions of JBOD disk 
array/firmware.
Some Proware controllers that I had did not recognize that a disk was replaced. 
But this is not a multipathd problem in my case.
So my hint is to try it out with your hardware and see how it behaves.

On 26/04/2019 16:57, Kurt Strosahl wrote:

Hey, thanks!


I tried the multipathing part you had down there and I couldn't get it to 
work... I did find that this worked though

#I pick a victim device
multipath -ll
...
mpathax (35000cca2680a8194) dm-49 HGST,HUH721010AL5200
size=9.1T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=enabled
  |- 1:0:10:0   sdj 8:144   active ready running
  `- 11:0:9:0   sddy128:0   active ready running
#then I remove the device
multipath -f mpathax
#and verify that it is gone
multipath -ll | grep mpathax
#then I run the following, which seems to rescan for devices.
multipath -v2
Apr 26 10:49:06 | sdj: No SAS end device for 'end_device-1:1'
Apr 26 10:49:06 | sddy: No SAS end device for 'end_device-11:1'
create: mpathax (35000cca2680a8194) undef HGST,HUH721010AL5200
size=9.1T features='0' hwhandler='0' wp=undef
`-+- policy='service-time 0' prio=1 status=undef
  |- 1:0:10:0   sdj 8:144   undef ready running
  `- 11:0:9:0   sddy128:0   undef ready running
#then its back
multipath -ll mpathax
mpathax (35000cca2680a8194) dm-49 HGST,HUH721010AL5200
size=9.1T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=enabled
  |- 1:0:10:0   sdj 8:144   active ready running
  `- 11:0:9:0   sddy128:0   active ready running

I still need to test it fully once I get the whole stack up and running, but 
this seems to be a step in the right direction.


w/r,
Kurt


From: Jongwoo Han <jongwoo...@gmail.com>
Sent: Friday, April 26, 2019 6:28 AM
To: Kurt Strosahl
Cc: lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] ZFS and multipathing for OSTs

Disk replacement with multipathd + zfs is somewhat inconvenient.

step1: mark offline the disk you should replace with zpool command
step2: remove disk from multipathd table with multipath -f 
step3: replace disk
step4: add disk to multipath table with multipath -ll 
step5:  replace disk in zpool with zpool replace

try this in your test environment and tell us if you find anything 
interesting in the syslog.
In my case, replacing a single disk in a multipathd+zfs pool triggered a 
massive udevd partition scan.

Thanks
Jongwoo Han

On Fri, Apr 26, 2019 at 3:44 AM, Kurt Strosahl 
<stros...@jlab.org> wrote:

Good Afternoon,


As part of a new lustre deployment I've now got two disk shelves connected 
redundantly to two servers.  Since each disk has two paths to the server I'd 
like to use multipathing for both redundancy and improved performance.  I 
haven't found examples or discussion about such a setup, and was wondering if 
there are any resources out there that I could consult.


Of particular interest would be examples of the /etc/zfs/vdev_id.conf and any 
tuning that was done.  I'm also wondering about extra steps that may have to be 
taken when doing a disk replacement to account for the multipathing.  I've got 
plenty of time to experiment with this process, but I'd rather not reinvent the 
wheel if I don't have to.


w/r,

Kurt J. Strosahl
System Administrator: Lustre, HPC
Scientific Computing Group, Thomas Jefferson National Accelerator Facility

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Re: [lustre-discuss] ZFS and multipathing for OSTs

2019-04-26 Thread Riccardo Veraldi
In my experience multipathd+ZFS works well, and it has usually worked well.
I just remove the broken disk when it happens, replace it, and the new
multipathd device is added once the disk is replaced, and then I
start resilvering.
Anyway, I found out this does not always work with some versions of JBOD disk
array/firmware.
Some Proware controllers that I had did not recognize that a disk was
replaced. But this is not a multipathd problem in my case.
So my hint is to try it out with your hardware and see how it behaves.

On 26/04/2019 16:57, Kurt Strosahl wrote:
>
> Hey, thanks!
>
>
> I tried the multipathing part you had down there and I couldn't get it
> to work... I did find that this worked though
>
>
> #I pick a victim device
> multipath -ll
> ...
> mpathax (35000cca2680a8194) dm-49 HGST    ,HUH721010AL5200 
> size=9.1T features='0' hwhandler='0' wp=rw
> `-+- policy='service-time 0' prio=1 status=enabled
>   |- 1:0:10:0   sdj     8:144   active ready running
>   `- 11:0:9:0   sddy    128:0   active ready running
> #then I remove the device
> multipath -f mpathax
> #and verify that it is gone
> multipath -ll | grep mpathax
> #then I run the following, which seems to rescan for devices.
> multipath -v2
> Apr 26 10:49:06 | sdj: No SAS end device for 'end_device-1:1'
> Apr 26 10:49:06 | sddy: No SAS end device for 'end_device-11:1'
> create: mpathax (35000cca2680a8194) undef HGST    ,HUH721010AL5200 
> size=9.1T features='0' hwhandler='0' wp=undef
> `-+- policy='service-time 0' prio=1 status=undef
>   |- 1:0:10:0   sdj     8:144   undef ready running
>   `- 11:0:9:0   sddy    128:0   undef ready running
> #then its back
> multipath -ll mpathax
> mpathax (35000cca2680a8194) dm-49 HGST    ,HUH721010AL5200 
> size=9.1T features='0' hwhandler='0' wp=rw
> `-+- policy='service-time 0' prio=1 status=enabled
>   |- 1:0:10:0   sdj     8:144   active ready running
>   `- 11:0:9:0   sddy    128:0   active ready running
>
> I still need to test it fully once I get the whole stack up and
> running, but this seems to be a step in the right direction.
>
>
> w/r,
> Kurt
>
> 
> *From:* Jongwoo Han 
> *Sent:* Friday, April 26, 2019 6:28 AM
> *To:* Kurt Strosahl
> *Cc:* lustre-discuss@lists.lustre.org
> *Subject:* Re: [lustre-discuss] ZFS and multipathing for OSTs
>  
> Disk replacement with multipathd + zfs is somewhat inconvenient.
>
> step1: mark offline the disk you should replace with zpool command
> step2: remove disk from multipathd table with multipath -f 
> step3: replace disk
> step4: add disk to multipath table with multipath -ll 
> step5:  replace disk in zpool with zpool replace
>
> try this in your test environment and tell us if you find
> anything interesting in the syslog.
> In my case, replacing a single disk in a multipathd+zfs pool triggered
> a massive udevd partition scan.
>
> Thanks
> Jongwoo Han
>
> On Fri, Apr 26, 2019 at 3:44 AM, Kurt Strosahl <stros...@jlab.org> wrote:
>
> Good Afternoon,
>
>
>     As part of a new lustre deployment I've now got two disk
> shelves connected redundantly to two servers.  Since each disk has
> two paths to the server I'd like to use multipathing for both
> redundancy and improved performance.  I haven't found examples or
> discussion about such a setup, and was wondering if there are any
> resources out there that I could consult.
>
>
> Of particular interest would be examples of the
> /etc/zfs/vdev_id.conf and any tuning that was done.  I'm also
> wondering about extra steps that may have to be taken when doing a
> disk replacement to account for the multipathing.  I've got plenty
> of time to experiment with this process, but I'd rather not
> reinvent the wheel if I don't have to.
>
>
> w/r,
>
> Kurt J. Strosahl
> System Administrator: Lustre, HPC
> Scientific Computing Group, Thomas Jefferson National Accelerator
> Facility
>
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> <mailto:lustre-discuss@lists.lustre.org>
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> 
>
>
>
> -- 
> Jongwoo Han
> +82-505-227-6108
>
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS and multipathing for OSTs

2019-04-26 Thread Kurt Strosahl
Hey, thanks!


I tried the multipathing part you had down there and I couldn't get it to 
work... I did find that this worked though

#I pick a victim device
multipath -ll
...
mpathax (35000cca2680a8194) dm-49 HGST,HUH721010AL5200
size=9.1T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=enabled
  |- 1:0:10:0   sdj 8:144   active ready running
  `- 11:0:9:0   sddy128:0   active ready running
#then I remove the device
multipath -f mpathax
#and verify that it is gone
multipath -ll | grep mpathax
#then I run the following, which seems to rescan for devices.
multipath -v2
Apr 26 10:49:06 | sdj: No SAS end device for 'end_device-1:1'
Apr 26 10:49:06 | sddy: No SAS end device for 'end_device-11:1'
create: mpathax (35000cca2680a8194) undef HGST,HUH721010AL5200
size=9.1T features='0' hwhandler='0' wp=undef
`-+- policy='service-time 0' prio=1 status=undef
  |- 1:0:10:0   sdj 8:144   undef ready running
  `- 11:0:9:0   sddy128:0   undef ready running
#then its back
multipath -ll mpathax
mpathax (35000cca2680a8194) dm-49 HGST,HUH721010AL5200
size=9.1T features='0' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=enabled
  |- 1:0:10:0   sdj 8:144   active ready running
  `- 11:0:9:0   sddy128:0   active ready running

I still need to test it fully once I get the whole stack up and running, but 
this seems to be a step in the right direction.


w/r,
Kurt


From: Jongwoo Han 
Sent: Friday, April 26, 2019 6:28 AM
To: Kurt Strosahl
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] ZFS and multipathing for OSTs

Disk replacement with multipathd + zfs is somewhat inconvenient.

step1: mark offline the disk you should replace with zpool command
step2: remove disk from multipathd table with multipath -f 
step3: replace disk
step4: add disk to multipath table with multipath -ll 
step5:  replace disk in zpool with zpool replace

try this in your test environment and tell us if you find anything 
interesting in the syslog.
In my case, replacing a single disk in a multipathd+zfs pool triggered a 
massive udevd partition scan.

Thanks
Jongwoo Han

On Fri, Apr 26, 2019 at 3:44 AM, Kurt Strosahl 
<stros...@jlab.org> wrote:

Good Afternoon,


As part of a new lustre deployment I've now got two disk shelves connected 
redundantly to two servers.  Since each disk has two paths to the server I'd 
like to use multipathing for both redundancy and improved performance.  I 
haven't found examples or discussion about such a setup, and was wondering if 
there are any resources out there that I could consult.


Of particular interest would be examples of the /etc/zfs/vdev_id.conf and any 
tuning that was done.  I'm also wondering about extra steps that may have to be 
taken when doing a disk replacement to account for the multipathing.  I've got 
plenty of time to experiment with this process, but I'd rather not reinvent the 
wheel if I don't have to.


w/r,

Kurt J. Strosahl
System Administrator: Lustre, HPC
Scientific Computing Group, Thomas Jefferson National Accelerator Facility

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


--
Jongwoo Han
+82-505-227-6108
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS and multipathing for OSTs

2019-04-26 Thread Jongwoo Han
Disk replacement with multipathd + zfs is somewhat inconvenient.

step1: mark offline the disk you should replace with zpool command
step2: remove disk from multipathd table with multipath -f 
step3: replace disk
step4: add disk to multipath table with multipath -ll 
step5:  replace disk in zpool with zpool replace

try this in your test environment and tell us if you find anything
interesting in the syslog.
In my case, replacing a single disk in a multipathd+zfs pool triggered a
massive udevd partition scan.
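
With hypothetical names (pool "tank", failed multipath device mpatha), the five
steps would look roughly like:

zpool offline tank mpatha   # step 1
multipath -f mpatha         # step 2
# step 3: physically replace the disk
multipath -v2               # step 4 (rescans; see the -v2 note elsewhere in
                            # this thread, since -ll only lists)
zpool replace tank mpatha   # step 5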

Thanks
Jongwoo Han

On Fri, Apr 26, 2019 at 3:44 AM, Kurt Strosahl wrote:

> Good Afternoon,
>
>
> As part of a new lustre deployment I've now got two disk shelves
> connected redundantly to two servers.  Since each disk has two paths to the
> server I'd like to use multipathing for both redundancy and improved
> performance.  I haven't found examples or discussion about such a setup,
> and was wondering if there are any resources out there that I could consult.
>
>
> Of particular interest would be examples of the /etc/zfs/vdev_id.conf and
> any tuning that was done.  I'm also wondering about extra steps that may
> have to be taken when doing a disk replacement to account for the
> multipathing.  I've got plenty of time to experiment with this process, but
> I'd rather not reinvent the wheel if I don't have to.
>
>
> w/r,
>
> Kurt J. Strosahl
> System Administrator: Lustre, HPC
> Scientific Computing Group, Thomas Jefferson National Accelerator Facility
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>


-- 
Jongwoo Han
+82-505-227-6108
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] ZFS and multipathing for OSTs

2019-04-25 Thread Kurt Strosahl
Good Afternoon,


As part of a new lustre deployment I've now got two disk shelves connected 
redundantly to two servers.  Since each disk has two paths to the server I'd 
like to use multipathing for both redundancy and improved performance.  I 
haven't found examples or discussion about such a setup, and was wondering if 
there are any resources out there that I could consult.


Of particular interest would be examples of the /etc/zfs/vdev_id.conf and any 
tuning that was done.  I'm also wondering about extra steps that may have to be 
taken when doing a disk replacement to account for the multipathing.  I've got 
plenty of time to experiment with this process, but I'd rather not reinvent the 
wheel if I don't have to.


w/r,

Kurt J. Strosahl
System Administrator: Lustre, HPC
Scientific Computing Group, Thomas Jefferson National Accelerator Facility
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS tuning for MDT/MGS

2019-04-02 Thread Hans Henrik Happe
AFAIK, that is what sync=disabled does. It pretends syncs are committed.
It will flush after 5 seconds but there might be other output that will
stall it longer.
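
For reference, reverting a dataset to the safe default is just (dataset name as
in the settings quoted below):

zfs set sync=standard mdt0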

On 02/04/2019 14.28, Degremont, Aurelien wrote:
> This is very unlikely.
> The only reason that could happen is if the hardware is acknowledging I/O to 
> Lustre that it did not really commit to disk, like a writeback cache, or a 
> Lustre bug.
> 
> On 02/04/2019 14:11, "lustre-discuss on behalf of Hans Henrik Happe" 
>  wrote:
> 
> Isn't there a possibility that the MDS falsely tells the client that a
> transaction has been committed to disk? After that the client might not
> be able to replay, if the MDS dies.
> 
> Cheers,
> Hans Henrik
> 
> On 19/03/2019 21.32, Andreas Dilger wrote:
> > You would need to lose the MDS within a few seconds after the client to
> > lose filesystem operations, since the clients will replay their
> > operations if the MDS crashes, and ZFS commits the current transaction
> > every 1s, so this setting only really affects "sync" from the client. 
> > 
> > Cheers, Andreas
> > 
> > On Mar 19, 2019, at 12:43, George Melikov  > > wrote:
> > 
> >> Can you explain the reason for 'zfs set sync=disabled mdt0'? Are you
> >> ready to lose last transaction on that mdt during power failure? What
> >> did I miss?
> >>
> >> 14.03.2019, 01:00, "Riccardo Veraldi"  >> >:
> >>> these are the zfs settings I use on my MDSes
> >>>
> >>>  zfs set mountpoint=none mdt0
> >>>  zfs set sync=disabled mdt0
> >>>
> >>>  zfs set atime=off mdt0
> >>>  zfs set redundant_metadata=most mdt0
> >>>  zfs set xattr=sa mdt0
> >>>
> >>> if your MDT partition is on a 4KB sector disk then you can use
> >>> ashift=12 when you create the filesystem but zfs is pretty smart and
> >>> in my case it recognized it automatically and used ashift=12
> >>> automatically.
> >>>
> >>> also here are the zfs kernel module parameters I use to have better
> >>> performance. I use it on both MDS and OSSes
> >>>
> >>> options zfs zfs_prefetch_disable=1
> >>> options zfs zfs_txg_history=120
> >>> options zfs metaslab_debug_unload=1
> >>> #
> >>> options zfs zfs_vdev_scheduler=deadline
> >>> options zfs zfs_vdev_async_write_active_min_dirty_percent=20
> >>> #
> >>> options zfs zfs_vdev_scrub_min_active=48
> >>> options zfs zfs_vdev_scrub_max_active=128
> >>> #options zfs zfs_vdev_sync_write_min_active=64
> >>> #options zfs zfs_vdev_sync_write_max_active=128
> >>> #
> >>> options zfs zfs_vdev_sync_write_min_active=8
> >>> options zfs zfs_vdev_sync_write_max_active=32
> >>> options zfs zfs_vdev_sync_read_min_active=8
> >>> options zfs zfs_vdev_sync_read_max_active=32
> >>> options zfs zfs_vdev_async_read_min_active=8
> >>> options zfs zfs_vdev_async_read_max_active=32
> >>> options zfs zfs_top_maxinflight=320
> >>> options zfs zfs_txg_timeout=30
> >>> options zfs zfs_dirty_data_max_percent=40
> >>> options zfs zfs_vdev_async_write_min_active=8
> >>> options zfs zfs_vdev_async_write_max_active=32
> >>>
> >>> some people may disagree with me; anyway, after years of trying
> >>> different options I reached this stable configuration.
> >>>
> >>> then there are a bunch of other important Lustre level optimizations
> >>> that you can do if you are looking for performance increase.
> >>>
> >>> Cheers
> >>>
> >>> Rick
> >>>
> >>> On 3/13/19 11:44 AM, Kurt Strosahl wrote:
> 
>  Good Afternoon,
> 
> 
>  I'm reviewing the zfs parameters for a new metadata system and I
>  was looking to see if anyone had examples (good or bad) of zfs
>  parameters?  I'm assuming that the MDT won't benefit from a
>  recordsize of 1MB, and I've already set the ashift to 12.  I'm using
>  an MDT/MGS made up of a stripe across mirrored ssds.
> 
> 
>  w/r,
> 
>  Kurt
> 
> 
>  ___
>  lustre-discuss mailing list
>  lustre-discuss@lists.lustre.org 
> 
>  http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.o… 
> 
> >>>
> >>>
> >>> ___
> >>> lustre-discuss mailing list
> >>> lustre-discuss@lists.lustre.org
> >>> 
> >>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.o…
> >>> 
> >>>
> >>
> >>
> >> 

Re: [lustre-discuss] ZFS tuning for MDT/MGS

2019-04-02 Thread Degremont, Aurelien
This is very unlikely.
The only reason that could happen is if the hardware is acknowledging I/O to 
Lustre that it did not really commit to disk, like a writeback cache, or a 
Lustre bug.

On 02/04/2019 14:11, "lustre-discuss on behalf of Hans Henrik Happe" 
 wrote:

Isn't there a possibility that the MDS falsely tells the client that a
transaction has been committed to disk? After that the client might not
be able to replay, if the MDS dies.

Cheers,
Hans Henrik

On 19/03/2019 21.32, Andreas Dilger wrote:
> You would need to lose the MDS within a few seconds after the client to
> lose filesystem operations, since the clients will replay their
> operations if the MDS crashes, and ZFS commits the current transaction
> every 1s, so this setting only really affects "sync" from the client. 
> 
> Cheers, Andreas
> 
> On Mar 19, 2019, at 12:43, George Melikov  > wrote:
> 
>> Can you explain the reason for 'zfs set sync=disabled mdt0'? Are you
>> ready to lose last transaction on that mdt during power failure? What
>> did I miss?
>>
>> 14.03.2019, 01:00, "Riccardo Veraldi" > >:
>>> these are the zfs settings I use on my MDSes
>>>
>>>  zfs set mountpoint=none mdt0
>>>  zfs set sync=disabled mdt0
>>>
>>>  zfs set atime=off mdt0
>>>  zfs set redundant_metadata=most mdt0
>>>  zfs set xattr=sa mdt0
>>>
>>> if your MDT partition is on a 4KB sector disk then you can use
>>> ashift=12 when you create the filesystem but zfs is pretty smart and
>>> in my case it recognized it automatically and used ashift=12
>>> automatically.
>>>
>>> also here are the zfs kernel module parameters I use to have better
>>> performance. I use it on both MDS and OSSes
>>>
>>> options zfs zfs_prefetch_disable=1
>>> options zfs zfs_txg_history=120
>>> options zfs metaslab_debug_unload=1
>>> #
>>> options zfs zfs_vdev_scheduler=deadline
>>> options zfs zfs_vdev_async_write_active_min_dirty_percent=20
>>> #
>>> options zfs zfs_vdev_scrub_min_active=48
>>> options zfs zfs_vdev_scrub_max_active=128
>>> #options zfs zfs_vdev_sync_write_min_active=64
>>> #options zfs zfs_vdev_sync_write_max_active=128
>>> #
>>> options zfs zfs_vdev_sync_write_min_active=8
>>> options zfs zfs_vdev_sync_write_max_active=32
>>> options zfs zfs_vdev_sync_read_min_active=8
>>> options zfs zfs_vdev_sync_read_max_active=32
>>> options zfs zfs_vdev_async_read_min_active=8
>>> options zfs zfs_vdev_async_read_max_active=32
>>> options zfs zfs_top_maxinflight=320
>>> options zfs zfs_txg_timeout=30
>>> options zfs zfs_dirty_data_max_percent=40
>>> options zfs zfs_vdev_async_write_min_active=8
>>> options zfs zfs_vdev_async_write_max_active=32
>>>
>>> some people may disagree with me; anyway, after years of trying
>>> different options I reached this stable configuration.
>>>
>>> then there are a bunch of other important Lustre level optimizations
>>> that you can do if you are looking for performance increase.
>>>
>>> Cheers
>>>
>>> Rick
>>>
>>> On 3/13/19 11:44 AM, Kurt Strosahl wrote:

 Good Afternoon,


 I'm reviewing the zfs parameters for a new metadata system and I
 was looking to see if anyone had examples (good or bad) of zfs
 parameters?  I'm assuming that the MDT won't benefit from a
 recordsize of 1MB, and I've already set the ashift to 12.  I'm using
 an MDT/MGS made up of a stripe across mirrored ssds.


 w/r,

 Kurt


 ___
 lustre-discuss mailing list
 lustre-discuss@lists.lustre.org 

 http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.o… 

>>>
>>>
>>> ___
>>> lustre-discuss mailing list
>>> lustre-discuss@lists.lustre.org
>>> 
>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.o…
>>> 
>>>
>>
>>
>> 
>> Sincerely,
>> George Melikov
>>
>> ___
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org 
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> 

Re: [lustre-discuss] ZFS tuning for MDT/MGS

2019-04-02 Thread Hans Henrik Happe
Isn't there a possibility that the MDS falsely tells the client that a
transaction has been committed to disk? After that the client might not
be able to replay, if the MDS dies.

Cheers,
Hans Henrik

On 19/03/2019 21.32, Andreas Dilger wrote:
> You would need to lose the MDS within a few seconds after the client to
> lose filesystem operations, since the clients will replay their
> operations if the MDS crashes, and ZFS commits the current transaction
> every 1s, so this setting only really affects "sync" from the client. 
> 
> Cheers, Andreas
> 
> On Mar 19, 2019, at 12:43, George Melikov  > wrote:
> 
>> Can you explain the reason for 'zfs set sync=disabled mdt0'? Are you
>> ready to lose last transaction on that mdt during power failure? What
>> did I miss?
>>
>> 14.03.2019, 01:00, "Riccardo Veraldi" > >:
>>> these are the zfs settings I use on my MDSes
>>>
>>>  zfs set mountpoint=none mdt0
>>>  zfs set sync=disabled mdt0
>>>
>>>  zfs set atime=off mdt0
>>>  zfs set redundant_metadata=most mdt0
>>>  zfs set xattr=sa mdt0
>>>
>>> if your MDT partition is on a 4KB sector disk then you can use
>>> ashift=12 when you create the filesystem but zfs is pretty smart and
>>> in my case it recognized it automatically and used ashift=12
>>> automatically.
>>>
>>> also here are the zfs kernel module parameters I use to have better
>>> performance. I use it on both MDS and OSSes
>>>
>>> options zfs zfs_prefetch_disable=1
>>> options zfs zfs_txg_history=120
>>> options zfs metaslab_debug_unload=1
>>> #
>>> options zfs zfs_vdev_scheduler=deadline
>>> options zfs zfs_vdev_async_write_active_min_dirty_percent=20
>>> #
>>> options zfs zfs_vdev_scrub_min_active=48
>>> options zfs zfs_vdev_scrub_max_active=128
>>> #options zfs zfs_vdev_sync_write_min_active=64
>>> #options zfs zfs_vdev_sync_write_max_active=128
>>> #
>>> options zfs zfs_vdev_sync_write_min_active=8
>>> options zfs zfs_vdev_sync_write_max_active=32
>>> options zfs zfs_vdev_sync_read_min_active=8
>>> options zfs zfs_vdev_sync_read_max_active=32
>>> options zfs zfs_vdev_async_read_min_active=8
>>> options zfs zfs_vdev_async_read_max_active=32
>>> options zfs zfs_top_maxinflight=320
>>> options zfs zfs_txg_timeout=30
>>> options zfs zfs_dirty_data_max_percent=40
>>> options zfs zfs_vdev_async_write_min_active=8
>>> options zfs zfs_vdev_async_write_max_active=32
>>>
>>> some people may disagree with me; anyway, after years of trying
>>> different options I reached this stable configuration.
>>>
>>> then there are a bunch of other important Lustre level optimizations
>>> that you can do if you are looking for performance increase.
>>>
>>> Cheers
>>>
>>> Rick
>>>
>>> On 3/13/19 11:44 AM, Kurt Strosahl wrote:

 Good Afternoon,


     I'm reviewing the zfs parameters for a new metadata system and I
 was looking to see if anyone had examples (good or bad) of zfs
 parameters?  I'm assuming that the MDT won't benefit from a
 recordsize of 1MB, and I've already set the ashift to 12.  I'm using
 an MDT/MGS made up of a stripe across mirrored ssds.


 w/r,

 Kurt


 ___
 lustre-discuss mailing list
 lustre-discuss@lists.lustre.org 
 
 http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.o… 
 
>>>
>>>
>>> ___
>>> lustre-discuss mailing list
>>> lustre-discuss@lists.lustre.org
>>> 
>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.o…
>>> 
>>>
>>
>>
>> 
>> Sincerely,
>> George Melikov
>>
>> ___
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org 
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> 
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS tuning for MDT/MGS

2019-03-20 Thread Riccardo Veraldi

On 3/19/19 11:46 AM, Degremont, Aurelien wrote:


Also, if you’re not using Lustre 2.11 or 2.12, do not forget 
dnodesize=auto and recordsize=1M for OST


zfs set dnodesize=auto mdt0

zfs set dnodesize=auto ostX

https://jira.whamcloud.com/browse/LU-8342


good point, thank you



(useful for 2.10 LTS. Automatically done by Lustre for 2.11+)

*From:* lustre-discuss  on behalf 
of "Carlson, Timothy S" 

*Date:* Wednesday, 13 March 2019 at 23:07
*To:* Riccardo Veraldi , Kurt Strosahl 
, "lustre-discuss@lists.lustre.org" 


*Subject:* Re: [lustre-discuss] ZFS tuning for MDT/MGS

+1 on

options zfs zfs_prefetch_disable=1


Might not be as critical now, but that was a must-have on Lustre 2.5.x

Tim

*From:* lustre-discuss  *On 
Behalf Of *Riccardo Veraldi

*Sent:* Wednesday, March 13, 2019 3:00 PM
*To:* Kurt Strosahl ; lustre-discuss@lists.lustre.org
*Subject:* Re: [lustre-discuss] ZFS tuning for MDT/MGS

these are the zfs settings I use on my MDSes


 zfs set mountpoint=none mdt0
 zfs set sync=disabled mdt0
 zfs set atime=off mdt0
 zfs set redundant_metadata=most mdt0
 zfs set xattr=sa mdt0

if your MDT partition is on a 4KB sector disk then you can use 
ashift=12 when you create the filesystem but zfs is pretty smart and 
in my case it recognized it automatically and used ashift=12 
automatically.


also here are the zfs kernel module parameters I use to have better 
performance. I use it on both MDS and OSSes


options zfs zfs_prefetch_disable=1
options zfs zfs_txg_history=120
options zfs metaslab_debug_unload=1
#
options zfs zfs_vdev_scheduler=deadline
options zfs zfs_vdev_async_write_active_min_dirty_percent=20
#
options zfs zfs_vdev_scrub_min_active=48
options zfs zfs_vdev_scrub_max_active=128
#options zfs zfs_vdev_sync_write_min_active=64
#options zfs zfs_vdev_sync_write_max_active=128
#
options zfs zfs_vdev_sync_write_min_active=8
options zfs zfs_vdev_sync_write_max_active=32
options zfs zfs_vdev_sync_read_min_active=8
options zfs zfs_vdev_sync_read_max_active=32
options zfs zfs_vdev_async_read_min_active=8
options zfs zfs_vdev_async_read_max_active=32
options zfs zfs_top_maxinflight=320
options zfs zfs_txg_timeout=30
options zfs zfs_dirty_data_max_percent=40
options zfs zfs_vdev_async_write_min_active=8
options zfs zfs_vdev_async_write_max_active=32

some people may disagree with me; anyway, after years of trying 
different options I reached this stable configuration.


then there are a bunch of other important Lustre level optimizations 
that you can do if you are looking for performance increase.


Cheers

Rick

On 3/13/19 11:44 AM, Kurt Strosahl wrote:

Good Afternoon,

    I'm reviewing the zfs parameters for a new metadata system and
I was looking to see if anyone had examples (good or bad) of zfs
parameters?  I'm assuming that the MDT won't benefit from a
recordsize of 1MB, and I've already set the ashift to 12.  I'm
using an MDT/MGS made up of a stripe across mirrored ssds.

w/r,

Kurt




___

lustre-discuss mailing list

lustre-discuss@lists.lustre.org

http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS tuning for MDT/MGS

2019-03-19 Thread Degremont, Aurelien
Also, if you’re not using Lustre 2.11 or 2.12, do not forget dnodesize=auto and 
recordsize=1M for OST

zfs set dnodesize=auto mdt0
zfs set dnodesize=auto ostX

https://jira.whamcloud.com/browse/LU-8342

(useful for 2.10 LTS. Automatically done by Lustre for 2.11+)
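
Presumably the matching recordsize setting mentioned above for 2.10 OSTs would
be applied the same way:

zfs set recordsize=1M ostX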

From: lustre-discuss  on behalf of 
"Carlson, Timothy S" 
Date: Wednesday, 13 March 2019 at 23:07
To: Riccardo Veraldi , Kurt Strosahl 
, "lustre-discuss@lists.lustre.org" 

Subject: Re: [lustre-discuss] ZFS tuning for MDT/MGS

+1 on

options zfs zfs_prefetch_disable=1


Might not be as critical now, but that was a must-have on Lustre 2.5.x

Tim

From: lustre-discuss  On Behalf Of 
Riccardo Veraldi
Sent: Wednesday, March 13, 2019 3:00 PM
To: Kurt Strosahl ; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] ZFS tuning for MDT/MGS

these are the zfs settings I use on my MDSes

 zfs set mountpoint=none mdt0
 zfs set sync=disabled mdt0
 zfs set atime=off mdt0
 zfs set redundant_metadata=most mdt0
 zfs set xattr=sa mdt0

if your MDT partition is on a 4KB sector disk then you can use ashift=12 when 
you create the filesystem but zfs is pretty smart and in my case it recognized 
it automatically and used ashift=12 automatically.

also here are the zfs kernel module parameters I use to have better 
performance. I use it on both MDS and OSSes

options zfs zfs_prefetch_disable=1
options zfs zfs_txg_history=120
options zfs metaslab_debug_unload=1
#
options zfs zfs_vdev_scheduler=deadline
options zfs zfs_vdev_async_write_active_min_dirty_percent=20
#
options zfs zfs_vdev_scrub_min_active=48
options zfs zfs_vdev_scrub_max_active=128
#options zfs zfs_vdev_sync_write_min_active=64
#options zfs zfs_vdev_sync_write_max_active=128
#
options zfs zfs_vdev_sync_write_min_active=8
options zfs zfs_vdev_sync_write_max_active=32
options zfs zfs_vdev_sync_read_min_active=8
options zfs zfs_vdev_sync_read_max_active=32
options zfs zfs_vdev_async_read_min_active=8
options zfs zfs_vdev_async_read_max_active=32
options zfs zfs_top_maxinflight=320
options zfs zfs_txg_timeout=30
options zfs zfs_dirty_data_max_percent=40
options zfs zfs_vdev_async_write_min_active=8
options zfs zfs_vdev_async_write_max_active=32

some people may disagree with me; anyway, after years of trying different options 
I reached this stable configuration.

then there are a bunch of other important Lustre level optimizations that you 
can do if you are looking for performance increase.

Cheers

Rick

On 3/13/19 11:44 AM, Kurt Strosahl wrote:

Good Afternoon,



I'm reviewing the zfs parameters for a new metadata system and I was 
looking to see if anyone had examples (good or bad) of zfs parameters?  I'm 
assuming that the MDT won't benefit from a recordsize of 1MB, and I've already 
set the ashift to 12.  I'm using an MDT/MGS made up of a stripe across mirrored 
ssds.



w/r,

Kurt




___

lustre-discuss mailing list

lustre-discuss@lists.lustre.org

http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS tuning for MDT/MGS

2019-03-13 Thread Carlson, Timothy S
+1 on

options zfs zfs_prefetch_disable=1

Might not be as critical now, but that was a must-have on Lustre 2.5.x

Tim

From: lustre-discuss  On Behalf Of 
Riccardo Veraldi
Sent: Wednesday, March 13, 2019 3:00 PM
To: Kurt Strosahl ; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] ZFS tuning for MDT/MGS

these are the zfs settings I use on my MDSes

 zfs set mountpoint=none mdt0
 zfs set sync=disabled mdt0
 zfs set atime=off mdt0
 zfs set redundant_metadata=most mdt0
 zfs set xattr=sa mdt0

If your MDT partition is on a 4KB-sector disk then you can use ashift=12 when 
you create the pool, but ZFS is pretty smart and in my case it recognized this 
automatically and used ashift=12 on its own.

Also, here are the ZFS kernel module parameters I use to have better 
performance. I use them on both the MDS and the OSSes.

options zfs zfs_prefetch_disable=1
options zfs zfs_txg_history=120
options zfs metaslab_debug_unload=1
#
options zfs zfs_vdev_scheduler=deadline
options zfs zfs_vdev_async_write_active_min_dirty_percent=20
#
options zfs zfs_vdev_scrub_min_active=48
options zfs zfs_vdev_scrub_max_active=128
#options zfs zfs_vdev_sync_write_min_active=64
#options zfs zfs_vdev_sync_write_max_active=128
#
options zfs zfs_vdev_sync_write_min_active=8
options zfs zfs_vdev_sync_write_max_active=32
options zfs zfs_vdev_sync_read_min_active=8
options zfs zfs_vdev_sync_read_max_active=32
options zfs zfs_vdev_async_read_min_active=8
options zfs zfs_vdev_async_read_max_active=32
options zfs zfs_top_maxinflight=320
options zfs zfs_txg_timeout=30
options zfs zfs_dirty_data_max_percent=40
options zfs zfs_vdev_async_write_min_active=8
options zfs zfs_vdev_async_write_max_active=32

Some people may disagree with me; anyway, after years of trying different 
options I reached this stable configuration.

Then there are a bunch of other important Lustre-level optimizations that you 
can do if you are looking for a performance increase.

Cheers

Rick

On 3/13/19 11:44 AM, Kurt Strosahl wrote:

Good Afternoon,



I'm reviewing the zfs parameters for a new metadata system and I was 
looking to see if anyone had examples (good or bad) of zfs parameters?  I'm 
assuming that the MDT won't benefit from a recordsize of 1MB, and I've already 
set the ashift to 12.  I'm using an MDT/MGS made up of a stripe across mirrored 
ssds.



w/r,

Kurt



___

lustre-discuss mailing list

lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>

http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS tuning for MDT/MGS

2019-03-13 Thread Riccardo Veraldi

these are the zfs settings I use on my MDSes

 zfs set mountpoint=none mdt0
 zfs set sync=disabled mdt0
 zfs set atime=off mdt0
 zfs set redundant_metadata=most mdt0
 zfs set xattr=sa mdt0

If your MDT partition is on a 4KB-sector disk then you can use 
ashift=12 when you create the pool, but ZFS is pretty smart and in 
my case it recognized this automatically and used ashift=12 on its own.


Also, here are the ZFS kernel module parameters I use to have better 
performance. I use them on both the MDS and the OSSes.


options zfs zfs_prefetch_disable=1
options zfs zfs_txg_history=120
options zfs metaslab_debug_unload=1
#
options zfs zfs_vdev_scheduler=deadline
options zfs zfs_vdev_async_write_active_min_dirty_percent=20
#
options zfs zfs_vdev_scrub_min_active=48
options zfs zfs_vdev_scrub_max_active=128
#options zfs zfs_vdev_sync_write_min_active=64
#options zfs zfs_vdev_sync_write_max_active=128
#
options zfs zfs_vdev_sync_write_min_active=8
options zfs zfs_vdev_sync_write_max_active=32
options zfs zfs_vdev_sync_read_min_active=8
options zfs zfs_vdev_sync_read_max_active=32
options zfs zfs_vdev_async_read_min_active=8
options zfs zfs_vdev_async_read_max_active=32
options zfs zfs_top_maxinflight=320
options zfs zfs_txg_timeout=30
options zfs zfs_dirty_data_max_percent=40
options zfs zfs_vdev_async_write_min_active=8
options zfs zfs_vdev_async_write_max_active=32

Some people may disagree with me; anyway, after years of trying different 
options I reached this stable configuration.


Then there are a bunch of other important Lustre-level optimizations 
that you can do if you are looking for a performance increase.


Cheers

Rick

On 3/13/19 11:44 AM, Kurt Strosahl wrote:


Good Afternoon,


    I'm reviewing the zfs parameters for a new metadata system and I 
was looking to see if anyone had examples (good or bad) of zfs 
parameters? I'm assuming that the MDT won't benefit from a recordsize 
of 1MB, and I've already set the ashift to 12.  I'm using an MDT/MGS 
made up of a stripe across mirrored ssds.



w/r,

Kurt


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] ZFS tuning for MDT/MGS

2019-03-13 Thread Kurt Strosahl
Good Afternoon,


I'm reviewing the zfs parameters for a new metadata system and I was 
looking to see if anyone had examples (good or bad) of zfs parameters?  I'm 
assuming that the MDT won't benefit from a recordsize of 1MB, and I've already 
set the ashift to 12.  I'm using an MDT/MGS made up of a stripe across mirrored 
ssds.


w/r,

Kurt
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS based OSTs need advice

2018-06-27 Thread Zeeshan Ali Shah
Thanks a lot for guidance, I wl kick the installation in 1-2 days.

/Zee

On Wed, Jun 27, 2018 at 2:16 AM Cowe, Malcolm J 
wrote:

> You can create pools and format the storage on a single node, provided
> that the correct `--servicenode` parameters are applied to the format
> command (i.e. the NIDs for each OSS in the HA pair). Then export half of
> the ZFS pools from the first node and import them to the other node.
>
>
>
> There is some documentation that describes the process here:
>
>
>
> http://wiki.lustre.org/Category:Lustre_Systems_Administration
>
>
>
> This includes sections on HA with Pacemaker:
>
>
>
> http://wiki.lustre.org/Managing_Lustre_as_a_High_Availability_Service
>
>
> http://wiki.lustre.org/Creating_a_Framework_for_High_Availability_with_Pacemaker
>
>
> http://wiki.lustre.org/Lustre_Server_Fault_Isolation_with_Pacemaker_Node_Fencing
>
>
> http://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services
>
>
>
>
>
> For OSD and OSS stuff:
>
>
>
> http://wiki.lustre.org/ZFS_OSD_Storage_Basics
>
> http://wiki.lustre.org/Introduction_to_Lustre_Object_Storage_Devices_(OSDs)
>
> http://wiki.lustre.org/Creating_Lustre_Object_Storage_Services_(OSS)
>
>
>
> There are also sections that cover the MGT and MDTs.
>
>
>
> Malcolm.
>
>
>
>
>
> *From: *lustre-discuss  on
> behalf of Zeeshan Ali Shah 
> *Date: *Wednesday, 27 June 2018 at 1:53 am
> *To: *Lustre discussion 
> *Subject: *Re: [lustre-discuss] ZFS based OSTs need advice
>
>
>
> Our OST are based on supermicro SSG-J4000-LUSTRE-OST , it is a kind of
> JBOD.
>
>
>
> all 360 Disks (90 disks x4 OST) appear in /dev/disk in both OSS1 and OSS2
> .
>
>
>
> My idea is to create zpools of raidz2 (9+2 spare), which means around 36
> zpools will be created.
>
>
>
> Q1) Out of the 36 ZFS pools, shall I create all 36 pools in OSS1? In that
> case those pools can only be imported on OSS1, not on OSS2; how do we gain
> active/active HA here?
>
> Q2) The second option is to create 18 zpools on OSS1 and 18 on OSS2, then in
> mkfs.lustre specify oss1 as primary and oss2 as secondary (executed on
> oss1), and the second time execute the same command on oss2, making oss2
> primary and oss1 secondary.
>
>
>
> Does that make sense? Am I missing something?
>
>
>
> Thanks a lot
>
>
>
>
>
> /Zee
>
>
>
>
>
> On Tue, Jun 26, 2018 at 5:38 PM, Dzmitryj Jakavuk 
> wrote:
>
> Hello
>
> You can share the 4 OSTs between the pair of OSSes, with 2 OSTs imported on
> one OSS and 2 OSTs on the other. At the same time the HDDs need to be shared
> between all OSSes. So in normal conditions one OSS will import 2 OSTs and
> the second OSS will import the other 2 OSTs; in case of an HA failover a
> single OSS can import all 4 OSTs.
>
> Kind Regards
> Dzmitryj Jakavuk
>
>
> > On Jun 26, 2018, at 16:02, Zeeshan Ali Shah 
> wrote:
> >
> > We have 2 OSS with 4 OST shared . Each OST has 90 Disk so total 360
> Disks .
> >
> > I am in phase of installing 2OSS as active/active but as zfs pools can
> only be imported in single OSS host in this case how to achieve
> active/active HA ?
> > As what i read is that for active/active both HA hosts should have
> access to a same sets of disks/volumes.
> >
> > any advice ?
> >
> >
> > /Zeeshan
> >
> >
> >
>
> > ___
> > lustre-discuss mailing list
> > lustre-discuss@lists.lustre.org
> > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
>
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS based OSTs need advice

2018-06-26 Thread Cowe, Malcolm J
You can create pools and format the storage on a single node, provided that the 
correct `--servicenode` parameters are applied to the format command (i.e. the 
NIDs for each OSS in the HA pair). Then export half of the ZFS pools from the 
first node and import them to the other node.
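
For illustration, a hypothetical format command for one OST in such an HA pair
(fsname, index, NIDs, pool and disk names below are all placeholders):

mkfs.lustre --ost --backfstype=zfs --fsname=lustre0 --index=0 \
    --mgsnode=10.0.0.1@tcp \
    --servicenode=10.0.0.11@tcp --servicenode=10.0.0.12@tcp \
    ostpool0/ost0 raidz2 sda sdb sdc sdd sde sdf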

There is some documentation that describes the process here:

http://wiki.lustre.org/Category:Lustre_Systems_Administration

This includes sections on HA with Pacemaker:

http://wiki.lustre.org/Managing_Lustre_as_a_High_Availability_Service
http://wiki.lustre.org/Creating_a_Framework_for_High_Availability_with_Pacemaker
http://wiki.lustre.org/Lustre_Server_Fault_Isolation_with_Pacemaker_Node_Fencing
http://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services


For OSD and OSS stuff:

http://wiki.lustre.org/ZFS_OSD_Storage_Basics
http://wiki.lustre.org/Introduction_to_Lustre_Object_Storage_Devices_(OSDs)
http://wiki.lustre.org/Creating_Lustre_Object_Storage_Services_(OSS)

There are also sections that cover the MGT and MDTs.

Malcolm.


From: lustre-discuss  on behalf of 
Zeeshan Ali Shah 
Date: Wednesday, 27 June 2018 at 1:53 am
To: Lustre discussion 
Subject: Re: [lustre-discuss] ZFS based OSTs need advice

Our OST are based on supermicro SSG-J4000-LUSTRE-OST , it is a kind of JBOD.


all 360 Disks (90 disks x4 OST) appear in /dev/disk in both OSS1 and OSS2 .

My idea is to create zpools of raidz2 (9+2 spare), which means around 36 
zpools will be created.

Q1) Out of the 36 ZFS pools, shall I create all 36 pools in OSS1? In that case 
those pools can only be imported on OSS1, not on OSS2; how do we gain 
active/active HA here?
Q2) The second option is to create 18 zpools on OSS1 and 18 on OSS2, then in 
mkfs.lustre specify oss1 as primary and oss2 as secondary (executed on oss1), 
and the second time execute the same command on oss2, making oss2 primary and 
oss1 secondary.

Does that make sense? Am I missing something?

Thanks a lot


/Zee


On Tue, Jun 26, 2018 at 5:38 PM, Dzmitryj Jakavuk 
mailto:dzmit...@gmail.com>> wrote:
Hello

You can share the 4 OSTs between the pair of OSSes, with 2 OSTs imported on one 
OSS and 2 OSTs on the other. At the same time the HDDs need to be shared between 
all OSSes. So in normal conditions one OSS will import 2 OSTs and the second OSS 
will import the other 2 OSTs; in case of an HA failover a single OSS can import 
all 4 OSTs.

Kind Regards
Dzmitryj Jakavuk

> On Jun 26, 2018, at 16:02, Zeeshan Ali Shah 
> mailto:javacli...@gmail.com>> wrote:
>
> We have 2 OSS with 4 OST shared . Each OST has 90 Disk so total 360 Disks .
>
> I am in phase of installing 2OSS as active/active but as zfs pools can only 
> be imported in single OSS host in this case how to achieve active/active HA ?
> As what i read is that for active/active both HA hosts should have access to 
> a same sets of disks/volumes.
>
> any advice ?
>
>
> /Zeeshan
>
>
>
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS based OSTs need advice

2018-06-26 Thread Zeeshan Ali Shah
Our OST are based on supermicro SSG-J4000-LUSTRE-OST , it is a kind of
JBOD.

all 360 Disks (90 disks x4 OST) appear in /dev/disk in both OSS1 and OSS2 .

My idea is to create zpools of raidz2 (9+2 spare), which means around 36
zpools will be created.

Q1) Out of the 36 ZFS pools, shall I create all 36 pools in OSS1? In that
case those pools can only be imported on OSS1, not on OSS2; how do we gain
active/active HA here?
Q2) The second option is to create 18 zpools on OSS1 and 18 on OSS2, then in
mkfs.lustre specify oss1 as primary and oss2 as secondary (executed on
oss1), and the second time execute the same command on oss2, making oss2
primary and oss1 secondary.

Does that make sense? Am I missing something?

Thanks a lot


/Zee


On Tue, Jun 26, 2018 at 5:38 PM, Dzmitryj Jakavuk 
wrote:

> Hello
>
> You can share the 4 OSTs between the pair of OSSes, with 2 OSTs imported on
> one OSS and 2 OSTs on the other. At the same time the HDDs need to be shared
> between all OSSes. So in normal conditions one OSS will import 2 OSTs and
> the second OSS will import the other 2 OSTs; in case of an HA failover a
> single OSS can import all 4 OSTs.
>
> Kind Regards
> Dzmitryj Jakavuk
>
> > On Jun 26, 2018, at 16:02, Zeeshan Ali Shah 
> wrote:
> >
> > We have 2 OSS with 4 OST shared . Each OST has 90 Disk so total 360
> Disks .
> >
> > I am in phase of installing 2OSS as active/active but as zfs pools can
> only be imported in single OSS host in this case how to achieve
> active/active HA ?
> > As what i read is that for active/active both HA hosts should have
> access to a same sets of disks/volumes.
> >
> > any advice ?
> >
> >
> > /Zeeshan
> >
> >
> >
> > ___
> > lustre-discuss mailing list
> > lustre-discuss@lists.lustre.org
> > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS based OSTs need advice

2018-06-26 Thread Dzmitryj Jakavuk
Hello

You can share the 4 OSTs between the pair of OSSes, with 2 OSTs imported on one 
OSS and 2 OSTs on the other. At the same time the HDDs need to be shared between 
all OSSes. So in normal conditions one OSS will import 2 OSTs and the second OSS 
will import the other 2 OSTs; in case of an HA failover a single OSS can import 
all 4 OSTs.

Kind Regards
Dzmitryj Jakavuk

> On Jun 26, 2018, at 16:02, Zeeshan Ali Shah  wrote:
> 
> We have 2 OSS with 4 OST shared . Each OST has 90 Disk so total 360 Disks . 
> 
> I am in phase of installing 2OSS as active/active but as zfs pools can only 
> be imported in single OSS host in this case how to achieve active/active HA ?
> As what i read is that for active/active both HA hosts should have access to 
> a same sets of disks/volumes. 
> 
> any advice ?
> 
> 
> /Zeeshan
> 
> 
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS based OSTs need advice

2018-06-26 Thread Nathaniel Clark


I assume you mean that your storage is active-active and cross-connected
between your OSSes, and you want both OSSes to be able to present any of
the 4 OSTs.

Each OST should be its own zpool; this will let each OST be imported /
failed over between OSSes independently.
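
Manual failover of one such zpool is then just an export/import (pool, dataset
and mount point names are placeholders; Pacemaker automates this sequence):

zpool export ost0pool                    # on the node giving up the OST
zpool import ost0pool                    # on the node taking it over; add -f
                                         # if the old node died without exporting
mount -t lustre ost0pool/ost0 /mnt/ost0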

There's a guide on how to set up Pacemaker v1.1 (the default for el7) to do
failover with ZFS and Lustre 2.10+:
https://wiki.whamcloud.com/display/PUB/Using+Pacemaker+1.1+with+a+Lustre+File+System

--
Nathaniel Clark

On Tue, 2018-06-26 at 16:02 +0300, Zeeshan Ali Shah wrote:
> We have 2 OSS with 4 OST shared . Each OST has 90 Disk so total 360
> Disks . 
> I am in phase of installing 2OSS as active/active but as zfs pools
> can only be imported in single OSS host in this case how to achieve
> active/active HA ?
> As what i read is that for active/active both HA hosts should have
> access to a same sets of disks/volumes. 
> 
> any advice ?
> 
> 
> /Zeeshan
> 
> 
> 
> 
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] ZFS based OSTs need advice

2018-06-26 Thread Zeeshan Ali Shah
We have 2 OSS with 4 OST shared . Each OST has 90 Disk so total 360 Disks .

I am in phase of installing 2OSS as active/active but as zfs pools can only
be imported in single OSS host in this case how to achieve active/active HA
?
As what i read is that for active/active both HA hosts should have access
to a same sets of disks/volumes.

any advice ?


/Zeeshan
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] zfs has native dnode accounting supported... no

2018-05-17 Thread Hans Henrik Happe
Thanks Andreas,

The patch works for me.

Cheers,
Hans Henrik

On 16-05-2018 10:55, Dilger, Andreas wrote:
> On May 16, 2018, at 00:22, Hans Henrik Happe  wrote:
>>
>> When building 2.10.4-RC1 on CentOS 7.5 I noticed this during configure:
>>
>> zfs has native dnode accounting supported... no
>>
>> I'm using the kmod version of ZFS 0.7.9 from the official repos.
>> Shouldn't native dnode accounting work with these versions?
>>
>> Is there a way to detect if a Lustre filesystem is using native dnode
>> accounting?
> 
> This looks like a bug.  The Lustre code was changed to detect ZFS project
> quota (which has a different function signature in ZFS 0.8 and isn't
> included in the ZFS 0.7.x releases), but it lost the ability to detect the
> old dnode accounting function signature.
> 
> I've pushed patch https://review.whamcloud.com/32418 that should fix this.
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Intel Corporation
> 
> 
> 
> 
> 
> 
> 
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] zfs has native dnode accounting supported... no

2018-05-16 Thread Dilger, Andreas
On May 16, 2018, at 00:22, Hans Henrik Happe  wrote:
> 
> When building 2.10.4-RC1 on CentOS 7.5 I noticed this during configure:
> 
> zfs has native dnode accounting supported... no
> 
> I'm using the kmod version of ZFS 0.7.9 from the official repos.
> Shouldn't native dnode accounting work with these versions?
> 
> Is there a way to detect if a Lustre filesystem is using native dnode
> accounting?

This looks like a bug.  The Lustre code was changed to detect ZFS project
quota (which has a different function signature in ZFS 0.8 and isn't
included in the ZFS 0.7.x releases), but it lost the ability to detect the
old dnode accounting function signature.

I've pushed patch https://review.whamcloud.com/32418 that should fix this.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] zfs has native dnode accounting supported... no

2018-05-16 Thread Hans Henrik Happe
Hi,

When building 2.10.4-RC1 on CentOS 7.5 I noticed this during configure:

zfs has native dnode accounting supported... no

I'm using the kmod version of ZFS 0.7.9 from the official repos.
Shouldn't native dnode accounting work with these versions?

Is there a way to detect if a Lustre filesystem is using native dnode
accounting?

Cheers,
Hans Henrik
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS-OST layout, number of OSTs

2017-10-28 Thread Dilger, Andreas
Having L2ARC on disks has no benefit at all.  It only makes sense if the L2ARC 
devices are on much faster storage (i.e. SSDs/NVMe) than the rest of the pool.  
Otherwise, the data could just be read from the disks directly.
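
If one did have faster devices, a cache vdev is simply added per pool, e.g.
with hypothetical pool/device names:

zpool add ost0pool cache /dev/nvme0n1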

Cheers, Andreas

On Oct 26, 2017, at 10:13, Mannthey, Keith <keith.mannt...@intel.com> wrote:
> 
> I have seen both small and large OSTs work; it just depends on what you want in 
> the system (size/performance/manageability). Do benchmark both, as they will 
> differ somewhat in overall performance. 
> 
> L2ARC read cache can help some workloads.  It takes multiple reads for data to 
> be moved into the cache, so standard benchmarking (IOR and other streaming 
> benchmarks) won't see much of a change.  
> 
> Thanks,
> Keith 
> 
> -Original Message-
> From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On 
> Behalf Of Thomas Roth
> Sent: Thursday, October 26, 2017 1:50 AM
> To: Lustre Discuss <lustre-discuss@lists.lustre.org>
> Subject: Re: [lustre-discuss] ZFS-OST layout, number of OSTs
> 
> On the other hand if we gather three or four raidz2s into one zpool/OST, loss 
> of one raidz means loss of a 120-160TB OST.
> Around here, this is usually the deciding argument. (Even temporarily taking 
> down one OST for whatever repairs would take more data offline).
> 
> 
> How is the general experience with having an l2arc on additional disks?
> In my test attempts I did not see much benefit under Lustre.
> 
> With our type of hardware, we do not have room for one drive per (small) 
> zpool - if there were only one or two zpools per box, this would be possible.
> 
> Regards
> Thomas
> 
> On 10/24/2017 09:41 PM, Cory Spitz wrote:
>> It’s also worth noting that if you have small OSTs it’s much easier to bump 
>> into a full OST situation.   And specifically, if you singly stripe a file 
>> the file size is limited by the size of the OST.
>> 
>> -Cory
>> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS-OST layout, number of OSTs

2017-10-26 Thread Mannthey, Keith
I have seen both small and large OSTs work; it just depends on what you want in 
the system (size/performance/manageability). Do benchmark both, as they will 
differ somewhat in overall performance. 

L2ARC read cache can help some workloads.  It takes multiple reads for data to be 
moved into the cache, so standard benchmarking (IOR and other streaming 
benchmarks) won't see much of a change.  

Thanks,
 Keith 

-Original Message-
From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf 
Of Thomas Roth
Sent: Thursday, October 26, 2017 1:50 AM
To: Lustre Discuss <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] ZFS-OST layout, number of OSTs

On the other hand if we gather three or four raidz2s into one zpool/OST, loss 
of one raidz means loss of a 120-160TB OST.
Around here, this is usually the deciding argument. (Even temporarily taking 
down one OST for whatever repairs would take more data offline).


How is the general experience with having an l2arc on additional disks?
In my test attempts I did not see much benefit under Lustre.

With our type of hardware, we do not have room for one drive per (small) zpool 
- if there were only one or two zpools per box, this would be possible.

Regards
Thomas

On 10/24/2017 09:41 PM, Cory Spitz wrote:
> It’s also worth noting that if you have small OSTs it’s much easier to bump 
> into a full OST situation.   And specifically, if you singly stripe a file 
> the file size is limited by the size of the OST.
> 
> -Cory
> 
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS-OST layout, number of OSTs

2017-10-26 Thread Thomas Roth
On the other hand if we gather three or four raidz2s into one zpool/OST, loss of one raidz means loss 
of a 120-160TB OST.
Around here, this is usually the deciding argument. (Even temporarily taking down one OST for whatever 
repairs would take more data offline).



How is the general experience with having an l2arc on additional disks?
In my test attempts I did not see much benefit under Lustre.

With our type of hardware, we do not have room for one drive per (small) zpool - if there were only 
one or two zpools per box, this would be possible.


Regards
Thomas

On 10/24/2017 09:41 PM, Cory Spitz wrote:

It’s also worth noting that if you have small OSTs it’s much easier to bump 
into a full OST situation.   And specifically, if you singly stripe a file the 
file size is limited by the size of the OST.

-Cory


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS-OST layout, number of OSTs

2017-10-24 Thread Patrick Farrell
It can be pretty easily inferred from the nature of the feature.


If a decent policy is written and applied to all files (starting with few 
stripes and going to many as size increases), then it will resolve the problem 
of large files on single OSTs.  If the policy is not universally applied or is 
poorly constructed, you may have issues.


Otherwise, as long as users are not restricting file creation to a single or 
small # of OSTs, then there's not really any way for them to fill up a single 
OST without filling up all of them.
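
For example, a hypothetical default PFL policy on a directory (the extent
boundaries and stripe counts are placeholders, not a recommendation):

lfs setstripe -E 256M -c 1 -E 4G -c 4 -E -1 -c -1 /mnt/lustre/project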


- Patrick


From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Mark Hahn <h...@mcmaster.ca>
Sent: Tuesday, October 24, 2017 3:21:47 PM
To: Lustre Discuss
Subject: Re: [lustre-discuss] ZFS-OST layout, number of OSTs

> It's also worth noting that if you have small OSTs it's much easier to bump
> into a full OST situation.  And specifically, if you singly stripe a file
> the file size is limited by the size of the OST.

is there enough real-life experience to know whether
progressive file layout will mitigate this issue?

thanks,
Mark Hahn | SHARCnet Sysadmin | h...@sharcnet.ca | http://www.sharcnet.ca
   | McMaster RHPCS| h...@mcmaster.ca | 905 525 9140 x24687
   | Compute/Calcul Canada| http://www.computecanada.ca
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS-OST layout, number of OSTs

2017-10-24 Thread Mark Hahn

It's also worth noting that if you have small OSTs it's much easier to bump
into a full OST situation.  And specifically, if you singly stripe a file
the file size is limited by the size of the OST.


is there enough real-life experience to know whether 
progressive file layout will mitigate this issue?


thanks,
Mark Hahn | SHARCnet Sysadmin | h...@sharcnet.ca | http://www.sharcnet.ca
  | McMaster RHPCS| h...@mcmaster.ca | 905 525 9140 x24687
  | Compute/Calcul Canada| http://www.computecanada.ca
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS-OST layout, number of OSTs

2017-10-24 Thread Cory Spitz
It’s also worth noting that if you have small OSTs it’s much easier to bump 
into a full OST situation.   And specifically, if you singly stripe a file the 
file size is limited by the size of the OST.

-Cory

-- 


On 10/22/17, 1:39 PM, "lustre-discuss on behalf of Thomas Roth" 
<lustre-discuss-boun...@lists.lustre.org on behalf of t.r...@gsi.de> wrote:

Hi Patrick,

thanks for the clarification. One thing less to worry about.
Since our users mainly do small file I/O, it would seem that the random I/O 
IOPS numbers are the relevant quantity - and there the smaller OSTs are the 
fast ones ;-)

Cheers
Thomas

On 22.10.2017 20:21, Patrick Farrell wrote:
> Thomas,
> 
> This is likely a reflection of an older issue, since resolved.  For a 
long time, Lustre reserved max_rpcs_in_flight*max_pages_per_rpc for each OST 
(on the client).  This was a huge memory commitment in larger setups, but was 
resolved a few versions back, and now per OST memory usage on the client is 
pretty trivial when the client isn’t doing I/O to that OST.  The main arguments 
against large OST counts are probably the pain of managing larger numbers of 
them, and individual OSTs being slow (because they use fewer disks), requiring 
users to stripe files more widely to see the benefit.  This is both an 
administrative burden for users and uses more space on the metadata server to 
track the file layouts.
> 
> But if your MDT is large and your users amenable to thinking about that 
(or you set a good default striping policy - progressive file layouts from 2.10 
are wonderful for this), then it’s probably fine.  The largest OST counts I am 
aware of are in the low thousands.
> 
> Ah, one more thing - clients must ping every OST periodically if they 
haven’t otherwise contacted it within the required interval.  This can 
contribute to network traffic and CPU noise/jitter on the clients.  I don’t 
have a good sense of how serious this is in practice, but I know some larger 
sites worry about it.
> 
> - Patrick
> 
> 
> 
> From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf 
of Thomas Roth <t.r...@gsi.de>
> Sent: Sunday, October 22, 2017 9:04:35 AM
> To: Lustre Discuss
> Subject: [lustre-discuss] ZFS-OST layout, number of OSTs
> 
> Hi all,
> 
> I have done some "fio" benchmarking, amongst other things to test the 
proposition that to get more iops, the number of disks per raidz should be less.
> I was happy I could reproduce that: one server with 30 disks in one 
raidz2 (=one zpool = one OST) is indeed slower than one with 30 disks in three
> raidz2 (one zpool, one OST).
> I ran fio also on a third server where the 30 disks make up 3 raidz2 = 3 
zpools = 3 OSTs, that one is faster still.
> 
> Now I seem to remember a warning not to have too many OSTs in one Lustre, 
because each OST eats some memory on the client. I haven't found that
> reference, and I would like to ask what the critical numbers might be? 
How much RAM are we talking about? Is there any other "wise" limit on the OST
> number?
> Currently our clients are equipped with 128 or 256 GB RAM.  We have 550 
OSTs in the system, but the next cluster could easily grow much larger here if
> we stick to the small OSTs.
> 
> Regards,
> Thomas
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> 

-- 

Thomas Roth
Department: HPC
Location: SB3 1.262
Phone: +49-6159-71 1453  Fax: +49-6159-71 2986

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1
64291 Darmstadt
www.gsi.de

Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Darmstadt
Handelsregister: Amtsgericht Darmstadt, HRB 1528

Geschäftsführung: Professor Dr. Paolo Giubellino
Ursula Weyrich
Jörg Blaurock

Vorsitzender des Aufsichtsrates: St Dr. Georg Schütte
Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS-OST layout, number of OSTs

2017-10-22 Thread Thomas Roth

Hi Patrick,

thanks for the clarification. One thing less to worry about.
Since our users mainly do small file I/O, it would seem that the random I/O IOPS numbers are the relevant quantity - and there the smaller OSTs are the 
fast ones ;-)


Cheers
Thomas

On 22.10.2017 20:21, Patrick Farrell wrote:

Thomas,

This is likely a reflection of an older issue, since resolved.  For a long 
time, Lustre reserved max_rpcs_in_flight*max_pages_per_rpc for each OST (on the 
client).  This was a huge memory commitment in larger setups, but was resolved 
a few versions back, and now per OST memory usage on the client is pretty 
trivial when the client isn’t doing I/O to that OST.  The main arguments 
against large OST counts are probably the pain of managing larger numbers of 
them, and individual OSTs being slow (because they use fewer disks), requiring 
users to stripe files more widely to see the benefit.  This is both an 
administrative burden for users and uses more space on the metadata server to 
track the file layouts.

But if your MDT is large and your users amenable to thinking about that (or you 
set a good default striping policy - progressive file layouts from 2.10 are 
wonderful for this), then it’s probably fine.  The largest OST counts I am 
aware of are in the low thousands.

Ah, one more thing - clients must ping every OST periodically if they haven’t 
otherwise contacted it within the required interval.  This can contribute to 
network traffic and CPU noise/jitter on the clients.  I don’t have a good sense 
of how serious this is in practice, but I know some larger sites worry about it.

- Patrick



From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of Thomas 
Roth <t.r...@gsi.de>
Sent: Sunday, October 22, 2017 9:04:35 AM
To: Lustre Discuss
Subject: [lustre-discuss] ZFS-OST layout, number of OSTs

Hi all,

I have done some "fio" benchmarking, amongst other things to test the 
proposition that to get more iops, the number of disks per raidz should be less.
I was happy I could reproduce that: one server with 30 disks in one raidz2 
(=one zpool = one OST) is indeed slower than one with 30 disks in three
raidz2 (one zpool, one OST).
I ran fio also on a third server where the 30 disks make up 3 raidz2 = 3 zpools 
= 3 OSTs, that one is faster still.

Now I seem to remember a warning not to have too many OSTs in one Lustre, 
because each OST eats some memory on the client. I haven't found that
reference, and I would like to ask what the critical numbers might be? How much RAM are 
we talking about? Is there any other "wise" limit on the OST
number?
Currently our clients are equipped with 128 or 256 GB RAM.  We have 550 OSTs in 
the system, but the next cluster could easily grow much larger here if
we stick to the small OSTs.

Regards,
Thomas
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



--

Thomas Roth
Department: HPC
Location: SB3 1.262
Phone: +49-6159-71 1453  Fax: +49-6159-71 2986

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1
64291 Darmstadt
www.gsi.de

Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Darmstadt
Handelsregister: Amtsgericht Darmstadt, HRB 1528

Geschäftsführung: Professor Dr. Paolo Giubellino
Ursula Weyrich
Jörg Blaurock

Vorsitzender des Aufsichtsrates: St Dr. Georg Schütte
Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS-OST layout, number of OSTs

2017-10-22 Thread Patrick Farrell
Thomas,

This is likely a reflection of an older issue, since resolved.  For a long 
time, Lustre reserved max_rpcs_in_flight*max_pages_per_rpc for each OST (on the 
client).  This was a huge memory commitment in larger setups, but was resolved 
a few versions back, and now per OST memory usage on the client is pretty 
trivial when the client isn’t doing I/O to that OST.  The main arguments 
against large OST counts are probably the pain of managing larger numbers of 
them, and individual OSTs being slow (because they use fewer disks), requiring 
users to stripe files more widely to see the benefit.  This is both an 
administrative burden for users and uses more space on the metadata server to 
track the file layouts.
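
The tunables in question can be inspected per OSC on a client, e.g.:

lctl get_param osc.*.max_rpcs_in_flight
lctl get_param osc.*.max_pages_per_rpc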

But if your MDT is large and your users amenable to thinking about that (or you 
set a good default striping policy - progressive file layouts from 2.10 are 
wonderful for this), then it’s probably fine.  The largest OST counts I am 
aware of are in the low thousands.

Ah, one more thing - clients must ping every OST periodically if they haven’t 
otherwise contacted it within the required interval.  This can contribute to 
network traffic and CPU noise/jitter on the clients.  I don’t have a good sense 
of how serious this is in practice, but I know some larger sites worry about it.

- Patrick



From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Thomas Roth <t.r...@gsi.de>
Sent: Sunday, October 22, 2017 9:04:35 AM
To: Lustre Discuss
Subject: [lustre-discuss] ZFS-OST layout, number of OSTs

Hi all,

I have done some "fio" benchmarking, amongst other things to test the 
proposition that to get more iops, the number of disks per raidz should be less.
I was happy I could reproduce that: one server with 30 disks in one raidz2 
(=one zpool = one OST) is indeed slower than one with 30 disks in three
raidz2 (one zpool, one OST).
I ran fio also on a third server where the 30 disks make up 3 raidz2 = 3 zpools 
= 3 OSTs, that one is faster still.

Now I seem to remember a warning not to have too many OSTs in one Lustre, 
because each OST eats some memory on the client. I haven't found that
reference, and I would like to ask what the critical numbers might be? How much 
RAM are we talking about? Is there any other "wise" limit on the OST
number?
Currently our clients are equipped with 128 or 256 GB RAM.  We have 550 OSTs in 
the system, but the next cluster could easily grow much larger here if
we stick to the small OSTs.

Regards,
Thomas
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS PANIC

2017-02-18 Thread Bob Ball
Yes, it sure sounds like I'm going to have to resurrect my test system 
and upgrade to the 2.9 release.  There just does not seem to be a way 
around this.


bob

On 2/18/2017 12:40 AM, Dilger, Andreas wrote:

Just a note - Lustre 2.7.58 is a random weekly development tag (like anything 
between .50 and .90), so you would be better off to update to the latest 
release (e.g. 2.8.0 or 2.9.0), which will have had much more testing.

Likewise, ZFS 0.6.4.x is quite old and many fixes have gone into ZFS 0.6.5.x.

Cheers, Andreas


On Feb 17, 2017, at 19:21, Bob Ball  wrote:

No luck, removed all files, destroyed the zpool, replaced all physical disks, 
and upon re-creation, the zfs PANIC will strike once the number of clients 
attempting accesses exceeds something just under 200.

Other OST on this OSS do not suffer from this.

Suggestions, anyone?  At this point, it seems as if it must be MDT/MGS related, by the old 
"what else could it be?" argument.

Is this OST index a dead loss?  Fix this index, or destroy forever and 
introduce a new OST?

bob


On 2/13/2017 1:00 PM, Bob Ball wrote:
OK, so, I tried some new system mounts today, and each time the new client attempts to 
mount, the zfs PANIC throws.  This from 2 separate client machines.  It seems clear from 
the responsiveness problem last week that it is impacting a single OST.  After it 
happens, I power cycle the OSS because it will not shut down cleanly, and it comes back 
fine (I have pre-cycled the system where I tried the mount).  The OSS is quiet, no 
excessive traffic or load, so that does not match up with some Google searches I found on 
this, where the OSS was under heavy load, and a fix was purported to be found in an 
earlier version of this zfsonlinux.  The OST I suspect of being at the heart of this is 
always the last to finish connecting as evidenced by the "lctl dl" count of 
connections.

As I don't know what else to do, I am draining this OST and will 
reformat/re-create it upon completion using spare disks.  It would be nice 
though if someone had a better way to fix this, or could truly point to a 
reason why this is consistently happening now.

bob



On 2/10/2017 11:23 AM, Bob Ball wrote:
Well, I find this odd, to say the least. All of this below was from yesterday, 
and persisted through a couple of reboots.  Today, shortly after I sent this, I 
found all the disks idle, but this one OST out of 6 totally unresponsive, so I 
power cycled the system, and it came up just fine.  No issues, no complaints, 
responsive.  So I have no idea why this healed itself.

Can anyone enlighten me?

I _think_ that what triggered this was adding a few more client mounts of the 
lustre file system.  That's when it all went wrong. Is this helpful?  Or just a 
coincidence?  Current state:
18 UP obdfilter umt3B-OST000f umt3B-OST000f_UUID 403

bob


On 2/10/2017 9:39 AM, Bob Ball wrote:
Hi,

I am getting this message

PANIC: zfs: accessing past end of object 29/7 (size=33792 access=33792+128)

The affected OST seems to reject new mounts from clients now, and the lctl dl 
count of connections to the obdfilter process increases, but does not seem to 
decrease?

This is Lustre 2.7.58 with zfs 0.6.4.2

Can anyone help me diagnose and fix whatever is going wrong here? I've included 
the stack dump below.

Thanks,
bob


2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781874] Showing 
stack for process 24449
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781876] Pid: 
24449, comm: ll_ost00_078 Tainted: P   ---
2.6.32.504.16.2.el6_lustre #7
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781878] Call 
Trace:
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781902]  
[] ? spl_dumpstack+0x3d/0x40 [spl]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781908]  
[] ? vcmn_err+0x8d/0xf0 [spl]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781950]  
[] ? RW_WRITE_HELD+0x66/0xb0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781970]  
[] ? dbuf_rele_and_unlock+0x268/0x3f0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781991]  
[] ? dbuf_read+0x5ca/0x8a0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782024]  
[] ? zfs_panic_recover+0x52/0x60 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782045]  
[] ? dmu_buf_hold_array_by_dnode+0x41b/0x560 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782068]  
[] ? dmu_buf_hold_array+0x65/0x90 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782090]  
[] ? dmu_write+0x68/0x1a0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782147]  
[] ? lprocfs_oh_tally+0x2e/0x50 [obdclass]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782173]  
[] ? osd_write+0x1d1/0x390 [osd_zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782206] 

Re: [lustre-discuss] ZFS PANIC

2017-02-17 Thread Dilger, Andreas
Just a note - Lustre 2.7.58 is a random weekly development tag (like anything 
between .50 and .90), so you would be better off to update to the latest 
release (e.g. 2.8.0 or 2.9.0), which will have had much more testing. 

Likewise, ZFS 0.6.4.x is quite old and many fixes have gone into ZFS 0.6.5.x. 

Cheers, Andreas

> On Feb 17, 2017, at 19:21, Bob Ball  wrote:
> 
> No luck, removed all files, destroyed the zpool, replaced all physical disks, 
> and upon re-creation, the zfs PANIC will strike once the number of clients 
> attempting accesses exceeds something just under 200.
> 
> Other OST on this OSS do not suffer from this.
> 
> Suggestions, anyone?  At this point, it seems as if it must be MDT/MGS 
> related, by the old "what else could it be?" argument.
> 
> Is this OST index a dead loss?  Fix this index, or destroy forever and 
> introduce a new OST?
> 
> bob
> 
>> On 2/13/2017 1:00 PM, Bob Ball wrote:
>> OK, so, I tried some new system mounts today, and each time the new client 
>> attempts to mount, the zfs PANIC throws.  This from 2 separate client 
>> machines.  It seems clear from the responsiveness problem last week that it 
>> is impacting a single OST.  After it happens, I power cycle the OSS because 
>> it will not shut down cleanly, and it comes back fine (I have pre-cycled the 
>> system where I tried the mount).  The OSS is quiet, no excessive traffic or 
>> load, so that does not match up with some Google searches I found on this, 
>> where the OSS was under heavy load, and a fix was purported to be found in 
>> an earlier version of this zfsonlinux.  The OST I suspect of being at the 
>> heart of this is always the last to finish connecting as evidenced by the 
>> "lcdl dl" count of connections.
>> 
>> As I don't know what else to do, I am draining this OST and will 
>> reformat/re-create it upon completion using spare disks.  It would be nice 
>> though if someone had a better way to fix this, or could truly point to a 
>> reason why this is consistently happening now.
>> 
>> bob
>> 
>> 
>>> On 2/10/2017 11:23 AM, Bob Ball wrote:
>>> Well, I find this odd, to say the least. All of this below was from 
>>> yesterday, and persisted through a couple of reboots.  Today, shortly after 
>>> I sent this, I found all the disks idle, but this one OST out of 6 totally 
>>> unresponsive, so I power cycled the system, and it came up just fine.  No 
>>> issues, no complaints, responsive.  So I have no idea why this healed 
>>> itself.
>>> 
>>> Can anyone enlighten me?
>>> 
>>> I _think_ that what triggered this was adding a few more client mounts of 
>>> the lustre file system.  That's when it all went wrong. Is this helpful?  
>>> Or just a coincidence?  Current state:
>>> 18 UP obdfilter umt3B-OST000f umt3B-OST000f_UUID 403
>>> 
>>> bob
>>> 
 On 2/10/2017 9:39 AM, Bob Ball wrote:
 Hi,
 
 I am getting this message
 
 PANIC: zfs: accessing past end of object 29/7 (size=33792 access=33792+128)
 
 The affected OST seems to reject new mounts from clients now, and the lctl 
 dl count of connections to the obdfilter process increases, but does not 
 seem to decrease?
 
 This is Lustre 2.7.58 with zfs 0.6.4.2
 
 Can anyone help me diagnose and fix whatever is going wrong here? I've 
 included the stack dump below.
 
 Thanks,
 bob
 
 
 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781874] 
 Showing stack for process 24449
 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781876] 
 Pid: 24449, comm: ll_ost00_078 Tainted: P   ---
 2.6.32.504.16.2.el6_lustre #7
 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781878] 
 Call Trace:
 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781902]  
 [] ? spl_dumpstack+0x3d/0x40 [spl]
 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781908]  
 [] ? vcmn_err+0x8d/0xf0 [spl]
 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781950]  
 [] ? RW_WRITE_HELD+0x66/0xb0 [zfs]
 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781970]  
 [] ? dbuf_rele_and_unlock+0x268/0x3f0 [zfs]
 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781991]  
 [] ? dbuf_read+0x5ca/0x8a0 [zfs]
 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782024]  
 [] ? zfs_panic_recover+0x52/0x60 [zfs]
 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782045]  
 [] ? dmu_buf_hold_array_by_dnode+0x41b/0x560 [zfs]
 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782068]  
 [] ? dmu_buf_hold_array+0x65/0x90 [zfs]
 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782090]  
 [] ? dmu_write+0x68/0x1a0 [zfs]
 2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782147]  
 [] ? lprocfs_oh_tally+0x2e/0x50 

Re: [lustre-discuss] ZFS PANIC

2017-02-17 Thread Bob Ball
No luck, removed all files, destroyed the zpool, replaced all physical 
disks, and upon re-creation, the zfs PANIC will strike once the number 
of clients attempting accesses exceeds something just under 200.


Other OST on this OSS do not suffer from this.

Suggestions, anyone?  At this point, it seems as if it must be MDT/MGS 
related, by the old "what else could it be?" argument.


Is this OST index a dead loss?  Fix this index, or destroy forever and 
introduce a new OST?


bob

On 2/13/2017 1:00 PM, Bob Ball wrote:
OK, so, I tried some new system mounts today, and each time the new 
client attempts to mount, the zfs PANIC throws.  This from 2 separate 
client machines.  It seems clear from the responsiveness problem last 
week that it is impacting a single OST.  After it happens, I power 
cycle the OSS because it will not shut down cleanly, and it comes back 
fine (I have pre-cycled the system where I tried the mount).  The OSS 
is quiet, no excessive traffic or load, so that does not match up with 
some Google searches I found on this, where the OSS was under heavy 
load, and a fix was purported to be found in an earlier version of 
this zfsonlinux.  The OST I suspect of being at the heart of this is 
always the last to finish connecting as evidenced by the "lcdl dl" 
count of connections.


As I don't know what else to do, I am draining this OST and will 
reformat/re-create it upon completion using spare disks.  It would be 
nice though if someone had a better way to fix this, or could truly 
point to a reason why this is consistently happening now.


bob


On 2/10/2017 11:23 AM, Bob Ball wrote:
Well, I find this odd, to say the least. All of this below was from 
yesterday, and persisted through a couple of reboots.  Today, shortly 
after I sent this, I found all the disks idle, but this one OST out 
of 6 totally unresponsive, so I power cycled the system, and it came 
up just fine.  No issues, no complaints, responsive.  So I have no 
idea why this healed itself.


Can anyone enlighten me?

I _think_ that what triggered this was adding a few more client 
mounts of the lustre file system.  That's when it all went wrong. Is 
this helpful?  Or just a coincidence?  Current state:

 18 UP obdfilter umt3B-OST000f umt3B-OST000f_UUID 403

bob

On 2/10/2017 9:39 AM, Bob Ball wrote:

Hi,

I am getting this message

PANIC: zfs: accessing past end of object 29/7 (size=33792 
access=33792+128)


The affected OST seems to reject new mounts from clients now, and 
the lctl dl count of connections to the obdfilter process increases, 
but does not seem to decrease?


This is Lustre 2.7.58 with zfs 0.6.4.2

Can anyone help me diagnose and fix whatever is going wrong here? 
I've included the stack dump below.


Thanks,
bob


2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
[11630254.781874] Showing stack for process 24449
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
[11630254.781876] Pid: 24449, comm: ll_ost00_078 Tainted: 
P   ---2.6.32.504.16.2.el6_lustre #7
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
[11630254.781878] Call Trace:
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
[11630254.781902]  [] ? spl_dumpstack+0x3d/0x40 [spl]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
[11630254.781908]  [] ? vcmn_err+0x8d/0xf0 [spl]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
[11630254.781950]  [] ? RW_WRITE_HELD+0x66/0xb0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
[11630254.781970]  [] ? 
dbuf_rele_and_unlock+0x268/0x3f0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
[11630254.781991]  [] ? dbuf_read+0x5ca/0x8a0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
[11630254.782024]  [] ? 
zfs_panic_recover+0x52/0x60 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
[11630254.782045]  [] ? 
dmu_buf_hold_array_by_dnode+0x41b/0x560 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
[11630254.782068]  [] ? 
dmu_buf_hold_array+0x65/0x90 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
[11630254.782090]  [] ? dmu_write+0x68/0x1a0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
[11630254.782147]  [] ? lprocfs_oh_tally+0x2e/0x50 
[obdclass]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
[11630254.782173]  [] ? osd_write+0x1d1/0x390 
[osd_zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
[11630254.782206]  [] ? dt_record_write+0x3d/0x130 
[obdclass]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
[11630254.782305]  [] ? 
tgt_client_data_write+0x165/0x1b0 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
[11630254.782347]  [] ? 
tgt_client_data_update+0x335/0x680 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
[11630254.782388]  [] ? tgt_client_new+0x3d8/0x6a0 
[ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 
[11630254.782407]  [] ? 
ofd_obd_connect+0x363/0x400 [ofd]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: 

Re: [lustre-discuss] ZFS PANIC

2017-02-13 Thread Bob Ball
OK, so, I tried some new system mounts today, and each time the new 
client attempts to mount, the zfs PANIC throws.  This from 2 separate 
client machines.  It seems clear from the responsiveness problem last 
week that it is impacting a single OST.  After it happens, I power cycle 
the OSS because it will not shut down cleanly, and it comes back fine (I 
have pre-cycled the system where I tried the mount).  The OSS is quiet, 
no excessive traffic or load, so that does not match up with some Google 
searches I found on this, where the OSS was under heavy load, and a fix 
was purported to be found in an earlier version of this zfsonlinux.  The 
OST I suspect of being at the heart of this is always the last to finish 
connecting as evidenced by the "lctl dl" count of connections.


As I don't know what else to do, I am draining this OST and will 
reformat/re-create it upon completion using spare disks.  It would be 
nice though if someone had a better way to fix this, or could truly 
point to a reason why this is consistently happening now.
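
Roughly the usual drain recipe, assuming a single MDT and a client mount at
/lustre (both placeholders here):

lctl set_param osp.umt3B-OST000f-osc-MDT0000.max_create_count=0   # on the MDS
lfs find /lustre --ost umt3B-OST000f_UUID | lfs_migrate -y        # on a client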


bob


On 2/10/2017 11:23 AM, Bob Ball wrote:
Well, I find this odd, to say the least.  All of this below was from 
yesterday, and persisted through a couple of reboots.  Today, shortly 
after I sent this, I found all the disks idle, but this one OST out of 
6 totally unresponsive, so I power cycled the system, and it came up 
just fine.  No issues, no complaints, responsive.  So I have no 
idea why this healed itself.


Can anyone enlighten me?

I _think_ that what triggered this was adding a few more client mounts 
of the lustre file system.  That's when it all went wrong. Is this 
helpful?  Or just a coincidence?  Current state:

 18 UP obdfilter umt3B-OST000f umt3B-OST000f_UUID 403

bob

On 2/10/2017 9:39 AM, Bob Ball wrote:

Hi,

I am getting this message

PANIC: zfs: accessing past end of object 29/7 (size=33792 
access=33792+128)


The affected OST seems to reject new mounts from clients now, and the 
lctl dl count of connections to the obdfilter process increases, but 
does not seem to decrease?


This is Lustre 2.7.58 with zfs 0.6.4.2

Can anyone help me diagnose and fix whatever is going wrong here? 
I've included the stack dump below.


Thanks,
bob


2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781874] Showing stack for process 24449
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781876] Pid: 24449, comm: ll_ost00_078 Tainted: P   ---2.6.32.504.16.2.el6_lustre #7
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781878] Call Trace:
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781902]  [] ? spl_dumpstack+0x3d/0x40 [spl]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781908]  [] ? vcmn_err+0x8d/0xf0 [spl]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781950]  [] ? RW_WRITE_HELD+0x66/0xb0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781970]  [] ? dbuf_rele_and_unlock+0x268/0x3f0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781991]  [] ? dbuf_read+0x5ca/0x8a0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782024]  [] ? zfs_panic_recover+0x52/0x60 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782045]  [] ? dmu_buf_hold_array_by_dnode+0x41b/0x560 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782068]  [] ? dmu_buf_hold_array+0x65/0x90 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782090]  [] ? dmu_write+0x68/0x1a0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782147]  [] ? lprocfs_oh_tally+0x2e/0x50 [obdclass]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782173]  [] ? osd_write+0x1d1/0x390 [osd_zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782206]  [] ? dt_record_write+0x3d/0x130 [obdclass]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782305]  [] ? tgt_client_data_write+0x165/0x1b0 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782347]  [] ? tgt_client_data_update+0x335/0x680 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782388]  [] ? tgt_client_new+0x3d8/0x6a0 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782407]  [] ? ofd_obd_connect+0x363/0x400 [ofd]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782443]  [] ? target_handle_connect+0xe58/0x2d30 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782450]  [] ? enqueue_entity+0x125/0x450
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782457]  [] ? check_preempt_curr+0x7c/0x90
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782462]  [] ? try_to_wake_up+0x24e/0x3e0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782481]  [] ? lc_watchdog_touch+0x7a/0x190 [libcfs]

Re: [lustre-discuss] ZFS PANIC

2017-02-10 Thread Bob Ball
Well, I find this odd, to say the least.  All of this below was from 
yesterday, and persisted through a couple of reboots.  Today, shortly 
after I sent this, I found all the disks idle, but this one OST out of 6 
totally unresponsive, so I power cycled the system, and it came up just 
fine.  No issues, no complaints, responsive... So I have no idea why 
this healed itself.


Can anyone enlighten me?

I _think_ that what triggered this was adding a few more client mounts 
of the lustre file system.  That's when it all went wrong. Is this 
helpful?  Or just a coincidence?  Current state:

 18 UP obdfilter umt3B-OST000f umt3B-OST000f_UUID 403

bob

On 2/10/2017 9:39 AM, Bob Ball wrote:

Hi,

I am getting this message

PANIC: zfs: accessing past end of object 29/7 (size=33792 access=33792+128)


The affected OST seems to reject new mounts from clients now, and the 
lctl dl count of connections to the obdfilter process increases, but 
does not seem to decrease?


This is Lustre 2.7.58 with zfs 0.6.4.2

Can anyone help me diagnose and fix whatever is going wrong here? I've 
included the stack dump below.


Thanks,
bob


2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781874] Showing stack for process 24449
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781876] Pid: 24449, comm: ll_ost00_078 Tainted: P   ---2.6.32.504.16.2.el6_lustre #7
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781878] Call Trace:
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781902]  [] ? spl_dumpstack+0x3d/0x40 [spl]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781908]  [] ? vcmn_err+0x8d/0xf0 [spl]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781950]  [] ? RW_WRITE_HELD+0x66/0xb0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781970]  [] ? dbuf_rele_and_unlock+0x268/0x3f0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781991]  [] ? dbuf_read+0x5ca/0x8a0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782024]  [] ? zfs_panic_recover+0x52/0x60 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782045]  [] ? dmu_buf_hold_array_by_dnode+0x41b/0x560 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782068]  [] ? dmu_buf_hold_array+0x65/0x90 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782090]  [] ? dmu_write+0x68/0x1a0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782147]  [] ? lprocfs_oh_tally+0x2e/0x50 [obdclass]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782173]  [] ? osd_write+0x1d1/0x390 [osd_zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782206]  [] ? dt_record_write+0x3d/0x130 [obdclass]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782305]  [] ? tgt_client_data_write+0x165/0x1b0 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782347]  [] ? tgt_client_data_update+0x335/0x680 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782388]  [] ? tgt_client_new+0x3d8/0x6a0 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782407]  [] ? ofd_obd_connect+0x363/0x400 [ofd]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782443]  [] ? target_handle_connect+0xe58/0x2d30 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782450]  [] ? enqueue_entity+0x125/0x450
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782457]  [] ? check_preempt_curr+0x7c/0x90
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782462]  [] ? try_to_wake_up+0x24e/0x3e0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782481]  [] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782524]  [] ? tgt_request_handle+0x5b2/0x1230 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782564]  [] ? ptlrpc_main+0xe41/0x1920 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782570]  [] ? sched_clock+0x9/0x10
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782576]  [] ? thread_return+0x4e/0x7d0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782615]  [] ? ptlrpc_main+0x0/0x1920 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782622]  [] ? kthread+0x9e/0xc0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782626]  [] ? kthread+0x0/0xc0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782632]  [] ? child_rip+0xa/0x20
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782636]  [] ? kthread+0x0/0xc0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782641]  [] ? child_rip+0x0/0x20



Later, that same process showed:
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 

[lustre-discuss] ZFS PANIC

2017-02-10 Thread Bob Ball

Hi,

I am getting this message

PANIC: zfs: accessing past end of object 29/7 (size=33792 access=33792+128)
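
(Reading the numbers: a 128-byte access starting at offset 33792 runs to 
33792 + 128 = 33920, i.e. 128 bytes past the 33792-byte size ZFS has recorded 
for that object.)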

The affected OST seems to reject new mounts from clients now, and the 
lctl dl count of connections to the obdfilter process increases, but 
does not seem to decrease?
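
For reference, the trailing figure on each "lctl dl" line (403 in the snapshot 
quoted elsewhere in this thread) is the device reference count, so the growth 
can be watched with something like this sketch (interval illustrative):

   watch -n 10 'lctl dl | grep obdfilter'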


This is Lustre 2.7.58 with zfs 0.6.4.2

Can anyone help me diagnose and fix whatever is going wrong here? I've 
included the stack dump below.


Thanks,
bob


2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781874] Showing stack for process 24449
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781876] Pid: 24449, comm: ll_ost00_078 Tainted: P   ---2.6.32.504.16.2.el6_lustre #7
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781878] Call Trace:
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781902]  [] ? spl_dumpstack+0x3d/0x40 [spl]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781908]  [] ? vcmn_err+0x8d/0xf0 [spl]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781950]  [] ? RW_WRITE_HELD+0x66/0xb0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781970]  [] ? dbuf_rele_and_unlock+0x268/0x3f0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.781991]  [] ? dbuf_read+0x5ca/0x8a0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782024]  [] ? zfs_panic_recover+0x52/0x60 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782045]  [] ? dmu_buf_hold_array_by_dnode+0x41b/0x560 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782068]  [] ? dmu_buf_hold_array+0x65/0x90 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782090]  [] ? dmu_write+0x68/0x1a0 [zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782147]  [] ? lprocfs_oh_tally+0x2e/0x50 [obdclass]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782173]  [] ? osd_write+0x1d1/0x390 [osd_zfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782206]  [] ? dt_record_write+0x3d/0x130 [obdclass]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782305]  [] ? tgt_client_data_write+0x165/0x1b0 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782347]  [] ? tgt_client_data_update+0x335/0x680 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782388]  [] ? tgt_client_new+0x3d8/0x6a0 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782407]  [] ? ofd_obd_connect+0x363/0x400 [ofd]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782443]  [] ? target_handle_connect+0xe58/0x2d30 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782450]  [] ? enqueue_entity+0x125/0x450
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782457]  [] ? check_preempt_curr+0x7c/0x90
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782462]  [] ? try_to_wake_up+0x24e/0x3e0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782481]  [] ? lc_watchdog_touch+0x7a/0x190 [libcfs]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782524]  [] ? tgt_request_handle+0x5b2/0x1230 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782564]  [] ? ptlrpc_main+0xe41/0x1920 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782570]  [] ? sched_clock+0x9/0x10
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782576]  [] ? thread_return+0x4e/0x7d0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782615]  [] ? ptlrpc_main+0x0/0x1920 [ptlrpc]
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782622]  [] ? kthread+0x9e/0xc0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782626]  [] ? kthread+0x0/0xc0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782632]  [] ? child_rip+0xa/0x20
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782636]  [] ? kthread+0x0/0xc0
2017-02-08T23:02:23-05:00 umdist01.aglt2.org kernel: [11630254.782641]  [] ? child_rip+0x0/0x20



Later, that same process showed:
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773156] LNet: Service thread pid 24449 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773163] Pid: 24449, comm: ll_ost00_078
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773164]
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773165] Call Trace:
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773181]  [] ? show_trace_log_lvl+0x55/0x70
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: [11630454.773194]  [] ? dump_stack+0x6f/0x76
2017-02-08T23:05:43-05:00 umdist01.aglt2.org kernel: 

Re: [lustre-discuss] ZFS not freeing disk space

2016-08-10 Thread Alexander I Kulyavtsev
"Deleting Files Doesn't Free Space"
   https://github.com/zfsonlinux/zfs/issues/1188

"Deleting Files Doesn't Free Space, unless I unmount the filesystem"
https://github.com/zfsonlinux/zfs/issues/1548

There are more references listed on pages above.

Since 0.6.3, a possible workaround is to set zfs xattr=sa, but that does 
not help with existing files on the OST.
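
For reference, a minimal sketch of that workaround (dataset name illustrative); 
as noted, it only affects files created after the change:

   # store xattrs in the dnode instead of a hidden xattr directory
   zfs set xattr=sa ostpool/ost0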

Alex.
P.S. Apparently there's more than one way to leak the space.


On Aug 10, 2016, at 6:07 PM, Alexander I Kulyavtsev 
> wrote:

It can be zfs snapshot holding space on ost.

Or it can be a zfs issue, zfs not releasing space until reboot. Check the 
zfs bugs on the zfs wiki.
Lustre shows change in OST used space right away.

zfs 0.6.3 is pretty old. We are using 0.6.4.1  with lustre 2.5.3. (there is  
zfs 0.6.4.2)
You may need to patch lustre 2.5.3 to go with some zfs 0.6.5.x;  patches were 
listed on this mail list.
Alex.

On Aug 10, 2016, at 12:57 PM, Thomas Roth > 
wrote:

Hi all,

one of our (Lustre 2.5.3, ZFS 0.6.3) OSTs got filled up to >90%, so I 
deactivated it and am now migrating files off of that OST.

Checking the list of files I am currently using, I can verify that the 
migration is working: Lustre tells me that the top of the list is already on 
some other OSTs, the bottom of the list still resides on the OST in question.

But when I do either 'lfs df' or 'df' on the OSS, I don't see any change in 
terms of bytes, while the migrated files already sum up to several GB.

Is this a special feature of ZFS, or just a symptom of a broken OST?


I think I have seen this behavior before, and the "df" result shrank to an 
expected value after the server had been rebooted. In that case, this seems 
more like an overly persistent caching effect?

Cheers,
Thomas

--

Thomas Roth
Department: Informationstechnologie
Location: SB3 1.250
Phone: +49-6159-71 1453  Fax: +49-6159-71 2986

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1
64291 Darmstadt
www.gsi.de

Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Darmstadt
Handelsregister: Amtsgericht Darmstadt, HRB 1528

Geschäftsführung: Ursula Weyrich
Professor Dr. Karlheinz Langanke
Jörg Blaurock

Vorsitzende des Aufsichtsrates: St Dr. Georg Schütte
Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS not freeing disk space

2016-08-10 Thread Kilian Cavalotti
Hi Thomas,

On Wed, Aug 10, 2016 at 10:57 AM, Thomas Roth  wrote:
> one of our (Lustre 2.5.3, ZFS 0.6.3) OSTs got filled up to >90%, so I
> deactivated it and am now migrating files off of that OST.
>
> But when I do either 'lfs df' or 'df' on the OSS, I don't see any change
> in terms of bytes, while the migrated files already sum up to several GB.

It's very likely because your OST is deactivated, ie. disconnected
from the MDS, and thus freed up space is not accounted for. When you
reactivate your OST, it will reconnect to the MDS, which will start
cleaning up orphan inodes (ie. inodes that still exist on the OST but
are not referenced by any file on the MDT anymore). You should see
messages like "lustre-OST: deleting orphan objects from
0x0:180570872 to 0x0:180570891" when this happens.

That's actually how it's supposed to work, but there are some
limitations in 2.5 that may require a restart of the MDS. See
https://jira.hpdd.intel.com/browse/LU-7012 for details.
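
For reference, a minimal sketch of the reactivation step (device number 
illustrative; take it from the first column of 'lctl dl' on the MDS):

   lctl dl | grep OST0007        # find the MDS-side device for that OST
   lctl --device 15 activate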

And of course, as soon as you re-activate your OST, new files will be
created on it, so it may skew the counters the other way.
But AFAIK, it's not specific to ZFS at all.

Cheers,
-- 
Kilian
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS not freeing disk space

2016-08-10 Thread Alexander I Kulyavtsev
It can be zfs snapshot holding space on ost.

Or it can be a zfs issue, zfs not releasing space until reboot. Check the 
zfs bugs on the zfs wiki.
Lustre shows change in OST used space right away.

zfs 0.6.3 is pretty old. We are using 0.6.4.1  with lustre 2.5.3. (there is  
zfs 0.6.4.2)
You may need to patch lustre 2.5.3 to go with some zfs 0.6.5.x;  patches were 
listed on this mail list. 
Alex.

On Aug 10, 2016, at 12:57 PM, Thomas Roth  wrote:

> Hi all,
> 
> one of our (Lustre 2.5.3, ZFS 0.6.3) OSTs got filled up to >90%, so I 
> deactivated it and am now migrating files off of that OST.
> 
> Checking the list of files I am currently using, I can verify that the 
> migration is working: Lustre tells me that the top of the list is already on 
> some other OSTs, the bottom of the list still resides on the OST in question.
> 
> But when I do either 'lfs df' or 'df' on the OSS, I don't see any change in 
> terms of bytes, while the migrated files already sum up to several GB.
> 
> Is this a special feature of ZFS, or just a symptom of a broken OST?
> 
> 
> I think I have seen this behavior before, and the "df" result shrank to an 
> expected value after the server had been rebooted. In that case, this seems 
> more like an overly persistent caching effect?
> 
> Cheers,
> Thomas
> 
> -- 
> 
> Thomas Roth
> Department: Informationstechnologie
> Location: SB3 1.250
> Phone: +49-6159-71 1453  Fax: +49-6159-71 2986
> 
> GSI Helmholtzzentrum für Schwerionenforschung GmbH
> Planckstraße 1
> 64291 Darmstadt
> www.gsi.de
> 
> Gesellschaft mit beschränkter Haftung
> Sitz der Gesellschaft: Darmstadt
> Handelsregister: Amtsgericht Darmstadt, HRB 1528
> 
> Geschäftsführung: Ursula Weyrich
> Professor Dr. Karlheinz Langanke
> Jörg Blaurock
> 
> Vorsitzende des Aufsichtsrates: St Dr. Georg Schütte
> Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS not freeing disk space

2016-08-10 Thread Bob Ball
It is my understanding that when you set the OST deactivated, then you 
also don't get updates on the used space either, as of some recent 
version of Lustre.  It has never been clear to me, though, whether a 
simple re-activation is enough to run the logs, or whether a reboot is 
required after re-activation.


bob

On 8/10/2016 1:57 PM, Thomas Roth wrote:

Hi all,

one of our (Lustre 2.5.3, ZFS 0.6.3) OSTs got filled up to >90%, so I 
deactivated it and am now migrating files off of that OST.


Checking the list of files I am currently using, I can verify that the 
migration is working: Lustre tells me that the top of the list is 
already on some other OSTs, the bottom of the list still resides on 
the OST in question.


But when I do either 'lfs df' or 'df' on the OSS, I don't see any 
change in terms of bytes, while the migrated files already sum up to 
several GB.


Is this a special feature of ZFS, or just a symptom of a broken OST?


I think I have seen this behavior before, and the "df" result shrank 
to an expected value after the server had been rebooted. In that case, 
this seems more like an overly persistent caching effect?


Cheers,
Thomas



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] ZFS not freeing disk space

2016-08-10 Thread Thomas Roth

Hi all,

one of our (Lustre 2.5.3, ZFS 0.6.3) OSTs got filled up to >90%, so I deactivated it and am now 
migrating files off of that OST.


Checking the list of files I am currently using, I can verify that the migration is working: Lustre 
tells me that the top of the list is already on some other OSTs, the bottom of the list still resides 
on the OST in question.
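
For reference, one way to run that per-file check (path illustrative):

   lfs getstripe -i /lustre/some/file   # prints the index of the OST holding the first object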


But when I do either 'lfs df' or 'df' on the OSS, I don't see any change in terms of bytes, while 
the migrated files already sum up to several GB.


Is this a special feature of ZFS, or just a symptom of a broken OST?


I think I have seen this behavior before, and the "df" result shrank to an expected value after the 
server had been rebooted. In that case, this seems more like an overly persistent caching effect?


Cheers,
Thomas

--

Thomas Roth
Department: Informationstechnologie
Location: SB3 1.250
Phone: +49-6159-71 1453  Fax: +49-6159-71 2986

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1
64291 Darmstadt
www.gsi.de

Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Darmstadt
Handelsregister: Amtsgericht Darmstadt, HRB 1528

Geschäftsführung: Ursula Weyrich
Professor Dr. Karlheinz Langanke
Jörg Blaurock

Vorsitzende des Aufsichtsrates: St Dr. Georg Schütte
Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] ZFS backed OSS out of memory

2016-06-23 Thread Carlson, Timothy S
Folks,

I've done my fair share of googling and run across some good information on ZFS 
backed Lustre tuning including this:

http://lustre.ornl.gov/ecosystem-2016/documents/tutorials/Stearman-LLNL-ZFS.pdf

and various discussions around how to limit (or not) the ARC and clear it if 
needed.

That being said, here is my configuration.

RHEL 6 
Kernel 2.6.32-504.3.3.el6.x86_64
ZFS 0.6.3
Lustre 2.5.3 with a couple of patches
Single OST per OSS with 4 x RAIDZ2 4TB SAS drives
Log and Cache on separate SSDs
These OSSes are beefy with 128GB of memory and Dual E5-2630 v2 CPUs

 About 30 OSSes in all serving mostly a standard HPC cluster over FDR IB with a 
sprinkle of 10G

# more /etc/modprobe.d/lustre.conf
options lnet networks=o2ib9,tcp9(eth0)

ZFS backed MDS with same software stack.

The problem I am having is the OOM killer is whacking away at system processes 
on a few of the OSSes. 

"top" shows all my memory is in use with very little Cache or Buffer usage.

Tasks: 1429 total,   5 running, 1424 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  2.9%sy,  0.0%ni, 94.0%id,  3.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132270088k total, 131370888k used,   899200k free, 1828k buffers
Swap: 61407100k total, 7940k used, 61399160k free,10488k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
   47 root  RT   0 000 S 30.0  0.0 372:57.33 migration/11

I had done zero tuning so I am getting the default ARC size of 1/2 the memory.

[root@lzfs18b ~]# arcstat.py 1
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
09:11:50     0     0      0     0    0     0    0     0    0    63G   63G
09:11:51  6.2K  2.6K     41    20   66  2.4K   71     0    0    63G   63G
09:11:52   21K  4.0K     18    30   52  3.7K   34    18    0    63G   63G

The question is, if I have 128GB of RAM and ARC is only taking 63, where did 
the rest go and how can I get it back so that the OOM killer stops killing me?
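
For reference, a minimal sketch of pinning the ARC down explicitly (the 48 GiB 
figure is purely illustrative, not a recommendation):

   # /etc/modprobe.d/zfs.conf  (takes effect at module load)
   options zfs zfs_arc_max=51539607552
   # or at runtime:
   echo 51539607552 > /sys/module/zfs/parameters/zfs_arc_max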

Thanks!

Tim


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] ZFS-lustre setup fails for me

2016-04-22 Thread Riccardo Veraldi

Hello,
I am trying to deploy a small cluster with a "Lustre on ZFS" setup for 
initial testing of an SSD solution.


I followed a few hints on the following resources:

http://zfsonlinux.org/lustre-configure-single.html
http://zfsonlinux.org/lustre.html


I am working on rhel7

I installed the following ZFS rpm:

dkms-2.2.0.3-30.git.7c3e7c5.el7.noarch.rpm
libnvpair1-0.6.5.6-1.el7.centos.x86_64.rpm
libuutil1-0.6.5.6-1.el7.centos.x86_64.rpm
libzfs2-0.6.5.6-1.el7.centos.x86_64.rpm
libzpool2-0.6.5.6-1.el7.centos.x86_64.rpm
spl-0.6.5.6-1.el7.centos.x86_64.rpm
spl-dkms-0.6.5.6-1.el7.centos.noarch.rpm
zfs-0.6.5.6-1.el7.centos.x86_64.rpm
zfs-dkms-0.6.5.6-1.el7.centos.noarch.rpm

ZFS itself is working fine.
I can create zpools and they are working.

So I did proceed with Lustre.
I did not install the official Lustre RPMs, but went for the 
Lustre RPMs that are inside the ZFS on Linux repo.

So I installed:

lustre-2.5.3-1zfs.x86_64.rpm
lustre-dkms-2.5.3-1zfs.el6.noarch.rpm
lustre-iokit-2.5.3-1zfs.x86_64.rpm
lustre-osd-zfs-2.5.3-1zfs.x86_64.rpm
lustre-tests-2.5.3-1zfs.x86_64.rpm

I went on configuring mds and oss.

when I create the lustre/zfs file system for mgs:

mkfs.lustre --mgs --backfstype=zfs psanatest-mgt0/mgt0 /dev/vdb 
Bus error (core dumped)

Do you have any hints? Any how-to I could/should look for?
zfs is a requirement for me.
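
A bus error from mkfs.lustre may point at mismatched userspace/kernel ZFS bits 
rather than at the command line itself; for reference, a few sanity checks (a 
sketch, not a diagnosis):

   dkms status                      # are spl/zfs/lustre built for the running kernel?
   modinfo zfs | grep '^version'
   rpm -q zfs libzfs2 lustre-osd-zfs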

thank you very much


Riccardo





___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS version for lustre-2.8 binaries?

2016-04-12 Thread Bob Ball
What I recall hearing is that, because of IO issues with 0.6.5, official 
Lustre 2.8.0 support is with zfs 0.6.4.2


bob

On 4/12/2016 11:21 AM, Nathan Smith wrote:

In doing a test install of vanilla Lustre 2.8 [0] and zfs-0.6.5.6 I
received the symbol mismatch error when attempting to start lustre:

 osd_zfs: disagrees about version of symbol *

I checked the release notes [1] and saw that support was bumped to
0.6.5.2, so I tried that, but received the same error. I saw in jira
that there were other subsequent versions bumps of zfs tagged for 2.8,
and it is at least up to 0.6.5.3. [2]

Long story short: I built lustre 2.8 from source against spl/zfs
0.6.5.6 and everything works well. But does anyone know the compatible
version for the "official" 2.8 binaries? Thanks.


[0] 
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el7.2.1511/server/
[1] https://wiki.hpdd.intel.com/display/PUB/Changelog+2.8
[2] https://jira.hpdd.intel.com/browse/LU-7316




___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] ZFS version for lustre-2.8 binaries?

2016-04-12 Thread Nathan Smith
In doing a test install of vanilla Lustre 2.8 [0] and zfs-0.6.5.6 I
received the symbol mismatch error when attempting to start lustre:

osd_zfs: disagrees about version of symbol *

I checked the release notes [1] and saw that support was bumped to
0.6.5.2, so I tried that, but received the same error. I saw in jira
that there were other subsequent versions bumps of zfs tagged for 2.8,
and it is at least up to 0.6.5.3. [2]

Long story short: I built lustre 2.8 from source against spl/zfs
0.6.5.6 and everything works well. But does anyone know the compatible
version for the "official" 2.8 binaries? Thanks.
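
For reference, a quick way to compare what the running kernel has against what 
osd_zfs was built for (a sketch; output will vary):

   modinfo zfs | grep -E '^(version|srcversion)'
   modinfo osd_zfs | grep -E '^(depends|srcversion)'
   dkms status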


[0] 
https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el7.2.1511/server/
[1] https://wiki.hpdd.intel.com/display/PUB/Changelog+2.8
[2] https://jira.hpdd.intel.com/browse/LU-7316


-- 
Nathan Smith
Research Systems Engineer
Advanced Computing Center
Oregon Health & Science University
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] ZFS inode accounting

2016-01-29 Thread Hans Henrik Happe

Hi,

I've been testing 2.7.64 on ZFS and discovered that inode accounting 
showed very high counts:


lfs quota -g others /lustre/hpc
Disk quotas for group others (gid 8000):
     Filesystem  kbytes       quota       limit  grace                 files  quota  limit  grace
    /lustre/hpc      41  1073741824  1073741824      -  18446744073709536075      0      0      -


That looks like a 64-bit uint counter that is going below zero. It 
keeps counting down when I run mdtest with 2 or more processes. With one 
process it's the same before and after. Guess it must be a race.
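
(Plain arithmetic on the reported value: read as an unsigned 64-bit counter 
that has wrapped, 2^64 - 18446744073709536075 = 18446744073709551616 - 
18446744073709536075 = 15541, i.e. the file count has been decremented 15541 
times below zero, which fits the suspected race.)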


The syslog on MDT shows this message:

LustreError: 8563:0:(osd_object.c:1485:osd_object_create()) hpc-MDT: failed to add [0x21b73:0x1cae8:0x0] to accounting ZAP for grp 8000 (-2)



After reading LU-2435 and LU-5638 I guess there are still some issues 
that need to be sorted out?


Is there a way to make Lustre recheck the quota?

Cheers,
Hans Henrik
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] zfs and lustre 2.5.3.90

2016-01-26 Thread Peter Kjellström
On Fri, 15 Jan 2016 15:28:46 +
Frederick Lefebvre  wrote:

FWIW, we at NSC run three sets of lustre+zfs in production currently:

2.5.3-5chaos + 0.6.4.1-1 + 2.6.32-504.16.2
2.5.3-something + 0.6.3-1.2 + 2.6.32-504.8.1.el6.nsc1
2.4.2-llnl13chaos + 0.6.3-1 + 2.6.32-431.20.3

> If you think you need a more recent version of ZFS, we have run Lustre
> 2.5.3 with ZFS up to 0.6.5.3 by building Lustre with patches from the
> following jiras:
...

If running 0.6.5.3 you might want to be aware that master has been
stepped back to 0.6.4:

Author:    Jinshan Xiong   2015-12-23 22:14:31
Committer: Oleg Drokin     2016-01-05 19:57:08
Parent:    eb6cd4804d65dda1b6ea4a1289cc01647d03a47a (LU-7223 tests: print more information when mmp.sh failed)
Child:     08fcdbb95cd7ab3fc1246f03c3ef27c0b8a0d218 (LU-7084 obd: correct some OBD allocator macro defines)
Branches:  remotes/origin/iu-sk, remotes/origin/master
Follows:   2.7.64, v2_7_64, v2_7_64_0
Precedes:  2.7.65, v2_7_65, v2_7_65_0

LU-7404 zfs: reset ZFS baseline to 0.6.4.2

ZFS 0.6.5.2 is known to introduce I/O problems with the following
stack backtrace:

Call Trace:
 [] ? vdev_mirror_child_done+0x0/0x30 [zfs]
 [] io_schedule+0x73/0xc0
 [] cv_wait_common+0xaf/0x130 [spl]
 [] ? autoremove_wake_function+0x0/0x40
...

/Peter K
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] zfs and lustre 2.5.3.90

2016-01-15 Thread Frederick Lefebvre
If you think you need a more recent version of ZFS, we have run Lustre
2.5.3 with ZFS up to 0.6.5.3 by building Lustre with patches from the
following jiras:
https://jira.hpdd.intel.com/browse/LU-6152
https://jira.hpdd.intel.com/browse/LU-6459
https://jira.hpdd.intel.com/browse/LU-6816

Regards,

Frederick

On Fri, Jan 15, 2016 at 7:21 AM Dilger, Andreas 
wrote:

> On 2016/01/12, 14:21, "lustre-discuss on behalf of Kurt Strosahl"
> 
> wrote:
>
> >Hello,
> >
> >What is the highest version of zfs supported by lustre 2.5.3.90?
>
> Looks like 0.6.3, according to the "lbuild" script's SPLZFSVER.  In the
> master branch of Lustre we now add this information into lustre/ChangeLog
> along with the kernel versions and such.
>
> Cheers, Andreas
> --
> Andreas Dilger
>
> Lustre Principal Architect
> Intel High Performance Data Division
>
>
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] zfs and lustre 2.5.3.90

2016-01-15 Thread Alexander I Kulyavtsev
Frederick,
thanks for the patch list! 
It is nice to know the patch set(s) which is/are actually running in 
production. 
We have been running zfs/spl 0.6.4.1 in production for the last six months with 
the last 2.5.3 GA release (Sept'14).

Is tag 2.5.3.90 considered stable?
I was cautious about using 2.5.3.90 as there can be critical patches before the 
final release.

There are not many differences between Intel's 2.5.3.90 and 2.5.3-llnl, if we 
disregard the ldiskfs patches, which we do not use, unless there are 'collateral' 
changes in a patch that are not reflected in the commit message.
We did try 2.5.3-llnl build, it was working fine with 2.5.3 clients (or 1.8.9 
only). There were client crashes when we mounted both old 1.8.9 lustre and new 
lustre 2.5.3 on the same 1.8.9 client. We need that for transitional period as 
worker nodes need to access both 2.5 and 1.8 systems during migration. Last 
Intel's 2.5.3 GA release does not have this issue (+one stability patch).
We moved most of the data to the new system and reconfigured most oss/ost to 
new 2.5.3 lustre. No "in-place" conversions. Thus the issue of compatibility 
with the 1.8.9 client will not be relevant when we complete migration and upgrade 
clients.

What client version shall we use with 2.5.3 servers?
2.5.3 is obvious. 
Reportedly the 2.6 client has performance improvements. I built a 2.7.0 client and 
rebalanced several TB with crc checks; it seems OK. 
Is 2.7.0 stable? 
Shall we look at the 2.8.0 client, or is it too early?

Alex.

On Jan 15, 2016, at 9:28 AM, Frederick Lefebvre 
 wrote:

> If you think you need a more recent version of ZFS, we have run Lustre 2.5.3 
> with ZFS up to 0.6.5.3 by building Lustre with patches from the following 
> jiras:
> https://jira.hpdd.intel.com/browse/LU-6152
> https://jira.hpdd.intel.com/browse/LU-6459
> https://jira.hpdd.intel.com/browse/LU-6816
> 
> Regards,
> 
> Frederick
> 
> On Fri, Jan 15, 2016 at 7:21 AM Dilger, Andreas  
> wrote:
> On 2016/01/12, 14:21, "lustre-discuss on behalf of Kurt Strosahl"
> 
> wrote:
> 
> >Hello,
> >
> >What is the highest version of zfs supported by lustre 2.5.3.90?
> 
> Looks like 0.6.3, according to the "lbuild" script's SPLZFSVER.  In the
> master branch of Lustre we now add this information into lustre/ChangeLog
> along with the kernel versions and such.
> 
> Cheers, Andreas
> --
> Andreas Dilger
> 
> Lustre Principal Architect
> Intel High Performance Data Division
> 
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] zfs and lustre 2.5.3.90

2016-01-15 Thread Kurt Strosahl
Thanks,

   That is what I was looking for.  I'd seen it mentioned in a few different 
places that higher versions of ZFS could be used.

w/r,
Kurt

- Original Message -
From: "Frederick Lefebvre" <frederick.lefeb...@calculquebec.ca>
To: "Andreas Dilger" <andreas.dil...@intel.com>, "Kurt Strosahl" 
<stros...@jlab.org>
Cc: lustre-discuss@lists.lustre.org
Sent: Friday, January 15, 2016 10:28:46 AM
Subject: Re: [lustre-discuss] zfs and lustre 2.5.3.90

If you think you need a more recent version of ZFS, we have run Lustre
2.5.3 with ZFS up to 0.6.5.3 by building Lustre with patches from the
following jiras:
https://jira.hpdd.intel.com/browse/LU-6152
https://jira.hpdd.intel.com/browse/LU-6459
https://jira.hpdd.intel.com/browse/LU-6816

Regards,

Frederick

On Fri, Jan 15, 2016 at 7:21 AM Dilger, Andreas <andreas.dil...@intel.com>
wrote:

> On 2016/01/12, 14:21, "lustre-discuss on behalf of Kurt Strosahl"
> <lustre-discuss-boun...@lists.lustre.org on behalf of stros...@jlab.org>
> wrote:
>
> >Hello,
> >
> >What is the highest version of zfs supported by lustre 2.5.3.90?
>
> Looks like 0.6.3, according to the "lbuild" script's SPLZFSVER.  In the
> master branch of Lustre we now add this information into lustre/ChangeLog
> along with the kernel versions and such.
>
> Cheers, Andreas
> --
> Andreas Dilger
>
> Lustre Principal Architect
> Intel High Performance Data Division
>
>
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] zfs ? Re: interrupted tar archive of an mdt ldiskfs

2015-07-13 Thread Alexander I Kulyavtsev
What about zfs MDT backup/restore in lustre 2.5.3?

I took a look at the referenced manual pages; they say nothing about zfs MDT 
backup. 
I believe we just use zfs send/receive in this case. Do I need to fix the OI / FID 
mapping? 
Shall I run offline lfsck and wait???
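
For what it's worth, a minimal sketch of the send/receive route (pool/dataset 
names illustrative). Because zfs send copies the dataset verbatim, the OI files 
and FID mappings should come across intact, so no OI rebuild should be needed, 
unlike a file-level (tar) backup:

   zfs snapshot mdtpool/mdt0@bak
   zfs send mdtpool/mdt0@bak | zfs receive backuppool/mdt0-bak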

Alex.


On Jul 13, 2015, at 2:09 PM, Henwood, Richard richard.henw...@intel.com wrote:

 On Mon, 2015-07-13 at 11:20 -0700, John White wrote:
 Yea, I’m benchmarking rsync right now, it doesn’t seem much faster than the 
 initial tar was at all.
 
 Can you elaborate on the risk on 2.x systems?..  
 
 
 Backing up a 2.x MDT (or OST) is described in manual for:
 
 file level (MDT only supported since 2.3):
 https://build.hpdd.intel.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#dbdoclet.50438207_21638
 
 device level:
 https://build.hpdd.intel.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#dbdoclet.50438207_71633
 
 I, personally, think it does an OK job of describing the limitations of
 file and device backups - but there is always room for improvement:
 https://wiki.hpdd.intel.com/display/PUB/Making+changes+to+the+Lustre+Manual
 
 cheers,
 Richard
 -- 
 richard.henw...@intel.com
 Tel: +1 512 410 9612
 Intel High Performance Data Division
 ___
 lustre-discuss mailing list
 lustre-discuss@lists.lustre.org
 http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

