Re: [lustre-discuss] Lustre 2.9 performance issues

2017-04-30 Thread Vicker, Darby (JSC-EG311)
This worked great.  We implemented it on Friday and the timings of the dd test 
on our 2.9/ZFS LFS have dropped to under a second.  Thanks a lot.  For us, the 
risk of both the client and the OSS crashing within a few seconds of each other 
is low enough to be worth the performance gain.  

The commit you pointed to didn't apply cleanly to the 2.9 source.  Please let 
me know if you want us to upload an updated patch to that LU (or post it to 
this list).  
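
In case it helps anyone else trying the same backport, here is a rough sketch of 
pulling the change straight from Gerrit onto a 2.9 tree.  The branch name, tag 
name, patchset number, and repository paths are assumptions on my part; the 
change's Gerrit page shows the exact download command.

   # clone the Lustre tree and start a branch from the 2.9.0 tag (tag name assumed)
   git clone git://git.whamcloud.com/fs/lustre-release.git
   cd lustre-release
   git checkout -b b2_9-nosync 2.9.0
   # fetch change 7761 from Gerrit (patchset 1 assumed) and cherry-pick it;
   # conflicts have to be resolved by hand if it does not apply cleanly
   git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/61/7761/1
   git cherry-pick FETCH_HEAD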

-Original Message-
From: "Dilger, Andreas" <andreas.dil...@intel.com>
Date: Thursday, April 27, 2017 at 6:21 PM
To: "Bass, Ned" <ba...@llnl.gov>, Darby Vicker <darby.vicke...@nasa.gov>
Cc: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] Lustre 2.9 performance issues

On Apr 25, 2017, at 13:11, Bass, Ned <ba...@llnl.gov> wrote:
> 
> Hi Darby,
> 
>> -Original Message-
>> 
>> for i in $(seq 0 99) ; do
>>   dd if=/dev/zero of=dd.dat.$i bs=1k count=1 conv=fsync > /dev/null 2>&1
>> done
>> 
>> The timing of this ranges from 0.1 to 1 sec on our old LFS but ranges from 20
>> to 60 sec on our newer 2.9 LFS.  
> 
> Because Lustre does not yet use the ZFS Intent Log (ZIL), it implements fsync()
> by waiting for an entire transaction group to get written out. This can incur
> long delays on a busy filesystem as the transaction groups become quite large.
> Work on implementing ZIL support is being tracked in LU-4009 but this feature
> is not expected to make it into the upcoming 2.10 release.

There is also the patch that was developed in the past to test this:
https://review.whamcloud.com/7761 "LU-4009 osd-zfs: Add tunables to disable sync"
which makes it possible to disable the wait for TXG commit on each sync on the
servers.

That may be an acceptable workaround in the meantime.  Essentially, clients would
_start_ a sync on the server, but would not wait for completion before returning
to the application.  Both the client and the OSS would need to crash within a few
seconds of the sync for it to be lost.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation









___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.9 performance issues

2017-04-27 Thread Jeff Johnson
While tuning can alleviate some pain, it shouldn't go without mentioning
that some operations are simply less than optimal on a parallel file
system. I'd bet a cold one that a copy to local /tmp, vim/paste, and a copy
back to the LFS would've been quicker. Some single-threaded small-I/O
operations can be approached more efficiently in a similar manner.
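
For what it's worth, a minimal sketch of that copy-edit-copy-back workaround
(the Lustre path below is just a placeholder):

   # edit on local disk so vim's periodic swap-file fsync() never touches Lustre,
   # then push the result back in a single large write
   cp /lustre/fsl/notes.txt /tmp/notes.txt
   vim /tmp/notes.txt
   cp /tmp/notes.txt /lustre/fsl/notes.txt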

Lustre is a fantastic tool and, like most tools, it doesn't do everything
well... *yet*

--Jeff

On Thu, Apr 27, 2017 at 4:21 PM, Dilger, Andreas wrote:

> On Apr 25, 2017, at 13:11, Bass, Ned  wrote:
> >
> > Hi Darby,
> >
> >> -Original Message-
> >>
> >> for i in $(seq 0 99) ; do
> >>   dd if=/dev/zero of=dd.dat.$i bs=1k count=1 conv=fsync > /dev/null 2>&1
> >> done
> >>
> >> The timing of this ranges from 0.1 to 1 sec on our old LFS but ranges from 20
> >> to 60 sec on our newer 2.9 LFS.
> >
> > Because Lustre does not yet use the ZFS Intent Log (ZIL), it implements fsync()
> > by waiting for an entire transaction group to get written out. This can incur
> > long delays on a busy filesystem as the transaction groups become quite large.
> > Work on implementing ZIL support is being tracked in LU-4009 but this feature
> > is not expected to make it into the upcoming 2.10 release.
>
> There is also the patch that was developed in the past to test this:
> https://review.whamcloud.com/7761 "LU-4009 osd-zfs: Add tunables to disable sync"
> which makes it possible to disable the wait for TXG commit on each sync on the
> servers.
>
> That may be an acceptable workaround in the meantime.  Essentially, clients would
> _start_ a sync on the server, but would not wait for completion before returning
> to the application.  Both the client and the OSS would need to crash within a few
> seconds of the sync for it to be lost.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Intel Corporation
>
>
>
>
>
>
>
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>



-- 
--
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite D - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.9 performance issues

2017-04-27 Thread Dilger, Andreas
On Apr 25, 2017, at 13:11, Bass, Ned  wrote:
> 
> Hi Darby,
> 
>> -Original Message-
>> 
>> for i in $(seq 0 99) ; do
>>   dd if=/dev/zero of=dd.dat.$i bs=1k count=1 conv=fsync > /dev/null 2>&1
>> done
>> 
>> The timing of this ranges from 0.1 to 1 sec on our old LFS but ranges from 20
>> to 60 sec on our newer 2.9 LFS.  
> 
> Because Lustre does not yet use the ZFS Intent Log (ZIL), it implements fsync()
> by waiting for an entire transaction group to get written out. This can incur
> long delays on a busy filesystem as the transaction groups become quite large.
> Work on implementing ZIL support is being tracked in LU-4009 but this feature
> is not expected to make it into the upcoming 2.10 release.

There is also the patch that was developed in the past to test this:
https://review.whamcloud.com/7761 "LU-4009 osd-zfs: Add tunables to disable sync"
which makes it possible to disable the wait for TXG commit on each sync on the
servers.

That may be an acceptable workaround in the meantime.  Essentially, clients would
_start_ a sync on the server, but would not wait for completion before returning
to the application.  Both the client and the OSS would need to crash within a few
seconds of the sync for it to be lost.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.9 performance issues

2017-04-26 Thread Vicker, Darby (JSC-EG311)
Thanks for the kstat info.  Our 2.4 LFS has quite a different architecture – 
ldiskfs on a hardware RAID – so there's no opportunity to compare the ZFS kstat 
info between the two.  Our 2.9 LFS is barely in production at this point and 
only a handful of people have moved over to it, so its utilization is quite a 
bit lower than our 2.4 LFS.  Lustre defaults have generally worked well for us, 
so we've done very little tuning.  Our 2.4 LFS uses the Whamcloud Lustre RPMs on 
the servers (i.e. no patches).  About the only tuning we've done (on both the 
2.4 and 2.9 LFSs) is this in /etc/modprobe.d/lustre.conf:

options ko2iblnd map_on_demand=32

Darby


-Original Message-
From: "Bass, Ned" <ba...@llnl.gov>
Date: Tuesday, April 25, 2017 at 2:11 PM
To: Darby Vicker <darby.vicke...@nasa.gov>, "lustre-discuss@lists.lustre.org" 
<lustre-discuss@lists.lustre.org>
Subject: RE: [lustre-discuss] Lustre 2.9 performance issues

Hi Darby,

> -Original Message-
> 
> for i in $(seq 0 99) ; do
>dd if=/dev/zero of=dd.dat.$i bs=1k count=1 conv=fsync > /dev/null 2>&1
> done
> 
> The timing of this ranges from 0.1 to 1 sec on our old LFS but ranges from 20
> to 60 sec on our newer 2.9 LFS.  

Because Lustre does not yet use the ZFS Intent Log (ZIL), it implements fsync()
by waiting for an entire transaction group to get written out. This can incur long
delays on a busy filesystem as the transaction groups become quite large. Work
on implementing ZIL support is being tracked in LU-4009 but this feature is not
expected to make it into the upcoming 2.10 release.

One way to observe this on a given server is with the txgs kstat.

  echo 20 > /sys/module/zfs/parameters/zfs_txg_history # number of txgs to show
  watch cat /proc/spl/kstat/zfs/POOLNAME/txgs

Large values in the time columns (units are nanoseconds) could account for the
delays you're seeing. Conversely I'd expect to see relatively small values on
your 2.4.3 filesystem where fsync() is returning quickly.

As to why it's slower on your newer filesystem, my first guess would be that it's
more heavily utilized. But that's just a guess. I'm assuming it also uses a ZFS
backend. Are there any other relevant tunings or patches you've applied to that
system?

Ned


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.9 performance issues

2017-04-25 Thread Bass, Ned
Hi Darby,

> -Original Message-
> 
> for i in $(seq 0 99) ; do
>dd if=/dev/zero of=dd.dat.$i bs=1k count=1 conv=fsync > /dev/null 2>&1
> done
> 
> The timing of this ranges from 0.1 to 1 sec on our old LFS but ranges from 20
> to 60 sec on our newer 2.9 LFS.  

Because Lustre does not yet use the ZFS Intent Log (ZIL), it implements fsync()
by waiting for an entire transaction group to get written out. This can incur long
delays on a busy filesystem as the transaction groups become quite large. Work
on implementing ZIL support is being tracked in LU-4009 but this feature is not
expected to make it into the upcoming 2.10 release.

One way to observe this on a given server is with the txgs kstat.

  echo 20 > /sys/module/zfs/parameters/zfs_txg_history # number of txgs to show
  watch cat /proc/spl/kstat/zfs/POOLNAME/txgs

Large values in the time columns (units are nanoseconds) could account for the
delays you're seeing. Conversely I'd expect to see relatively small values on
your 2.4.3 filesystem where fsync() is returning quickly.
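
For example, something along these lines pulls out just the per-txg sync times
(a rough sketch; the column layout of the txgs kstat can vary between ZFS
releases, so check the header line first; in zfs-0.6.5 the sync time is the
last column, in nanoseconds):

  # skip the two kstat header lines and print each txg with its sync time in ms
  awk 'NR > 2 { printf "txg %s  sync %.1f ms\n", $1, $NF / 1e6 }' \
      /proc/spl/kstat/zfs/POOLNAME/txgs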

As to why it's slower on your newer filesystem, my first guess would be that it's
more heavily utilized. But that's just a guess. I'm assuming it also uses a ZFS
backend. Are there any other relevant tunings or patches you've applied to that
system?

Ned
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Lustre 2.9 performance issues

2017-04-25 Thread Vicker, Darby (JSC-EG311)
Hello,



We are having a few performance issues with our newest Lustre file system.  
Here is an overview of our setup:



-) Supermicro servers connected to external 12Gb/s SAS JBODs for MDT/OSS storage

-) CentOS version = 7.3.1611 (kernel 3.10.0-514.2.2.el7.x86_64) on the servers 
and clients

-) stock kernels on servers (i.e. no lustre patches to the kernel)

-) ZFS backend

   o  version = zfs-0.6.5.8-1.el7_3.centos.x86_64

   o  ZFS metadata pool: RAID10, 12x 15,000 rpm Seagate Cheetah 12 Gb/s SAS

   o  ZFS ost pools (12x): RAIDZ3 15x 7,200 rpm HGST 8TB 12 Gb/s SAS

   o  ZFS compression on OST's (not MDT)

-) lustre version 2.9.0 (plus a patch to get dual homed servers with failover 
working properly)

-) lnet dual home config – ethernet and infiniband

-) ~400 IB clients, ~60 ethernet clients



Our large-file read/write performance is excellent.  However, "everyday" 
operations (creating, moving, removing, and opening files) are noticeably 
slower than on our older LFS (version 2.4.3).  This surprised us, since both 
the server hardware and software are newer and more capable.  One concrete 
thing we have discovered is that any operation that forces a flush to disk is 
more than 10x slower on our new setup.  We discovered this when pasting a large 
amount of text into vim – the paste took 30+ seconds to return for a file that 
lives on our new LFS, but there was almost no lag for a file on our old LFS, an 
NFS mount, or a local disk.  It turns out vim forces a sync of the swap file to 
disk every 200 characters by default.  We replicated this with a simple dd:



for i in $(seq 0 99) ; do

   dd if=/dev/zero of=dd.dat.$i bs=1k count=1 conv=fsync > /dev/null 2>&1

done



The timing of this ranges from 0.1 to 1 sec on our old LFS but ranges from 20 
to 60 sec on our newer 2.9 LFS.  We’ve tried a number of ZFS tuning options 
(zfs_txg_timeout, zfs_vdev_scheduler, zfs_prefetch_disable,…) with little/no 
impact.
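
For reference, those knobs are the standard ZFS module parameters and are set on 
the servers along the lines below (the values here are only illustrative 
examples, not the settings we actually tried):

   # runtime changes via the module parameters (persistent settings would go in
   # /etc/modprobe.d/zfs.conf, e.g. "options zfs zfs_txg_timeout=2")
   echo 2 > /sys/module/zfs/parameters/zfs_txg_timeout        # txg commit interval, default 5s
   echo deadline > /sys/module/zfs/parameters/zfs_vdev_scheduler   # vdev I/O elevator (may only apply when vdevs are reopened)
   echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable   # turn off file-level prefetch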



Additionally, we temporarily toggled sync’ing on the ZFS filesystems underlying 
our LFS:



# zfs set sync=disabled metadata/meta-fsl
# zfs set sync=disabled oss00-0/ost-fsl

# (repeat 11x for other oss/ost's)



later restoring via



# zfs set sync=standard metadata/meta-fsl

# zfs set sync=standard oss00-0/ost-fsl

# (repeat 11x for other oss/ost's)



We also tried the same test on a raw ZFS filesystem (oss00-0/testfs) in the same 
pool.  This setting significantly speeds up fsync() calls to an individual ZFS 
filesystem (oss00-0/testfs shows a ~10x speedup), but it only marginally improved 
the LFS fsync() behavior (~10%).
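
In other words, the comparison was roughly the following (both mount points are 
placeholders; the first is the raw ZFS filesystem mounted on the OSS, the second 
is the Lustre client mount):

   # raw ZFS filesystem in the same pool, run on the OSS (placeholder mountpoint)
   time dd if=/dev/zero of=/oss00-0/testfs/dd.dat bs=1k count=1 conv=fsync
   # same small fsync'ed write through the Lustre client mount (placeholder mountpoint)
   time dd if=/dev/zero of=/lustre/fsl/dd.dat bs=1k count=1 conv=fsync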



Any ideas on what might be going on here?  Any other ZFS or lustre tuning you 
would try?



Thanks,

Darby
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org