[lustre-discuss] About the data journaling mode of Lustre with ldiskfs.

2017-02-06 Thread Hanggi CUI
Hello, I'm studying the Lustre distributed file system at our college.

Last month we started benchmarking Lustre performance, comparing the ordered
journaling mode with the data journaling mode. We used ldiskfs as our backend
FS, which is mainly based on ext3/ext4 (we added the mount option data=journal
that ext3 defines). But we didn't get the results we expected.
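
As a sketch of what I mean by setting data=journal on an ldiskfs target (the
device name below is only a placeholder):

tune2fs -o journal_data /dev/sdb
# or via the Lustre mount options; note that --mountfsoptions replaces the
# existing defaults, so the usual options need to be repeated:
tunefs.lustre --mountfsoptions="errors=remount-ro,data=journal" /dev/sdb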

So I started searching for information and analyzing the source code by adding
printk() calls.
I found, for example, that llite is the client VFS interface, LOD stripes the
data, osc sends data to the OSTs, fsfilt has been replaced by OSD, and
obdfilter is now named OFD, which handles reads and writes from the OSC, etc.
(this may not be entirely accurate).
But no ldiskfs write functions are called on the OST when I write a file or
run fio, so it only ever journals metadata, not data.

Could someone tell me whether Lustre with ldiskfs supports a data journaling
mode, and why or why not?

Or point me to more detailed documentation about this?


Thank you very much,
Hanggi.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Status of LU-8703 for Knights Landing

2017-02-06 Thread Prout, Andrew - LLSC - MITLL
Patrick,

Yes, it was a hard stop; libcfs refused to insmod. I expect the issue would not 
appear if you have MCDRAM configured in cache mode, so it would depend on 
how you have that set up.



Thanks, I wasn't aware of the module parameter to bypass the 
problematic detection code. Using "cpu_pattern" worked nicely.



Andrew Prout

Lincoln Laboratory Supercomputing Center

MIT Lincoln Laboratory

244 Wood Street, Lexington, MA 02420



From: Patrick Farrell [mailto:p...@cray.com]
Sent: Wednesday, February 01, 2017 4:27 PM
To: Prout, Andrew - LLSC - MITLL; lustre-discuss@lists.lustre.org
Subject: Re: Status of LU-8703 for Knights Landing



Andrew,



Are they really just not working?  I didn't see that with KNL (the default CPT 
generated without the fixes from LU-8703 is very weird, but didn't affect 
performance much - the real NUMA-ness of KNL processors seems to be minimal, 
despite the various NUMA-related configuration options...), but Cray systems 
are unusual and I don't think I ever saw an empty NUMA node (possibly something 
we fix in the BIOS).  Anyway, you should be able to work around this without 
patching your client; just set some module parameters before loading the 
modules/starting Lustre.



I can think of two things which should work; both are module parameters for the 
libcfs module, I believe.  I haven't tried this, so it's possible your error is 
coming earlier in the loading process...  But I think not, based on the message.



1. Limit yourself to 1 partition, by setting cpu_npartitions to 1.

static int cpu_npartitions;
module_param(cpu_npartitions, int, 0444);
MODULE_PARM_DESC(cpu_npartitions, "# of CPU partitions");



2. Or, you could draw up a CPU partition table yourself.  Parameter name is 
cpu_pattern.



Here's the code describing that:
"

/**

 * modparam for setting CPU partitions patterns:

 *

 * i.e: "0[0,1,2,3] 1[4,5,6,7]", number before bracket is CPU partition ID,

 *  number in bracket is processor ID (core or HT)

 *

 * i.e: "N 0[0,1] 1[2,3]" the first character 'N' means numbers in bracket

 *   are NUMA node ID, number before bracket is CPU partition ID.

 *

 * i.e: "N", shortcut expression to create CPT from NUMA & CPU topology

 *

 * NB: If user specified cpu_pattern, cpu_npartitions will be ignored

 */

static char *cpu_pattern = "N";

module_param(cpu_pattern, charp, 0444);

MODULE_PARM_DESC(cpu_pattern, "CPU partitions pattern");"



Notice the default pattern is N, but you can override it.



(Code references from libcfs/libcfs/linux/linux-cpu.c in Lustre.)
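
As a concrete illustration (a sketch only - the file name is arbitrary and the 
pattern value is simply the example from the code comment above), either option 
can be set in a modprobe config before the modules are loaded:

# /etc/modprobe.d/libcfs.conf
options libcfs cpu_npartitions=1
# or, alternatively, give an explicit pattern:
# options libcfs cpu_pattern="0[0,1,2,3] 1[4,5,6,7]"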



Either of those should let you get past the error with no need to carry patches.  
I can't speak to the production-readiness of the patches, but I'd definitely go 
the module parameter route if it were my system.



- Patrick


From: lustre-discuss on behalf of Prout, Andrew - LLSC - MITLL
Sent: Wednesday, February 1, 2017 3:11:07 PM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Status of LU-8703 for Knights Landing



Anyone know the production-readiness of the patches attached to LU-8703 to fix 
issues with Lustre on Xeon Phi Knights Landing hardware? We're considering 
merging them against our 2.9.0 client to get it working on our KNL nodes.



Andrew Prout

Lincoln Laboratory Supercomputing Center

MIT Lincoln Laboratory

244 Wood Street, Lexington, MA 02420

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] design to enable kernel updates

2017-02-06 Thread Vicker, Darby (JSC-EG311)
Agreed.  We are just about to go into production on our next LFS with the 
setup described.  We had to get past a bug in the MGS failover for 
dual-homed servers, but as of last week that is done and everything is 
working great (see the "MGS failover problem" thread on this mailing list from
this month and last).  We are in the process of syncing our existing LFS
to this new one, and I've failed over/rebooted/upgraded the new LFS servers
many times now to make sure we can do this in practice when the new LFS goes
into production.  It's working beautifully.
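
For anyone building something similar, the relevant piece looks roughly like
this (a sketch only - the NIDs and device below are placeholders, not our
actual configuration):

mkfs.lustre --mgs \
  --servicenode=192.168.1.1@tcp,10.0.1.1@o2ib \
  --servicenode=192.168.1.2@tcp,10.0.1.2@o2ib \
  /dev/mapper/mgt

With both service nodes declared, clients and targets simply retry the other
NID when the node they were talking to goes down for a reboot.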

Many thanks to the Lustre developers for their continued efforts.  We have 
been using and have been fans of Lustre for quite some time now and it 
just keeps getting better.

-----Original Message-----
From: lustre-discuss on behalf of Ben Evans
Date: Monday, February 6, 2017 at 2:22 PM
To: Brian Andrus, "lustre-discuss@lists.lustre.org"
Subject: Re: [lustre-discuss] design to enable kernel updates

It's certainly possible.  When I've done that sort of thing, you upgrade
the OS on all the servers first, boot half of them (the A side) to the new
image, and all the targets will fail over to the B servers.  Once the A side
is up, reboot the B half to the new OS.  Finally, do a failback to the
"normal" running state.

At least when I've done it, you'll want to do the failovers manually so
the HA infrastructure doesn't surprise you for any reason.

-Ben

On 2/6/17, 2:54 PM, "lustre-discuss on behalf of Brian Andrus" wrote:

>All,
>
>I have been contemplating how Lustre could be configured such that I
>could update the kernel on each server without downtime.
>
>It seems this is _almost_ possible when you have a SAN system so you
>have failover for OSTs and MDTs. BUT the MGS/MGT seems to be the
>problematic one, since rebooting it seems to cause downtime that cannot
>be avoided.
>
>If you have a system where the disks are physically part of the OSS
>hardware, you are out of luck. The hypothetical scenario I am using is
>if someone had a VM that was a qcow image on a Lustre mount (basically
>an active, open file being read/written to continuously). How could
>Lustre be configured to ensure that anyone on the VM would not notice a
>kernel upgrade to the underlying Lustre servers?
>
>
>Could such a setup be done? It seems that would be a better use case for
>something like GPFS or Gluster, but being a die-hard Lustre enthusiast,
>I want to at least show it could be done.
>
>
>Thanks in advance,
>
>Brian Andrus
>
>___
>lustre-discuss mailing list
>lustre-discuss@lists.lustre.org
>http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org




Re: [lustre-discuss] Traffic compression?

2017-02-06 Thread Ben Evans
My initial question is: what are you measuring, and where are you measuring it?

There are many different layers of caching happening, possibly all at the same 
time.  If you're benchmarking, it's much better to figure out your maximum 
sustained read/write speeds than to rely on peaks.
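
One quick way to see where reads are being satisfied on a client is to compare
the llite-level and osc-level counters (a sketch - exact stat names can vary
between Lustre versions):

lctl get_param llite.*.stats | grep read_bytes   # what applications read
lctl get_param osc.*.stats | grep read_bytes     # what actually crossed the network

Reads served from the client page cache show up in the first number but not
the second.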

-Ben

From: lustre-discuss on behalf of "E.S. Rosenberg"
Date: Monday, February 6, 2017 at 3:25 PM
To: "lustre-discuss@lists.lustre.org"
Subject: [lustre-discuss] Traffic compression?

We started closer monitoring of resources on our cluster and I noticed that 
there is sometimes a big discrepancy between the read traffic reported by 
Lustre and the incoming traffic reported by InfiniBand (which is the interface 
carrying the Lustre traffic).

Currently I have a 4.4GB/s peak on Lustre while InfiniBand at the same time is 
showing just 1.4GB/s of traffic (there is also a 2 minute difference between 
the two peaks).
This is the summation of all the nodes (without the servers) in the cluster.
The stats are gathered using collectl at a 1 minute interval.

Thanks,
Eli

(There are also lots of stats that match 1:1 which makes me less sure what to 
make of this)
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Traffic compression?

2017-02-06 Thread E.S. Rosenberg
We started closer monitoring of resources on our cluster and I noticed that
there is sometimes a big discrepancy between the read traffic reported by
Lustre and the incoming traffic reported by InfiniBand (which is the
interface carrying the Lustre traffic).

Currently I have a 4.4GB/s peak on Lustre while InfiniBand at the same time
is showing just 1.4GB/s of traffic (there is also a 2 minute difference
between the two peaks).
This is the summation of all the nodes (without the servers) in the cluster.
The stats are gathered using collectl at a 1 minute interval.

Thanks,
Eli

(There are also lots of stats that match 1:1 which makes me less sure what
to make of this)
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] design to enable kernel updates

2017-02-06 Thread Ben Evans
It's certainly possible.  When I've done that sort of thing, you upgrade
the OS on all the servers first, boot half of them (the A side) to the new
image, and all the targets will fail over to the B servers.  Once the A side
is up, reboot the B half to the new OS.  Finally, do a failback to the
"normal" running state.

At least when I've done it, you'll want to do the failovers manually so
the HA infrastructure doesn't surprise you for any reason.
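
For illustration, a manual failover of a single OST looks roughly like this (a
sketch - hostnames and devices are placeholders, and it assumes the target was
formatted with --servicenode entries for both OSS nodes):

oss-a# umount /mnt/lustre/ost0000
oss-b# mount -t lustre /dev/mapper/ost0000 /mnt/lustre/ost0000

Clients stall briefly, replay any outstanding requests against the new server,
and then carry on.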

-Ben

On 2/6/17, 2:54 PM, "lustre-discuss on behalf of Brian Andrus" wrote:

>All,
>
>I have been contemplating how Lustre could be configured such that I
>could update the kernel on each server without downtime.
>
>It seems this is _almost_ possible when you have a SAN system so you
>have failover for OSTs and MDTs. BUT the MGS/MGT seems to be the
>problematic one, since rebooting it seems to cause downtime that cannot
>be avoided.
>
>If you have a system where the disks are physically part of the OSS
>hardware, you are out of luck. The hypothetical scenario I am using is
>if someone had a VM that was a qcow image on a Lustre mount (basically
>an active, open file being read/written to continuously). How could
>Lustre be configured to ensure that anyone on the VM would not notice a
>kernel upgrade to the underlying Lustre servers?
>
>
>Could such a setup be done? It seems that would be a better use case for
>something like GPFS or Gluster, but being a die-hard Lustre enthusiast,
>I want to at least show it could be done.
>
>
>Thanks in advance,
>
>Brian Andrus
>
>___
>lustre-discuss mailing list
>lustre-discuss@lists.lustre.org
>http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] design to enable kernel updates

2017-02-06 Thread Brian Andrus

All,

I have been contemplating how Lustre could be configured such that I 
could update the kernel on each server without downtime.


It seems this is _almost_ possible when you have a SAN system so you 
have failover for OSTs and MDTs. BUT the MGS/MGT seems to be the 
problematic one, since rebooting it seems to cause downtime that cannot 
be avoided.


If you have a system where the disks are physically part of the OSS 
hardware, you are out of luck. The hypothetical scenario I am using is 
if someone had a VM that was a qcow image on a Lustre mount (basically 
an active, open file being read/written to continuously). How could 
Lustre be configured to ensure that anyone on the VM would not notice a 
kernel upgrade to the underlying Lustre servers?
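
For what it's worth, I assume clients in such a setup would mount with both
MGS NIDs listed so they can follow an MGS failover, something like the
following (made-up names):

mount -t lustre mgs-a@o2ib:mgs-b@o2ib:/testfs /mnt/testfs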



Could such a setup be done? It seems that would be a better use case for 
something like GPFS or Gluster, but being a die-hard Lustre enthusiast, 
I want to at least show it could be done.



Thanks in advance,

Brian Andrus

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Backup software for Lustre

2017-02-06 Thread Brett Lee
Hey Mike,

"Chapter 17" and

http://www.intel.com/content/www/us/en/lustre/backup-and-restore-training.html

both contain methods to back up and restore the entire Lustre file system.

Are you looking for a solution that backs up only the (user) data files and
their associated metadata (e.g. xattrs)?
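
If it is just the user data, a plain file-level copy from a client may be
enough - as a sketch (paths are placeholders, and stripe layouts are
re-created from the directory defaults on restore unless you save them
separately):

rsync -aHAX /lustre/projectA/ /backup/projectA/
lfs getstripe -d /lustre/projectA > /backup/projectA.layout   # directory default layout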

Brett
--
Protect Yourself From Cybercrime
PDS Software Solutions LLC
https://www.TrustPDS.com 

On Mon, Feb 6, 2017 at 11:12 AM, Mike Selway  wrote:

> Hello,
>
>Anyone aware of and/or using a backup software package to
> protect their LFS environment (not referring to the tools/scripts suggested
> in Chapter 17)?
>
>
>
> Regards,
>
> Mike
>
>
>
> *Mike Selway* *|** Sr. Tiered Storage Architect | Cray Inc.*
>
> Work +1-301-332-4116 | msel...@cray.com
>
> 146 Castlemaine Ct,   Castle Rock,  CO  80104 | www.cray.com
>
>
>
>
>
>
>
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Backup software for Lustre

2017-02-06 Thread Mike Selway
Hello,
   Anyone aware of and/or using a backup software package to 
protect their LFS environment (not referring to the tools/scripts suggested in 
Chapter 17)?

Regards,
Mike

Mike Selway | Sr. Tiered Storage Architect | Cray Inc.
Work +1-301-332-4116 | msel...@cray.com
146 Castlemaine Ct, Castle Rock, CO 80104 | www.cray.com


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] LNET Self-test

2017-02-06 Thread Oucharek, Doug S
Try running just a read test and then just a write test rather than having both 
at the same time and see if the performance goes up.
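
For example, something along these lines (a sketch based on Jon's script,
untested):

lst add_batch bulk_r
lst add_test --batch bulk_r --concurrency 12 --from readers --to servers \
brw read check=simple size=1M
lst run bulk_r
lst stat servers & sleep 30; kill $!
lst stop bulk_r
lst add_batch bulk_w
lst add_test --batch bulk_w --concurrency 12 --from writers --to servers \
brw write check=simple size=1M
lst run bulk_w
lst stat servers & sleep 30; kill $!
lst stop bulk_w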

Doug

> On Feb 6, 2017, at 4:40 AM, Jon Tegner  wrote:
> 
> Hi,
> 
> I used the following script:
> 
> #!/bin/bash
> export LST_SESSION=$$
> lst new_session read/write
> lst add_group servers 10.0.12.12@o2ib
> lst add_group readers 10.0.12.11@o2ib
> lst add_group writers 10.0.12.11@o2ib
> lst add_batch bulk_rw
> lst add_test --batch bulk_rw --concurrency 12 --from readers --to servers \
> brw read check=simple size=1M
> lst add_test --batch bulk_rw --concurrency 12 --from writers --to servers \
> brw write check=simple size=1M
> # start running
> lst run bulk_rw
> # display server stats for 30 seconds
> lst stat servers & sleep 30; kill $!
> # tear down
> lst end_session
> 
> and tried concurrency values of 0, 2, 4, 8, 12, and 16; the results are at
> 
> http://renget.se/lnetBandwidth.png
> and
> http://renget.se/lnetRates.png
> 
> From the bandwidth plot, a maximum of just below 2800 MB/s can be seen. Since in this case 
> "readers" and "writers" are the same, I did a few tests with the line
> 
> lst add_test --batch bulk_rw --concurrency 12 --from writers --to servers \
> brw write check=simple size=1M
> 
> removed from the script - which resulted in a bandwidth of around 3600 MB/s.
> 
> I also did tests using mpitests-osu_bw from openmpi, and in that case I 
> monitored a bandwidth of about 3900 MB/s.
> 
> Considering the "openmpi-bandwidth" should I be happy with the numbers 
> obtained by LNet selftest? Is there a way to modify the test so that the 
> result gets closer to what openmpi is giving? And what can be said of the 
> "Rates of servers (RPC/s)" - are they "good" or "bad"? What to compare them 
> with?
> 
> Thanks!
> 
> /jon
> 
> On 02/05/2017 08:55 PM, Jeff Johnson wrote:
>> Without seeing your entire command it is hard to say for sure but I would 
>> make sure your concurrency option is set to 8 for starters.
>> 
>> --Jeff
>> 
>> Sent from my iPhone
>> 
>>> On Feb 5, 2017, at 11:30, Jon Tegner  wrote:
>>> 
>>> Hi,
>>> 
>>> I'm trying to use lnet selftest to evaluate network performance on a test 
>>> setup (only two machines). Using e.g., iperf or Netpipe I've managed to 
>>> demonstrate the bandwidth of the underlying 10 Gbits/s network (and 
>>> typically you reach the expected bandwidth as the packet size increases).
>>> 
>>> How can I do the same using lnet selftest (i.e., verifying the bandwidth of 
>>> the underlying hardware)? My initial thought was to increase the I/O size, 
>>> but it seems the maximum size one can use is "--size=1M".
>>> 
>>> Thanks,
>>> 
>>> /jon
>>> ___
>>> lustre-discuss mailing list
>>> lustre-discuss@lists.lustre.org
>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> 
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] LNET Self-test

2017-02-06 Thread Oucharek, Doug S
You can have larger RPCs, but those get split up into 1M LNet operations.  
LNet selftest works with LNet messages, not RPCs.

Doug

On Feb 5, 2017, at 3:07 PM, Patrick Farrell wrote:

Doug,

It seems to me that's not true any more, with larger RPC sizes available.  Is 
there some reason that's not true?

- Patrick

From: lustre-discuss on behalf of Oucharek, Doug S
Sent: Sunday, February 5, 2017 3:18:10 PM
To: Jeff Johnson
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] LNET Self-test

Yes, you can bump your concurrency.  Size caps out at 1M because that is how 
LNet is set up to work.  Going over 1M in size would result in an unrealistic 
Lustre test.

Doug

> On Feb 5, 2017, at 11:55 AM, Jeff Johnson wrote:
>
> Without seeing your entire command it is hard to say for sure but I would 
> make sure your concurrency option is set to 8 for starters.
>
> --Jeff
>
> Sent from my iPhone
>
>> On Feb 5, 2017, at 11:30, Jon Tegner wrote:
>>
>> Hi,
>>
>> I'm trying to use lnet selftest to evaluate network performance on a test 
>> setup (only two machines). Using e.g., iperf or Netpipe I've managed to 
>> demonstrate the bandwidth of the underlying 10 Gbits/s network (and 
>> typically you reach the expected bandwidth as the packet size increases).
>>
>> How can I do the same using lnet selftest (i.e., verifying the bandwidth of 
>> the underlying hardware)? My initial thought was to increase the I/O size, 
>> but it seems the maximum size one can use is "--size=1M".
>>
>> Thanks,
>>
>> /jon
>> ___
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



Re: [lustre-discuss] LNET Self-test

2017-02-06 Thread Jon Tegner

Hi,

I used the following script:

#!/bin/bash
export LST_SESSION=$$
lst new_session read/write
lst add_group servers 10.0.12.12@o2ib
lst add_group readers 10.0.12.11@o2ib
lst add_group writers 10.0.12.11@o2ib
lst add_batch bulk_rw
lst add_test --batch bulk_rw --concurrency 12 --from readers --to servers \
brw read check=simple size=1M
lst add_test --batch bulk_rw --concurrency 12 --from writers --to servers \
brw write check=simple size=1M
# start running
lst run bulk_rw
# display server stats for 30 seconds
lst stat servers & sleep 30; kill $!
# tear down
lst end_session

and tried concurrency values of 0, 2, 4, 8, 12, and 16; the results are at

http://renget.se/lnetBandwidth.png
and
http://renget.se/lnetRates.png

From the bandwidth plot, a maximum of just below 2800 MB/s can be seen. Since in 
this case "readers" and "writers" are the same, I did a few tests with 
the line


lst add_test --batch bulk_rw --concurrency 12 --from writers --to servers \
brw write check=simple size=1M

removed from the script - which resulted in a bandwidth of around 3600 MB/s.

I also did tests using mpitests-osu_bw from openmpi, and in that case I 
monitored a bandwidth of about 3900 MB/s.


Considering the "openmpi-bandwidth" should I be happy with the numbers 
obtained by LNet selftest? Is there a way to modify the test so that the 
result gets closer to what openmpi is giving? And what can be said of 
the "Rates of servers (RPC/s)" - are they "good" or "bad"? What to 
compare them with?


Thanks!

/jon

On 02/05/2017 08:55 PM, Jeff Johnson wrote:

Without seeing your entire command it is hard to say for sure but I would make 
sure your concurrency option is set to 8 for starters.

--Jeff

Sent from my iPhone


On Feb 5, 2017, at 11:30, Jon Tegner  wrote:

Hi,

I'm trying to use lnet selftest to evaluate network performance on a test setup 
(only two machines). Using e.g., iperf or Netpipe I've managed to demonstrate 
the bandwidth of the underlying 10 Gbits/s network (and typically you reach the 
expected bandwidth as the packet size increases).

How can I do the same using lnet selftest (i.e., verifying the bandwidth of the 
underlying hardware)? My initial thought was to increase the I/O size, but it seems the 
maximum size one can use is "--size=1M".

Thanks,

/jon
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org