Re: [lustre-discuss] problem getting high performance output to single file

2015-05-19 Thread Nguyen Viet Cuong
David,

You are right, there is a lock. As Patrick mentioned, the fix tracked in
https://jira.hpdd.intel.com/browse/LU-1669 should solve your problem. Please
check it out.

In my own experience, the Lustre 2.7.0 client does solve this problem very
well, and I have gotten very good performance so far.

Regards,
Cuong

On Wed, May 20, 2015 at 4:46 AM, David A. Schneider 
david...@slac.stanford.edu wrote:

 We do use checksums, but can't turn them off. I know we've measured some
 performance penalty with checksums. I'll check about configuring the lustre
 clients to use RDMA. We ran into something similar where our MPI
 programs were not taking advantage of the InfiniBand - we noticed much
 slower message passing than we expected - it sounds like there is a similar
 thing we can do with lustre, but I guess the locking is the main issue. All
 our compute nodes are currently running Red Hat 5 and it doesn't look like
 lustre 2.6 was tested with RHEL5, but we have been talking about moving
 everything to at least RHEL6, maybe RHEL7, so there's hope. Thanks for the
 help!

 best,

 David


 On 05/19/15 11:10, Patrick Farrell wrote:

 Ah.  I think I know what's going on here:

 In Lustre 2.x client versions prior to 2.6, only one process on a given
 client can write to a given file at a time, regardless of how the file is
 striped.  So if you are writing to the same file, there will be little to
 no benefit of putting an extra process on the same node.

 A *single* process on a node could benefit, but not the split you've
 described.

 The details, which are essentially just that a pair of per-file locks are
 used by any individual process writing to a file, are here:
 https://jira.hpdd.intel.com/browse/LU-1669


 On 5/19/15, 12:59 PM, Mohr Jr, Richard Frank (Rick Mohr) rm...@utk.edu
 wrote:

  On May 19, 2015, at 1:44 PM, Schneider, David A.
 david...@slac.stanford.edu wrote:

 Thanks for the suggestion! When I had each rank run on a separate
 compute node/host, I saw parallel performance (4 seconds for the 6GB of
 writing). When I ran the MPI job on one host (the hosts have 12 cores,
 by default we pack ranks onto as few hosts as possible), things happened
 serially, each rank finished about 2 seconds after a different rank.

 Hmm. That does seem like there is some bottleneck on the client side that
 is limiting the throughput from a single client.  Here are some things
 you could look into (although they might require more tinkering than you
 have permission to do):

 1) Based on your output from "lctl list_nids", it looks like you are
 running IP-over-IB.  Can you configure the clients to use RDMA?  (They
 would have nids like x.x.x.x@o2ib.)

 2) Do you have the option of trying a newer client version?  Earlier
 lustre versions used a single-thread ptlrpcd to manage network traffic,
 but newer versions have a multi-threaded implementation.  You may need to
 compare compatibility with the Lustre version running on the servers
 though.

 3) Do you have checksums disabled?  Try running "lctl get_param
 osc.*.checksums".  If the values are "1", then checksums are enabled,
 which can slow down performance.  You could try setting the value to "0"
 to see if that helps.

 --
 Rick Mohr
 Senior HPC System Administrator
 National Institute for Computational Sciences
 http://www.nics.tennessee.edu





-- 
Nguyen Viet Cuong


Re: [lustre-discuss] problem getting high performance output to single file

2015-05-19 Thread Jeff Johnson

David,

What interconnect are you using for Lustre? ( IB/o2ib [fdr,qdr,ddr], 
Ethernet/tcp [40GbE,10GbE,1GbE] ). You can run 'lctl list_nids' and see 
what protocol lnet is binding to, then look at that interface for the 
specific type.


Also, do you know anything about the server side of your Lustre FS? What 
make/model of block devices are used in OSTs?


--Jeff


On 5/19/15 9:05 AM, Schneider, David A. wrote:

Thanks. For the client, where I am running from, I have

$ cat /proc/fs/lustre/version
lustre: 2.1.6
kernel: patchless_client
build:  jenkins--PRISTINE-2.6.18-348.4.1.el5


best,

David Schneider

From: Patrick Farrell [p...@cray.com]
Sent: Tuesday, May 19, 2015 9:03 AM
To: Schneider, David A.; John Bauer; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] problem getting high performance output to single 
file

For the clients, cat /proc/fs/lustre/version

For the servers, it's the same, but presumably you don't have access.

On 5/19/15, 11:01 AM, Schneider, David A. david...@slac.stanford.edu
wrote:


Hi,

My first test was just to do the for loop where I allocate a 4MB buffer,
initialize it, and delete it. That program ran at about 6GB/sec. Once I
write to a file, I drop down to 370MB/sec. Our top performance for I/O to
one file has been about 400MB/sec.

For this question: Which versions are you using in servers and clients?
I don't know what command to use to determine this; I suspect it is older
since we are on Red Hat 5. I will ask.

best,

David Schneider

From: lustre-discuss [lustre-discuss-boun...@lists.lustre.org] on behalf
of John Bauer [bau...@iodoctors.com]
Sent: Tuesday, May 19, 2015 8:52 AM
To: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] problem getting high performance output to
single file

David

You note that you write a 6GB file.  I suspect that your Linux systems
have significantly more memory than 6GB, meaning your file will end up being
cached in the system buffers.  It won't matter how many OSTs you use, as
you are probably not measuring the speed to the OSTs, but rather, you
are measuring the memory copy speed.
What transfer rate are you seeing?

John

On 5/19/2015 10:40 AM, Schneider, David A. wrote:

I am trying to get good performance with parallel writing to one file
through MPI. Our cluster has high performance when I write to separate
files, but when I use one file - I see very little performance increase.

As I understand it, our cluster defaults to using one OST per file. There
are many OSTs though, which is how we get good performance when writing
to multiple files. I have been using the command

   lfs setstripe

to change the stripe count and block size. I can see that this works:
when I do lfs getstripe, I see the output file is striped, but I'm
getting very little I/O performance gain when I create the striped file.

When working from HDF5 and MPI, I have seen a number of references
about tuning parameters; I haven't dug into this yet. I first want to
make sure lustre has the high output performance at a basic level. I
tried to write a C program that uses simple POSIX calls (open and looping
over writes) but I don't see much increase in performance (I've tried 8
and 19 OSTs, 1MB and 4MB chunks, and I write a 6GB file).

Does anyone know if this should work? What is the simplest C program I
could write to see an increase in output performance after I stripe? Do
I need separate processes/threads with separate file handles? I am on
Linux Red Hat 5. I'm not sure what version of lustre this is. I have
skimmed through a 450-page PDF of the lustre documentation; I saw
references to destructive testing one does in the beginning, but I'm not
sure what I can do now. I think this is the first work we've done to get
high performance when writing a single file, so I'm worried there is
something buried in the lustre configuration that needs to be changed. I
can run /usr/sbin/lctl; maybe there are certain parameters I should
check?

best,

David Schneider

--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com





--
--
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite

Re: [lustre-discuss] problem getting high performance output to single file

2015-05-19 Thread David A. Schneider
We do use checksums, but can't turn them off. I know we've measured some 
performance penalty with checksums. I'll check about configuring the lustre 
clients to use RDMA. We ran into something similar where our MPI 
programs were not taking advantage of the InfiniBand - we noticed much 
slower message passing than we expected - it sounds like there is a 
similar thing we can do with lustre, but I guess the locking is the main 
issue. All our compute nodes are currently running Red Hat 5 and it 
doesn't look like lustre 2.6 was tested with RHEL5, but we have been 
talking about moving everything to at least RHEL6, maybe RHEL7, so 
there's hope. Thanks for the help!


best,

David

On 05/19/15 11:10, Patrick Farrell wrote:

Ah.  I think I know what's going on here:

In Lustre 2.x client versions prior to 2.6, only one process on a given
client can write to a given file at a time, regardless of how the file is
striped.  So if you are writing to the same file, there will be little to
no benefit of putting an extra process on the same node.

A *single* process on a node could benefit, but not the split you've
described.

The details, which are essentially just that a pair of per-file locks are
used by any individual process writing to a file, are here:
https://jira.hpdd.intel.com/browse/LU-1669


On 5/19/15, 12:59 PM, Mohr Jr, Richard Frank (Rick Mohr) rm...@utk.edu
wrote:


On May 19, 2015, at 1:44 PM, Schneider, David A.
david...@slac.stanford.edu wrote:

Thanks for the suggestion! When I had each rank run on a separate
compute node/host, I saw parallel performance (4 seconds for the 6GB of
writing). When I ran the MPI job on one host (the hosts have 12 cores,
by default we pack ranks onto as few hosts as possible), things happened
serially, each rank finished about 2 seconds after a different rank.

Hmm. That does seem like there is some bottleneck on the client side that
is limiting the throughput from a single client.  Here are some things
you could look into (although they might require more tinkering than you
have permission to do):

1) Based on your output from "lctl list_nids", it looks like you are
running IP-over-IB.  Can you configure the clients to use RDMA?  (They
would have nids like x.x.x.x@o2ib.)

2) Do you have the option of trying a newer client version?  Earlier
lustre versions used a single-thread ptlrpcd to manage network traffic,
but newer versions have a multi-threaded implementation.  You may need to
compare compatibility with the Lustre version running on the servers
though.

3) Do you have checksums disabled?  Try running "lctl get_param
osc.*.checksums".  If the values are "1", then checksums are enabled,
which can slow down performance.  You could try setting the value to "0"
to see if that helps.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu



Re: [lustre-discuss] problem getting high performance output to single file

2015-05-19 Thread Schneider, David A.
Hi,

My first test was just to do the for loop where I allocate a 4MB buffer, 
initialize it, and delete it. That program ran at about 6GB/sec. Once I write 
to a file, I drop down to 370MB/sec. Our top performance for I/O to one file 
has been about 400MB/sec.

For this question: Which versions are you using in servers and clients? 
I don't know what command to use to determine this; I suspect it is older 
since we are on Red Hat 5. I will ask.

best,

David Schneider

From: lustre-discuss [lustre-discuss-boun...@lists.lustre.org] on behalf of 
John Bauer [bau...@iodoctors.com]
Sent: Tuesday, May 19, 2015 8:52 AM
To: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] problem getting high performance output to single 
file

David

You note that you write a 6GB file.  I suspect that your Linux systems
have significantly more memory than 6GB, meaning your file will end up being
cached in the system buffers.  It won't matter how many OSTs you use, as
you are probably not measuring the speed to the OSTs, but rather, you
are measuring the memory copy speed.
What transfer rate are you seeing?

John

On 5/19/2015 10:40 AM, Schneider, David A. wrote:
 I am trying to get good performance with parallel writing to one file through 
 MPI. Our cluster has high performance when I write to separate files, but 
 when I use one file - I see very little performance increase.

 As I understand it, our cluster defaults to using one OST per file. There are 
 many OSTs though, which is how we get good performance when writing to 
 multiple files. I have been using the command

   lfs setstripe

 to change the stripe count and block size. I can see that this works: when I 
 do lfs getstripe, I see the output file is striped, but I'm getting very 
 little I/O performance gain when I create the striped file.

 When working from HDF5 and MPI, I have seen a number of references about 
 tuning parameters; I haven't dug into this yet. I first want to make sure 
 lustre has the high output performance at a basic level. I tried to write a C 
 program that uses simple POSIX calls (open and looping over writes) but I 
 don't see much increase in performance (I've tried 8 and 19 OSTs, 1MB and 
 4MB chunks, and I write a 6GB file).

 Does anyone know if this should work? What is the simplest C program I could 
 write to see an increase in output performance after I stripe? Do I need 
 separate processes/threads with separate file handles? I am on Linux Red Hat 
 5. I'm not sure what version of lustre this is. I have skimmed through a 
 450-page PDF of the lustre documentation; I saw references to destructive 
 testing one does in the beginning, but I'm not sure what I can do now. I 
 think this is the first work we've done to get high performance when writing 
 a single file, so I'm worried there is something buried in the lustre 
 configuration that needs to be changed. I can run /usr/sbin/lctl; maybe there 
 are certain parameters I should check?

 best,

 David Schneider

--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com



Re: [lustre-discuss] problem getting high performance output to single file

2015-05-19 Thread Mohr Jr, Richard Frank (Rick Mohr)

 On May 19, 2015, at 11:40 AM, Schneider, David A. 
 david...@slac.stanford.edu wrote:
 
 When working from hdf5 and mpi, I have seen a number of references about 
 tuning parameters, I haven't dug into this yet. I first want to make sure 
 lustre has the high output performance at a basic level. I tried to write a C 
 program uses simple POSIX calls (open and looping over writes) but I don't 
 see much increase in performance (I've tried 8 and 19 OST's, 1MB and 4MB 
 chunks, I write a 6GB file). 
 
 Does anyone know if this should work? What is the simplest C program I could 
 write to see an increase in output performance after I stripe? Do I need 
 separate processes/threads with separate file handles?

If you are looking for a simple shared-file test, you could try something like 
this:

1) Create a file with a stripe size of 1 GB and a stripe count of 6.

2) Write an MPI program where each process writes 1 GB of sequential data.  
Each process should first seek to (mpi_rank)*(1GB) and then write 1 GB.  This 
will ensure that all processes are writing to non-overlapping parts of the file.

3) Start the program running on 6 nodes (1 process per node).

In a scenario like that, you should effectively be getting file-per-process 
speeds even though you are writing to a shared file because each process is 
writing to a different OST.
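
A minimal sketch of step 2 in C (untested; hypothetical path, and assuming the
file was pre-created as in step 1 with something like
lfs setstripe -c 6 -S 1g /lustre/scratch/shared_test):

#include <mpi.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK    (4L * 1024 * 1024)           /* 4 MB per write() */
#define PER_RANK (1024L * 1024 * 1024)        /* 1 GB per process */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *buf = malloc(CHUNK);
    memset(buf, rank, CHUNK);

    /* Every rank opens the same pre-striped file with its own descriptor
       and seeks to its own non-overlapping 1 GB region. */
    int fd = open("/lustre/scratch/shared_test", O_WRONLY);
    if (fd < 0) { perror("open"); MPI_Abort(MPI_COMM_WORLD, 1); }
    lseek(fd, (off_t)rank * PER_RANK, SEEK_SET);

    MPI_Barrier(MPI_COMM_WORLD);              /* synchronize the start */
    double t0 = MPI_Wtime();
    for (long done = 0; done < PER_RANK; done += CHUNK)
        if (write(fd, buf, CHUNK) != CHUNK) { perror("write"); break; }
    fsync(fd);                                /* count the cache flush, too */
    close(fd);
    MPI_Barrier(MPI_COMM_WORLD);              /* wait for the slowest rank */
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("aggregate: %.0f MB/s\n", nprocs * 1024.0 / (t1 - t0));

    free(buf);
    MPI_Finalize();
    return 0;
}

Run it with one process per node (e.g. mpirun -np 6 -npernode 1 ./shared_test;
the exact flags vary by MPI implementation) and compare against your
file-per-process numbers.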

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu



Re: [lustre-discuss] problem getting high performance output to single file

2015-05-19 Thread Patrick Farrell
For the clients, cat /proc/fs/lustre/version

For the servers, it's the same, but presumably you don't have access.

On 5/19/15, 11:01 AM, Schneider, David A. david...@slac.stanford.edu
wrote:

Hi,

My first test was just to do the for loop where I allocate a 4MB buffer,
initialize it, and delete it. That program ran at about 6GB/sec. Once I
write to a file, I drop down to 370MB/sec. Our top performance for I/O to
one file has been about 400MB/sec.

For this question: Which versions are you using in servers and clients?
I don't know what command to use to determine this; I suspect it is older
since we are on Red Hat 5. I will ask.

best,

David Schneider

From: lustre-discuss [lustre-discuss-boun...@lists.lustre.org] on behalf
of John Bauer [bau...@iodoctors.com]
Sent: Tuesday, May 19, 2015 8:52 AM
To: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] problem getting high performance output to
single file

David

You note that you write a 6GB file.  I suspect that your Linux systems
have significantly more memory than 6GB, meaning your file will end up being
cached in the system buffers.  It won't matter how many OSTs you use, as
you are probably not measuring the speed to the OSTs, but rather, you
are measuring the memory copy speed.
What transfer rate are you seeing?

John

On 5/19/2015 10:40 AM, Schneider, David A. wrote:
 I am trying to get good performance with parallel writing to one file
through MPI. Our cluster has high performance when I write to separate
files, but when I use one file - I see very little performance increase.

 As I understand it, our cluster defaults to using one OST per file. There
are many OSTs though, which is how we get good performance when writing
to multiple files. I have been using the command

   lfs setstripe

 to change the stripe count and block size. I can see that this works:
when I do lfs getstripe, I see the output file is striped, but I'm
getting very little I/O performance gain when I create the striped file.

 When working from HDF5 and MPI, I have seen a number of references
about tuning parameters; I haven't dug into this yet. I first want to
make sure lustre has the high output performance at a basic level. I
tried to write a C program that uses simple POSIX calls (open and looping
over writes) but I don't see much increase in performance (I've tried 8
and 19 OSTs, 1MB and 4MB chunks, and I write a 6GB file).

 Does anyone know if this should work? What is the simplest C program I
could write to see an increase in output performance after I stripe? Do
I need separate processes/threads with separate file handles? I am on
Linux Red Hat 5. I'm not sure what version of lustre this is. I have
skimmed through a 450-page PDF of the lustre documentation; I saw
references to destructive testing one does in the beginning, but I'm not
sure what I can do now. I think this is the first work we've done to get
high performance when writing a single file, so I'm worried there is
something buried in the lustre configuration that needs to be changed. I
can run /usr/sbin/lctl; maybe there are certain parameters I should
check?

 best,

 David Schneider

--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com




[lustre-discuss] problem getting high performance output to single file

2015-05-19 Thread Schneider, David A.
I am trying to get good performance with parallel writing to one file through 
MPI. Our cluster has high performance when I write to separate files, but when 
I use one file - I see very little performance increase.

As I understand it, our cluster defaults to using one OST per file. There are 
many OSTs though, which is how we get good performance when writing to multiple 
files. I have been using the command

 lfs setstripe 

to change the stripe count and block size. I can see that this works: when I do 
lfs getstripe, I see the output file is striped, but I'm getting very little 
I/O performance gain when I create the striped file.
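
For reference, a typical invocation to stripe a new file across 8 OSTs with a 
4MB stripe size might look like this (hypothetical path; on older clients the 
stripe-size flag is -s rather than -S):

 lfs setstripe -c 8 -S 4M /lustre/scratch/outfile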

When working from HDF5 and MPI, I have seen a number of references about tuning 
parameters; I haven't dug into this yet. I first want to make sure lustre has 
the high output performance at a basic level. I tried to write a C program that 
uses simple POSIX calls (open and looping over writes) but I don't see much 
increase in performance (I've tried 8 and 19 OSTs, 1MB and 4MB chunks, and I 
write a 6GB file).

Does anyone know if this should work? What is the simplest C program I could 
write to see an increase in output performance after I stripe? Do I need 
separate processes/threads with separate file handles? I am on Linux Red Hat 5. 
I'm not sure what version of lustre this is. I have skimmed through a 450-page 
PDF of the lustre documentation; I saw references to destructive testing one 
does in the beginning, but I'm not sure what I can do now. I think this is the 
first work we've done to get high performance when writing a single file, so 
I'm worried there is something buried in the lustre configuration that needs to 
be changed. I can run /usr/sbin/lctl; maybe there are certain parameters I 
should check?

best,

David Schneider


Re: [lustre-discuss] problem getting high performance output to single file

2015-05-19 Thread Nguyen Viet Cuong
Which versions are you using in servers and clients?

On Wed, May 20, 2015 at 12:40 AM, Schneider, David A. 
david...@slac.stanford.edu wrote:

 I am trying to get good performance with parallel writing to one file
 through MPI. Our cluster has high performance when I write to separate
 files, but when I use one file - I see very little performance increase.

 As I understand it, our cluster defaults to using one OST per file. There
 are many OSTs though, which is how we get good performance when writing to
 multiple files. I have been using the command

  lfs setstripe

 to change the stripe count and block size. I can see that this works: when
 I do lfs getstripe, I see the output file is striped, but I'm getting very
 little I/O performance gain when I create the striped file.

 When working from HDF5 and MPI, I have seen a number of references about
 tuning parameters; I haven't dug into this yet. I first want to make sure
 lustre has the high output performance at a basic level. I tried to write a
 C program that uses simple POSIX calls (open and looping over writes) but I
 don't see much increase in performance (I've tried 8 and 19 OSTs, 1MB and
 4MB chunks, and I write a 6GB file).

 Does anyone know if this should work? What is the simplest C program I
 could write to see an increase in output performance after I stripe? Do I
 need separate processes/threads with separate file handles? I am on Linux
 Red Hat 5. I'm not sure what version of lustre this is. I have skimmed
 through a 450-page PDF of the lustre documentation; I saw references to
 destructive testing one does in the beginning, but I'm not sure what I can
 do now. I think this is the first work we've done to get high performance
 when writing a single file, so I'm worried there is something buried in the
 lustre configuration that needs to be changed. I can run /usr/sbin/lctl;
 maybe there are certain parameters I should check?

 best,

 David Schneider




-- 
Nguyen Viet Cuong


Re: [lustre-discuss] problem getting high performance output to single file

2015-05-19 Thread John Bauer

David

You note that you write a 6GB file.  I suspect that your Linux systems 
have significantly more memory than 6GB, meaning your file will end up being 
cached in the system buffers.  It won't matter how many OSTs you use, as 
you are probably not measuring the speed to the OSTs, but rather, you 
are measuring the memory copy speed.

What transfer rate are you seeing?
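
One way to take the page cache out of the measurement is to include fsync()
in the timed region, so the flush to the OSTs is counted (a rough sketch with
a hypothetical path and error checks trimmed; O_DIRECT with aligned buffers
is the other option):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const long chunk = 4L * 1024 * 1024;          /* 4 MB per write() */
    const long total = 6L * 1024 * 1024 * 1024;   /* 6 GB file */
    char *buf = malloc(chunk);
    memset(buf, 'x', chunk);

    int fd = open("/lustre/scratch/testfile",
                  O_WRONLY | O_CREAT | O_TRUNC, 0644);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long done = 0; done < total; done += chunk)
        write(fd, buf, chunk);
    fsync(fd);                    /* flush cached pages to the OSTs */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.0f MB/s\n", total / 1048576.0 / secs);
    free(buf);
    return 0;
}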

John

On 5/19/2015 10:40 AM, Schneider, David A. wrote:

I am trying to get good performance with parallel writing to one file through 
MPI. Our cluster has high performance when I write to separate files, but when 
I use one file - I see very little performance increase.

As I understand it, our cluster defaults to using one OST per file. There are 
many OSTs though, which is how we get good performance when writing to multiple 
files. I have been using the command

  lfs setstripe

to change the stripe count and block size. I can see that this works: when I do 
lfs getstripe, I see the output file is striped, but I'm getting very little 
I/O performance gain when I create the striped file.

When working from HDF5 and MPI, I have seen a number of references about tuning 
parameters; I haven't dug into this yet. I first want to make sure lustre has 
the high output performance at a basic level. I tried to write a C program that 
uses simple POSIX calls (open and looping over writes) but I don't see much 
increase in performance (I've tried 8 and 19 OSTs, 1MB and 4MB chunks, and I 
write a 6GB file).

Does anyone know if this should work? What is the simplest C program I could 
write to see an increase in output performance after I stripe? Do I need 
separate processes/threads with separate file handles? I am on Linux Red Hat 5. 
I'm not sure what version of lustre this is. I have skimmed through a 450-page 
PDF of the lustre documentation; I saw references to destructive testing one 
does in the beginning, but I'm not sure what I can do now. I think this is the 
first work we've done to get high performance when writing a single file, so 
I'm worried there is something buried in the lustre configuration that needs to 
be changed. I can run /usr/sbin/lctl; maybe there are certain parameters I 
should check?

best,

David Schneider


--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com



Re: [lustre-discuss] problem getting high performance output to single file

2015-05-19 Thread Schneider, David A.
Thanks. For the client, where I am running from, I have 

$ cat /proc/fs/lustre/version
lustre: 2.1.6
kernel: patchless_client
build:  jenkins--PRISTINE-2.6.18-348.4.1.el5


best,

David Schneider

From: Patrick Farrell [p...@cray.com]
Sent: Tuesday, May 19, 2015 9:03 AM
To: Schneider, David A.; John Bauer; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] problem getting high performance output to single 
file

For the clients, cat /proc/fs/lustre/version

For the servers, it's the same, but presumably you don't have access.

On 5/19/15, 11:01 AM, Schneider, David A. david...@slac.stanford.edu
wrote:

Hi,

My first test was just to do the for loop where I allocate a 4MB buffer,
initialize it, and delete it. That program ran at about 6GB/sec. Once I
write to a file, I drop down to 370MB/sec. Our top performance for I/O to
one file has been about 400MB/sec.

For this question: Which versions are you using in servers and clients?
I don't know what command to use to determine this; I suspect it is older
since we are on Red Hat 5. I will ask.

best,

David Schneider

From: lustre-discuss [lustre-discuss-boun...@lists.lustre.org] on behalf
of John Bauer [bau...@iodoctors.com]
Sent: Tuesday, May 19, 2015 8:52 AM
To: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] problem getting high performance output to
single file

David

You note that you write a 6GB file.  I suspect that your Linux systems
have significantly more memory than 6GB, meaning your file will end up being
cached in the system buffers.  It won't matter how many OSTs you use, as
you are probably not measuring the speed to the OSTs, but rather, you
are measuring the memory copy speed.
What transfer rate are you seeing?

John

On 5/19/2015 10:40 AM, Schneider, David A. wrote:
 I am trying to get good performance with parallel writing to one file
through MPI. Our cluster has high performance when I write to separate
files, but when I use one file - I see very little performance increase.

 As I understand it, our cluster defaults to using one OST per file. There
are many OSTs though, which is how we get good performance when writing
to multiple files. I have been using the command

   lfs setstripe

 to change the stripe count and block size. I can see that this works:
when I do lfs getstripe, I see the output file is striped, but I'm
getting very little I/O performance gain when I create the striped file.

 When working from HDF5 and MPI, I have seen a number of references
about tuning parameters; I haven't dug into this yet. I first want to
make sure lustre has the high output performance at a basic level. I
tried to write a C program that uses simple POSIX calls (open and looping
over writes) but I don't see much increase in performance (I've tried 8
and 19 OSTs, 1MB and 4MB chunks, and I write a 6GB file).

 Does anyone know if this should work? What is the simplest C program I
could write to see an increase in output performance after I stripe? Do
I need separate processes/threads with separate file handles? I am on
Linux Red Hat 5. I'm not sure what version of lustre this is. I have
skimmed through a 450-page PDF of the lustre documentation; I saw
references to destructive testing one does in the beginning, but I'm not
sure what I can do now. I think this is the first work we've done to get
high performance when writing a single file, so I'm worried there is
something buried in the lustre configuration that needs to be changed. I
can run /usr/sbin/lctl; maybe there are certain parameters I should
check?

 best,

 David Schneider

--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com




Re: [lustre-discuss] problem getting high performance output to single file

2015-05-19 Thread Schneider, David A.
Hi Jeff,

I know we have InfiniBand; however, when I ran lctl, what I see (maybe I should 
not put our IP addresses on the internet, so I'll xxx them out) is

.xx.xx.xx@tcp2
.xx.xx.xx@tcp

Unfortunately, I'm not sure how to look at the interface for these types; maybe 
they are in turn connected to InfiniBand.

I don't know much about the OSTs. I know there is a RAID structure that allows 
for the 400MB/sec on each one. In one of my tests, I believe I wrote 44GB in 
100 separate files in under 10 seconds, so the system can support 4.4GB/sec.

best,

David Schneider

From: Jeff Johnson [jeff.john...@aeoncomputing.com]
Sent: Tuesday, May 19, 2015 9:11 AM
To: Schneider, David A.; Patrick Farrell; John Bauer; 
lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] problem getting high performance output to single 
file

David,

What interconnect are you using for Lustre? ( IB/o2ib [fdr,qdr,ddr],
Ethernet/tcp [40GbE,10GbE,1GbE] ). You can run 'lctl list_nids' and see
what protocol lnet is binding to, then look at that interface for the
specific type.

Also, do you know anything about the server side of your Lustre FS? What
make/model of block devices are used in OSTs?

--Jeff


On 5/19/15 9:05 AM, Schneider, David A. wrote:
 Thanks. For the client, where I am running from, I have

 $ cat /proc/fs/lustre/version
 lustre: 2.1.6
 kernel: patchless_client
 build:  jenkins--PRISTINE-2.6.18-348.4.1.el5


 best,

 David Schneider
 
 From: Patrick Farrell [p...@cray.com]
 Sent: Tuesday, May 19, 2015 9:03 AM
 To: Schneider, David A.; John Bauer; lustre-discuss@lists.lustre.org
 Subject: Re: [lustre-discuss] problem getting high performance output to 
 single file

 For the clients, cat /proc/fs/lustre/version

 For the servers, it's the same, but presumably you don't have access.

 On 5/19/15, 11:01 AM, Schneider, David A. david...@slac.stanford.edu
 wrote:

 Hi,

 My first test was just to do the for loop where I allocate a 4MB buffer,
 initialize it, and delete it. That program ran at about 6GB/sec. Once I
 write to a file, I drop down to 370MB/sec. Our top performance for I/O to
 one file has been about 400MB/sec.

 For this question: Which versions are you using in servers and clients?
 I don't know what command to use to determine this; I suspect it is older
 since we are on Red Hat 5. I will ask.

 best,

 David Schneider
 
 From: lustre-discuss [lustre-discuss-boun...@lists.lustre.org] on behalf
 of John Bauer [bau...@iodoctors.com]
 Sent: Tuesday, May 19, 2015 8:52 AM
 To: lustre-discuss@lists.lustre.org
 Subject: Re: [lustre-discuss] problem getting high performance output to
 single file

 David

 You note that you write a 6GB file.  I suspect that your Linux systems
 have significantly more memory than 6GB, meaning your file will end up being
 cached in the system buffers.  It won't matter how many OSTs you use, as
 you are probably not measuring the speed to the OSTs, but rather, you
 are measuring the memory copy speed.
 What transfer rate are you seeing?

 John

 On 5/19/2015 10:40 AM, Schneider, David A. wrote:
 I am trying to get good performance with parallel writing to one file
 through MPI. Our cluster has high performance when I write to separate
 files, but when I use one file - I see very little performance increase.

 As I understand it, our cluster defaults to using one OST per file. There
 are many OSTs though, which is how we get good performance when writing
 to multiple files. I have been using the command

    lfs setstripe

 to change the stripe count and block size. I can see that this works:
 when I do lfs getstripe, I see the output file is striped, but I'm
 getting very little I/O performance gain when I create the striped file.

 When working from HDF5 and MPI, I have seen a number of references
 about tuning parameters; I haven't dug into this yet. I first want to
 make sure lustre has the high output performance at a basic level. I
 tried to write a C program that uses simple POSIX calls (open and looping
 over writes) but I don't see much increase in performance (I've tried 8
 and 19 OSTs, 1MB and 4MB chunks, and I write a 6GB file).

 Does anyone know if this should work? What is the simplest C program I
 could write to see an increase in output performance after I stripe? Do
 I need separate processes/threads with separate file handles? I am on
 Linux Red Hat 5. I'm not sure what version of lustre this is. I have
 skimmed through a 450-page PDF of the lustre documentation; I saw
 references to destructive testing one does in the beginning, but I'm not
 sure what I can do now. I think this is the first work we've done to get
 high performance when writing a single file, so I'm worried there is
 something buried in the lustre configuration that needs to be changed. I
 can run /usr/sbin/lctl; maybe there are certain parameters I should
 check?

 best,

 David Schneider

Re: [lustre-discuss] problem getting high performance output to single file

2015-05-19 Thread Mohr Jr, Richard Frank (Rick Mohr)

 On May 19, 2015, at 1:44 PM, Schneider, David A. david...@slac.stanford.edu 
 wrote:
 
 Thanks for the suggestion! When I had each rank run on a separate compute 
 node/host, I saw parallel performance (4 seconds for the 6GB of writing). 
 When I ran the MPI job on one host (the hosts have 12 cores, by default we 
 pack ranks onto as few hosts as possible), things happened serially, each 
 rank finished about 2 seconds after a different rank.

Hmm. That does seem like there is some bottleneck on the client side that is 
limiting the throughput from a single client.  Here are some things you could 
look into (although they might require more tinkering than you have permission 
to do):

1) Based on your output from “lctl list_nids”, it looks like you are running 
IP-over-IB.  Can you configure the clients to use RDMA?  (They would have nids 
like x.x.x.x@o2ib.)

2) Do you have the option of trying a newer client version?  Earlier lustre 
versions used a single-thread ptlrpcd to manage network traffic, but newer 
versions have a multi-threaded implementation.  You may need to compare 
compatibility with the Lustre version running on the servers though.

3) Do you have checksums disabled?  Try running “lctl get_param 
osc.*.checksums”.  If the values are “1”, then checksums are enabled, which can 
slow down performance.  You could try setting the value to “0” to see if that 
helps.
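
For example (the exact device names will vary, and a plain set_param does not 
persist across a remount):

 $ lctl get_param osc.*.checksums
 $ lctl set_param osc.*.checksums=0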

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu



Re: [lustre-discuss] problem getting high performance output to single file

2015-05-19 Thread Schneider, David A.
Thanks for the suggestion! When I had each rank run on a separate compute 
node/host, I saw parallel performance (4 seconds for the 6GB of writing). When 
I ran the MPI job on one host (the hosts have 12 cores, by default we pack 
ranks onto as few hosts as possible), things happened serially, each rank 
finished about 2 seconds after a different rank. I'm told that the hosts can 
handle a lot of I/O, but it seems there are some issues with getting that to 
work well. I believe we get good performance with different ranks on one host 
reading from different files. I'll look into tuning the MPI/HDF5 parameters 
now, with an eye toward designing my application to write from different 
hosts. My initial tests with MPI showed degraded performance when I used 
different hosts for the writing, but maybe there are some parameters that will 
help. I can try the OpenMPI forum at that point.

best,

David Schneider

From: Mohr Jr, Richard Frank (Rick Mohr) [rm...@utk.edu]
Sent: Tuesday, May 19, 2015 9:15 AM
To: Schneider, David A.
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] problem getting high performance output to single 
file

 On May 19, 2015, at 11:40 AM, Schneider, David A. 
 david...@slac.stanford.edu wrote:

 When working from hdf5 and mpi, I have seen a number of references about 
 tuning parameters, I haven't dug into this yet. I first want to make sure 
 lustre has the high output performance at a basic level. I tried to write a C 
 program uses simple POSIX calls (open and looping over writes) but I don't 
 see much increase in performance (I've tried 8 and 19 OST's, 1MB and 4MB 
 chunks, I write a 6GB file).

 Does anyone know if this should work? What is the simplest C program I could 
 write to see an increase in output performance after I stripe? Do I need 
 separate processes/threads with separate file handles?

If you are looking for a simple shared-file test, you could try something like 
this:

1) Create a file with a stripe size of 1 GB and a stripe count of 6.

2) Write an MPI program where each process writes 1 GB of sequential data.  
Each process should first seek to (mpi_rank)*(1GB) and then write 1 GB.  This 
will ensure that all processes are writing to non-overlapping parts of the file.

3) Start the program running on 6 nodes (1 process per node).

In a scenario like that, you should effectively be getting file-per-process 
speeds even though you are writing to a shared file because each process is 
writing to a different OST.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu



Re: [lustre-discuss] problem getting high performance output to single file

2015-05-19 Thread Patrick Farrell
Ah.  I think I know what's going on here:

In Lustre 2.x client versions prior to 2.6, only one process on a given
client can write to a given file at a time, regardless of how the file is
striped.  So if you are writing to the same file, there will be little to
no benefit of putting an extra process on the same node.

A *single* process on a node could benefit, but not the split you've
described.

The details, which are essentially just that a pair of per-file locks are
used by any individual process writing to a file, are here:
https://jira.hpdd.intel.com/browse/LU-1669


On 5/19/15, 12:59 PM, Mohr Jr, Richard Frank (Rick Mohr) rm...@utk.edu
wrote:


 On May 19, 2015, at 1:44 PM, Schneider, David A.
david...@slac.stanford.edu wrote:
 
 Thanks for the suggestion! When I had each rank run on a separate
compute node/host, I saw parallel performance (4 seconds for the 6GB of
writing). When I ran the MPI job on one host (the hosts have 12 cores,
by default we pack ranks onto as few hosts as possible), things happened
serially, each rank finished about 2 seconds after a different rank.

Hmm. That does seem like there is some bottleneck on the client side that
is limiting the throughput from a single client.  Here are some things
you could look into (although they might require more tinkering than you
have permission to do):

1) Based on your output from "lctl list_nids", it looks like you are
running IP-over-IB.  Can you configure the clients to use RDMA?  (They
would have nids like x.x.x.x@o2ib.)

2) Do you have the option of trying a newer client version?  Earlier
lustre versions used a single-thread ptlrpcd to manage network traffic,
but newer versions have a multi-threaded implementation.  You may need to
compare compatibility with the Lustre version running on the servers
though.

3) Do you have checksums disabled?  Try running "lctl get_param
osc.*.checksums".  If the values are "1", then checksums are enabled,
which can slow down performance.  You could try setting the value to "0"
to see if that helps.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu
