Re: [gpfsug-discuss] very low read performance in simple spectrum scale/gpfs cluster with a storage-server SAN: effect of ignorePrefetchLUNCount

2020-06-16 Thread Jan-Frode Myklebust
On Tue, 16 Jun 2020 at 15:32, Giovanni Bracco wrote:

>
> > I would correct MaxMBpS -- put it at something reasonable, enable
> > verbsRdmaSend=yes and
> > ignorePrefetchLUNCount=yes.
>
> Now we have set:
> verbsRdmaSend yes
> ignorePrefetchLUNCount yes
> maxMBpS 8000
>
> but the only parameter which has a strong effect by itself is
>
> ignorePrefetchLUNCount yes
>
> and the readout performance increased of a factor at least 4, from
> 50MB/s to 210 MB/s



That’s interesting.. ignoreprefetchluncount=yes should mean it more
aggressively schedules IO. Did you also try lowering maxMBpS? I’m thinking
maybe something is getting flooded somewhere..

Another knob would be to increase workerThreads, and/or prefetchPct (don’t
quite remember how these influence each other).
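
For concreteness, all of these are ordinary mmchconfig attributes; a sketch of
how one would experiment with them (the values below are placeholders, not
recommendations):

mmchconfig maxMBpS=4000,workerThreads=512 -i
mmchconfig prefetchPct=40 -i
mmlsconfig maxMBpS

The -i flag applies the change immediately where the attribute allows it; some
of these only take effect after mmfsd is restarted on the affected nodes.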

And it would be useful to run nsdperf between client and nsd-servers, to
verify/rule out any network issue.
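
nsdperf ships with the GPFS samples; the exact path and build steps vary per
release, so treat this as a rough sketch (node names are placeholders):

cd /usr/lpp/mmfs/samples/net
# build nsdperf.C as described in the comments at the top of the source
# (add verbs support if RDMA tests are wanted), then start the daemon on
# every node taking part:
./nsdperf -s &
# from one node, drive the test interactively:
./nsdperf
# at the nsdperf prompt:
#   server nsd1
#   client client1
#   rdma on
#   test
#   quit

This measures pure GPFS-style network throughput without touching the disks.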


> fio --name=seqwrite --rw=write --buffered=1 --ioengine=posixaio --bs=1m
> --numjobs=1 --size=100G --runtime=60
>
> fio --name=seqread --rw=read --buffered=1 --ioengine=posixaio --bs=1m
> --numjobs=1 --size=100G --runtime=60
>
>
Not too familiar with fio, but ... does it help to increase numjobs?
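
A hedged variant of the read test with more parallelism, just to see whether
concurrency changes the picture (job count and sizes are only examples; each
job gets its own file of --size):

fio --name=seqread --rw=read --buffered=1 --ioengine=posixaio --bs=1m \
    --numjobs=4 --size=25G --runtime=60 --group_reporting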

And.. do you tell both sides which fabric number they’re on («verbsPorts
qib0/1/1») so that GPFS knows not to try to connect verbsPorts that can’t
communicate?
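
The fabric number would be the optional third field of verbsPorts
(device/port/fabric), set per node group with mmchconfig. A sketch only; the
node class name is hypothetical and the change normally needs a GPFS restart
on the affected nodes:

mmchconfig verbsPorts="qib0/1/1" -N clientNodes
mmchconfig verbsPorts="qib0/2/1" -N cresco-gpfq7,cresco-gpfq8

The idea being that GPFS only attempts RDMA between ports that share a fabric
number, so ports on disjoint fabrics get different numbers.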


  -jf
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] very low read performance in simple spectrum scale/gpfs cluster with a storage-server SAN: effect of ignorePrefetchLUNCount

2020-06-16 Thread Giovanni Bracco

On 11/06/20 12:13, Jan-Frode Myklebust wrote:
On Thu, Jun 11, 2020 at 9:53 AM Giovanni Bracco wrote:



 >
 > You could potentially still do SRP from QDR nodes, and via NSD
for your
 > omnipath nodes. Going via NSD seems like a bit pointless indirection.

not really: both clusters, the 400 OPA nodes and the 300 QDR nodes share
the same data lake in Spectrum Scale/GPFS so the NSD servers support the
flexibility of the setup.


Maybe there's something I don't understand, but couldn't you use the 
NSD-servers to serve to your OPA nodes, and then SRP directly for your 
300 QDR-nodes??


not in an easy way without losing the flexibility of the system, where 
the NSD servers are the hubs between the three different fabrics: QDR 
compute, OPA compute and the Mellanox FDR SAN.


The storages have QDR, FDR and EDR interfaces and Mellanox guarantees 
compatibility QDR-FDR and FDR-EDR but not, as far as I know, QDR-EDR.
So in this configuration, all the compute nodes can access all the 
storages.





At this moment this is the output of mmlsconfig

# mmlsconfig
Configuration data for cluster GPFSEXP.portici.enea.it:
---
clusterName GPFSEXP.portici.enea.it
clusterId 13274694257874519577
autoload no
dmapiFileHandleSize 32
minReleaseLevel 5.0.4.0
ccrEnabled yes
cipherList AUTHONLY
verbsRdma enable
verbsPorts qib0/1
[cresco-gpfq7,cresco-gpfq8]
verbsPorts qib0/2
[common]
pagepool 4G
adminMode central

File systems in cluster GPFSEXP.portici.enea.it:

/dev/vsd_gexp2
/dev/vsd_gexp3



So, trivially close to default config.. assume the same for the client 
cluster.


I would correct MaxMBpS -- put it at something reasonable, enable 
verbsRdmaSend=yes and ignorePrefetchLUNCount=yes.


Now we have set:
verbsRdmaSend yes
ignorePrefetchLUNCount yes
maxMBpS 8000

but the only parameter which has a strong effect by itself is

ignorePrefetchLUNCount yes

and the readout performance increased by a factor of at least 4, from 
50 MB/s to 210 MB/s


So from the client now the situation is:

Sequential write 800 MB/s, sequential read 200 MB/s, much better than 
before but still a factor of about 3 lower, for both write and read, 
compared with what is observed from the NSD node:


Sequential write 2300 MB/s, sequential read 600 MB/s

As far as the test is concerned I have seen that the lmdd results are 
very similar to


fio --name=seqwrite --rw=write --buffered=1 --ioengine=posixaio --bs=1m 
--numjobs=1 --size=100G --runtime=60


fio --name=seqread --rw=read --buffered=1 --ioengine=posixaio --bs=1m 
--numjobs=1 --size=100G --runtime=60




In the present situation the read-ahead settings on the RAID 
controllers have practically no effect; we have also checked that, by 
the way.


Giovanni




 >
 >
 > 1 MB blocksize is a bit bad for your 9+p+q RAID with 256 KB strip
size.
 > When you write one GPFS block, less than a half RAID stripe is
written,
 > which means you  need to read back some data to calculate new
parities.
 > I would prefer 4 MB block size, and maybe also change to 8+p+q so
that
 > one GPFS is a multiple of a full 2 MB stripe.
 >
 >
 >     -jf

we have now added another file system based on 2 NSDs on RAID6 8+p+q,
keeping the 1 MB block size just not to change too many things at the
same time, but there is no substantial change in the very low readout
performance, which is still of the order of 50 MB/s while write
performance is 1000 MB/s

Any other suggestion is welcomed!



Maybe rule out the storage, and check if you get proper throughput from 
nsdperf?

Maybe also benchmark using "gpfsperf" instead of "lmdd", and show your 
full settings -- so that we see that the benchmark is sane :-)
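
gpfsperf lives under the GPFS samples directory and has to be built there
first; a rough invocation from memory, with file name and sizes as
placeholders:

cd /usr/lpp/mmfs/samples/perf && make
./gpfsperf create seq /gexp2/gpfsperf.tst -n 100g -r 1m -th 4
./gpfsperf read seq /gexp2/gpfsperf.tst -n 100g -r 1m -th 4

where -r is the record size, -n the amount of data and -th the number of
threads.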



   -jf


--
Giovanni Bracco
phone  +39 351 8804788
E-mail  giovanni.bra...@enea.it
WWW http://www.afs.enea.it/bracco
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] very low read performance in simple spectrum scale/gpfs cluster with a storage-server SAN

2020-06-12 Thread Aaron Knister
I would double check your CPU frequency scaling settings on your NSD servers 
(cpupower frequency-info) and look at the governor. You’ll want it to be the 
performance governor. If it’s not, what can happen is that the CPUs scale back 
their clock rate, which hurts RDMA performance. Running the I/O test on the NSD 
servers themselves may have been enough to kick the processors up into a higher 
frequency, which afforded you good performance. 
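
On CentOS 7 that check, and the fix, would be something along these lines:

cpupower frequency-info
cpupower frequency-set -g performance
# or, to make it persistent, a suitable tuned profile:
tuned-adm profile throughput-performance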

Sent from my iPhone

> On Jun 12, 2020, at 00:19, Luis Bolinches  wrote:
> 
> 
> Hi
>  
> the block size for writes increases the IOPS on those cards, which might 
> already be at the limit, so I would not say whether lowering the IOPS for 
> writes has a positive effect on reads or not, but it is a smoking gun that 
> needs to be addressed. My experience of ignoring those is not a positive one.
>  
> Regarding this HW I would love to see a baseline at the raw level. Run FIO 
> (or any other tool that is not dd) on the raw devices (not Scale) to see 
> what each drive can actually do, AND then all the drives at the same time. 
> We have seen RAID controllers brought to their knees even on reads when 
> parallel access to many drives is pushed into the RAID controller. That is 
> why we had to create a tool to get KPIs for ECE, but it can be applied here 
> as a way to see what the system can do. I would build numbers for raw 
> devices before I start looking into any filesystem numbers.
>  
> You can use whatever tool you like, but this one is just a FIO frontend that 
> will do what I mention above: 
> https://github.com/IBM/SpectrumScale_ECE_STORAGE_READINESS. If you can I 
> would also do the write part, as reads are only part of the story, and you 
> need to understand what the HW can do (+1 to the Lego comment before)
> --
> Ystävällisin terveisin / Kind regards / Saludos cordiales / Salutations / 
> Salutacions
> Luis Bolinches
> Consultant IT Specialist
> IBM Spectrum Scale development
> ESS & client adoption teams
> Mobile Phone: +358503112585
>  
> https://www.youracclaim.com/user/luis-bolinches
>  
> Ab IBM Finland Oy
> Laajalahdentie 23
> 00330 Helsinki
> Uusimaa - Finland
> 
> "If you always give you will always have" --  Anonymous
>  
>  
>  
> - Original message -
> From: "Uwe Falke" 
> Sent by: gpfsug-discuss-boun...@spectrumscale.org
> To: gpfsug main discussion list 
> Cc: gpfsug-discuss-boun...@spectrumscale.org, Agostino Funel 
> 
> Subject: [EXTERNAL] Re: [gpfsug-discuss] very low read performance in simple 
> spectrum scale/gpfs cluster with a storage-server SAN
> Date: Thu, Jun 11, 2020 23:42
>  
> Hi Giovanni, how do the waiters look on your clients when reading?
> 
> 
> Mit freundlichen Grüßen / Kind regards
> 
> Dr. Uwe Falke
> IT Specialist
> Global Technology Services / Project Services Delivery / High Performance
> Computing
> +49 175 575 2877 Mobile
> Rathausstr. 7, 09111 Chemnitz, Germany
> uwefa...@de.ibm.com
> 
> IBM Services
> 
> IBM Data Privacy Statement
> 
> IBM Deutschland Business & Technology Services GmbH
> Geschäftsführung: Dr. Thomas Wolter, Sven Schooss
> Sitz der Gesellschaft: Ehningen
> Registergericht: Amtsgericht Stuttgart, HRB 17122
> 
> 
> 
> From:   Giovanni Bracco 
> To: gpfsug-discuss@spectrumscale.org
> Cc: Agostino Funel 
> Date:   05/06/2020 14:22
> Subject:[EXTERNAL] [gpfsug-discuss] very low read performance in
> simple spectrum scale/gpfs cluster with a storage-server SAN
> Sent by:gpfsug-discuss-boun...@spectrumscale.org
> 
> 
> 
> In our lab we have received two storage-servers, Supermicro
> SSG-6049P-E1CR24L, 24 HD each (9TB SAS3), with Avago 3108 RAID
> controller (2 GB cache) and before putting them in production for other
> purposes we have setup a small GPFS test cluster to verify if they can
> be used as storage (our gpfs production cluster has the licenses based
> on the NSD sockets, so it would be interesting to expand the storage
> size just by adding storage-servers in a infiniband based SAN, without
> changing the number of NSD servers)
> 
> The test cluster consists of:
> 
> 1) two NSD servers (IBM x3550M2) with a dual port IB QDR TrueScale each.
> 2) a Mellanox FDR switch used as a SAN switch
> 3) a Truescale QDR switch as GPFS cluster switch
> 4) two GPFS clients (Supermicro AMD nodes) one port QDR each.
> 
> All the nodes run CentOS 7.7.
> 
> On each storage-server a RAID 6 volume of 11 disk, 80 TB, has been
> configured and it is exported via infiniband as an iSCSI target so that
> both appear as devices accessed by the srp_daemon on the NSD servers,
> where multipath (not really necessary in this case) has been configured
> for these two LIO-ORG devices.
> 
>

Re: [gpfsug-discuss] very low read performance in simple spectrum scale/gpfs cluster with a storage-server SAN

2020-06-11 Thread Luis Bolinches
Hi
 
the block size for writes increases the IOPS on those cards, which might already be at the limit, so I would not say whether lowering the IOPS for writes has a positive effect on reads or not, but it is a smoking gun that needs to be addressed. My experience of ignoring those is not a positive one.
 
Regarding this HW I would love to see a baseline at the raw level. Run FIO (or any other tool that is not dd) on the raw devices (not Scale) to see what each drive can actually do, AND then all the drives at the same time. We have seen RAID controllers brought to their knees even on reads when parallel access to many drives is pushed into the RAID controller. That is why we had to create a tool to get KPIs for ECE, but it can be applied here as a way to see what the system can do. I would build numbers for raw devices before I start looking into any filesystem numbers.
 
You can use whatever tool you like, but this one is just a FIO frontend that will do what I mention above: https://github.com/IBM/SpectrumScale_ECE_STORAGE_READINESS. If you can I would also do the write part, as reads are only part of the story, and you need to understand what the HW can do (+1 to the Lego comment before)
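
A minimal sketch of such a raw read baseline with fio, run on an NSD server
against the multipath devices (device names are examples; a raw write test is
destructive, so it is only safe before the LUN has been turned into an NSD):

fio --name=rawread --filename=/dev/mapper/mpatha --rw=read --direct=1 \
    --ioengine=libaio --bs=1m --iodepth=16 --runtime=60 --time_based

Repeating it against both LUNs at the same time then shows whether the RAID
controllers keep up under parallel access.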
--
Ystävällisin terveisin / Kind regards / Saludos cordiales / Salutations / Salutacions
Luis Bolinches
Consultant IT Specialist
IBM Spectrum Scale development
ESS & client adoption teams
Mobile Phone: +358503112585
 
https://www.youracclaim.com/user/luis-bolinches
 
Ab IBM Finland Oy
Laajalahdentie 23
00330 Helsinki
Uusimaa - Finland

"If you always give you will always have" --  Anonymous
 
 
 
- Original message -
From: "Uwe Falke" 
Sent by: gpfsug-discuss-boun...@spectrumscale.org
To: gpfsug main discussion list 
Cc: gpfsug-discuss-boun...@spectrumscale.org, Agostino Funel 
Subject: [EXTERNAL] Re: [gpfsug-discuss] very low read performance in simple spectrum scale/gpfs cluster with a storage-server SAN
Date: Thu, Jun 11, 2020 23:42

Hi Giovanni, how do the waiters look on your clients when reading?

Mit freundlichen Grüßen / Kind regards

Dr. Uwe Falke
IT Specialist
Global Technology Services / Project Services Delivery / High Performance Computing
+49 175 575 2877 Mobile
Rathausstr. 7, 09111 Chemnitz, Germany
uwefa...@de.ibm.com

From:   Giovanni Bracco 
To:     gpfsug-discuss@spectrumscale.org
Cc:     Agostino Funel 
Date:   05/06/2020 14:22
Subject:        [EXTERNAL] [gpfsug-discuss] very low read performance in simple spectrum scale/gpfs cluster with a storage-server SAN
Sent by:        gpfsug-discuss-boun...@spectrumscale.org

Re: [gpfsug-discuss] very low read performance in simple spectrum scale/gpfs cluster with a storage-server SAN

2020-06-11 Thread Uwe Falke
Hi Giovanni, how do the waiters look on your clients when reading?
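
For example, while a slow read is running on a client, something like

mmdiag --waiters
mmdiag --iohist
mmdiag --network

on that client (and on the NSD servers) usually shows where the time is being
spent.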


Mit freundlichen Grüßen / Kind regards

Dr. Uwe Falke
IT Specialist
Global Technology Services / Project Services Delivery / High Performance 
Computing
+49 175 575 2877 Mobile
Rathausstr. 7, 09111 Chemnitz, Germany
uwefa...@de.ibm.com

IBM Services

IBM Data Privacy Statement

IBM Deutschland Business & Technology Services GmbH
Geschäftsführung: Dr. Thomas Wolter, Sven Schooss
Sitz der Gesellschaft: Ehningen
Registergericht: Amtsgericht Stuttgart, HRB 17122



From:   Giovanni Bracco 
To: gpfsug-discuss@spectrumscale.org
Cc: Agostino Funel 
Date:   05/06/2020 14:22
Subject:[EXTERNAL] [gpfsug-discuss] very low read performance in 
simple spectrum scale/gpfs cluster with a storage-server SAN
Sent by:gpfsug-discuss-boun...@spectrumscale.org



In our lab we have received two storage-servers, Supermicro 
SSG-6049P-E1CR24L, 24 HD each (9TB SAS3), with Avago 3108 RAID 
controller (2 GB cache) and before putting them in production for other 
purposes we have setup a small GPFS test cluster to verify if they can 
be used as storage (our gpfs production cluster has the licenses based 
on the NSD sockets, so it would be interesting to expand the storage 
size just by adding storage-servers in a infiniband based SAN, without 
changing the number of NSD servers)

The test cluster consists of:

1) two NSD servers (IBM x3550M2) with a dual port IB QDR TrueScale each.
2) a Mellanox FDR switch used as a SAN switch
3) a Truescale QDR switch as GPFS cluster switch
4) two GPFS clients (Supermicro AMD nodes) one port QDR each.

All the nodes run CentOS 7.7.

On each storage-server a RAID 6 volume of 11 disk, 80 TB, has been 
configured and it is exported via infiniband as an iSCSI target so that 
both appear as devices accessed by the srp_daemon on the NSD servers, 
where multipath (not really necessary in this case) has been configured 
for these two LIO-ORG devices.

GPFS version 5.0.4-0 has been installed and the RDMA has been properly 
configured

Two NSD disk have been created and a GPFS file system has been configured.

Very simple tests have been performed using lmdd serial write/read.
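
lmdd takes dd-like arguments; a sketch of the kind of serial write/read meant
here, with the mount point and sizes as placeholders ("internal" is lmdd's
built-in data source/sink):

lmdd if=internal of=/gexp2/lmdd.tst bs=1m count=102400 fsync=1
lmdd if=/gexp2/lmdd.tst of=internal bs=1m count=102400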

1) storage-server local performance: before configuring the RAID6 volume 
as NSD disk, a local xfs file system was created and lmdd write/read 
performance for 100 GB file was verified to be about 1 GB/s

2) once the GPFS cluster has been created write/read test have been 
performed directly from one of the NSD server at a time:

write performance 2 GB/s, read performance 1 GB/s for 100 GB file

By checking with iostat, it was observed that the I/O in this case 
involved only the NSD server where the test was performed, so when 
writing, the double of base performances was obtained,  while in reading 
the same performance as on a local file system, this seems correct.
Values are stable when the test is repeated.

3) when the same test is performed from the GPFS clients the lmdd result 
for a 100 GB file are:

write - 900 MB/s and stable, not too bad but half of what is seen from 
the NSD servers.

read - 30 MB/s to 300 MB/s: very low and unstable values

No tuning of any kind in all the configuration of the involved system, 
only default values.

Any suggestion to explain the very bad  read performance from a GPFS 
client?

Giovanni

here are the configuration of the virtual drive on the storage-server 
and the file system configuration in GPFS


Virtual drive
==

Virtual Drive: 2 (Target Id: 2)
Name:
RAID Level  : Primary-6, Secondary-0, RAID Level Qualifier-3
Size: 81.856 TB
Sector Size : 512
Is VD emulated  : Yes
Parity Size : 18.190 TB
State   : Optimal
Strip Size  : 256 KB
Number Of Drives: 11
Span Depth  : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if 
Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if 
Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disabled


GPFS file system from mmlsfs


mmlsfs vsd_gexp2
flag                value                    description
------------------- ------------------------ -----------------------------------
   -f                 8192                     Minimum fragment (subblock) size in bytes
   -i                 4096                     Inode size in bytes
   -I                 32768                    Indirect block size in bytes
   -m                 1                        Default number of metadata replicas
   -M                 2                        Maximum number of metadata replicas
   -r                 1                        Default number of data replicas
   -R                 2                        Maximum number of data replicas
   -j                 cluster                  Block allocation type
   -D                 nfs4                     File 

Re: [gpfsug-discuss] very low read performance in simple spectrum scale/gpfs cluster with a storage-server SAN

2020-06-11 Thread Uwe Falke
While that point (block size should be an integer multiple of the RAID 
stripe width) is a good one, its violation would explain slow writes, but 
Giovanni talks of slow reads ...

Mit freundlichen Grüßen / Kind regards

Dr. Uwe Falke
IT Specialist
Global Technology Services / Project Services Delivery / High Performance 
Computing
+49 175 575 2877 Mobile
Rathausstr. 7, 09111 Chemnitz, Germany
uwefa...@de.ibm.com

IBM Services

IBM Data Privacy Statement

IBM Deutschland Business & Technology Services GmbH
Geschäftsführung: Dr. Thomas Wolter, Sven Schooss
Sitz der Gesellschaft: Ehningen
Registergericht: Amtsgericht Stuttgart, HRB 17122



From:   "Luis Bolinches" 
To: "Giovanni Bracco" 
Cc: gpfsug main discussion list , 
agostino.fu...@enea.it
Date:   11/06/2020 16:11
Subject:    [EXTERNAL] Re: [gpfsug-discuss] very low read performance 
in simple spectrum scale/gpfs cluster with a storage-server SAN
Sent by:gpfsug-discuss-boun...@spectrumscale.org



8 data * 256K does not align to your 1MB 

Raid 6 is already not the best option for writes. I would look into use 
multiples of 2MB block sizes. 

--
Cheers

> On 11. Jun 2020, at 17.07, Giovanni Bracco  
wrote:
> 
> 256K
> 
> Giovanni
> 
>> On 11/06/20 10:01, Luis Bolinches wrote:
>> On that RAID 6 what is the logical RAID block size? 128K, 256K, other?
>> --
>> Ystävällisin terveisin / Kind regards / Saludos cordiales / Salutations 

>> / Salutacions
>> Luis Bolinches
>> Consultant IT Specialist
>> IBM Spectrum Scale development
>> ESS & client adoption teams
>> Mobile Phone: +358503112585
>> *https://www.youracclaim.com/user/luis-bolinches*
>> Ab IBM Finland Oy
>> Laajalahdentie 23
>> 00330 Helsinki
>> Uusimaa - Finland
>> 
>> *"If you always give you will always have" -- Anonymous*
>> 
>> - Original message -
>> From: Giovanni Bracco 
>> Sent by: gpfsug-discuss-boun...@spectrumscale.org
>> To: Jan-Frode Myklebust , gpfsug main discussion
>> list 
>> Cc: Agostino Funel 
>> Subject: [EXTERNAL] Re: [gpfsug-discuss] very low read performance
>> in simple spectrum scale/gpfs cluster with a storage-server SAN
>> Date: Thu, Jun 11, 2020 10:53
>> Comments and updates in the text:
>> 
>>> On 05/06/20 19:02, Jan-Frode Myklebust wrote:
>>> fre. 5. jun. 2020 kl. 15:53 skrev Giovanni Bracco
>>> mailto:giovanni.bra...@enea.it>>:
>>> 
>>> answer in the text
>>> 
>>>> On 05/06/20 14:58, Jan-Frode Myklebust wrote:
>>> >
>>> > Could maybe be interesting to drop the NSD servers, and
>> let all
>>> nodes
>>> > access the storage via srp ?
>>> 
>>> no we can not: the production clusters fabric is a mix of a
>> QDR based
>>> cluster and a OPA based cluster and NSD nodes provide the
>> service to
>>> both.
>>> 
>>> 
>>> You could potentially still do SRP from QDR nodes, and via NSD
>> for your
>>> omnipath nodes. Going via NSD seems like a bit pointless indirection.
>> 
>> not really: both clusters, the 400 OPA nodes and the 300 QDR nodes 
share
>> the same data lake in Spectrum Scale/GPFS so the NSD servers support 
the
>> flexibility of the setup.
>> 
>> NSD servers make use of a IB SAN fabric (Mellanox FDR switch) where at
>> the moment 3 different generations of DDN storages are connected,
>> 9900/QDR 7700/FDR and 7990/EDR. The idea was to be able to add some 
less
>> expensive storage, to be used when performance is not the first
>> priority.
>> 
>>> 
>>> 
>>> 
>>> >
>>> > Maybe turn off readahead, since it can cause performance
>> degradation
>>> > when GPFS reads 1 MB blocks scattered on the NSDs, so that
>>> read-ahead
>>> > always reads too much. This might be the cause of the slow
>> read
>>> seen —
>>> > maybe you’ll also overflow it if reading from both
>> NSD-servers at
>>> the
>>> > same time?
>>> 
>>> I have switched the readahead off and this produced a small
>> (~10%)
>>> increase of performances when reading from a NSD server, but
>> no change
>>> in the bad behaviour for the GPFS clients
>>> 
>>> 
>>> >
>>> >
>>> > Plus.. it’s always nice to give a bit more pagepool to the

Re: [gpfsug-discuss] very low read performance in simple spectrum scale/gpfs cluster with a storage-server SAN

2020-06-11 Thread Giovanni Bracco

256K

Giovanni

On 11/06/20 10:01, Luis Bolinches wrote:

On that RAID 6 what is the logical RAID block size? 128K, 256K, other?
--
Ystävällisin terveisin / Kind regards / Saludos cordiales / Salutations 
/ Salutacions

Luis Bolinches
Consultant IT Specialist
IBM Spectrum Scale development
ESS & client adoption teams
Mobile Phone: +358503112585
*https://www.youracclaim.com/user/luis-bolinches*
Ab IBM Finland Oy
Laajalahdentie 23
00330 Helsinki
Uusimaa - Finland

*"If you always give you will always have" --  Anonymous*

- Original message -
From: Giovanni Bracco 
Sent by: gpfsug-discuss-boun...@spectrumscale.org
To: Jan-Frode Myklebust , gpfsug main discussion
list 
Cc: Agostino Funel 
Subject: [EXTERNAL] Re: [gpfsug-discuss] very low read performance
in simple spectrum scale/gpfs cluster with a storage-server SAN
Date: Thu, Jun 11, 2020 10:53
Comments and updates in the text:

On 05/06/20 19:02, Jan-Frode Myklebust wrote:
 > fre. 5. jun. 2020 kl. 15:53 skrev Giovanni Bracco
 > mailto:giovanni.bra...@enea.it>>:
 >
 >     answer in the text
 >
 >     On 05/06/20 14:58, Jan-Frode Myklebust wrote:
 >      >
 >      > Could maybe be interesting to drop the NSD servers, and
let all
 >     nodes
 >      > access the storage via srp ?
 >
 >     no we can not: the production clusters fabric is a mix of a
QDR based
 >     cluster and a OPA based cluster and NSD nodes provide the
service to
 >     both.
 >
 >
 > You could potentially still do SRP from QDR nodes, and via NSD
for your
 > omnipath nodes. Going via NSD seems like a bit pointless indirection.

not really: both clusters, the 400 OPA nodes and the 300 QDR nodes share
the same data lake in Spectrum Scale/GPFS so the NSD servers support the
flexibility of the setup.

NSD servers make use of a IB SAN fabric (Mellanox FDR switch) where at
the moment 3 different generations of DDN storages are connected,
9900/QDR 7700/FDR and 7990/EDR. The idea was to be able to add some less
expensive storage, to be used when performance is not the first
priority.

 >
 >
 >
 >      >
 >      > Maybe turn off readahead, since it can cause performance
degradation
 >      > when GPFS reads 1 MB blocks scattered on the NSDs, so that
 >     read-ahead
 >      > always reads too much. This might be the cause of the slow
read
 >     seen —
 >      > maybe you’ll also overflow it if reading from both
NSD-servers at
 >     the
 >      > same time?
 >
 >     I have switched the readahead off and this produced a small
(~10%)
 >     increase of performances when reading from a NSD server, but
no change
 >     in the bad behaviour for the GPFS clients
 >
 >
 >      >
 >      >
 >      > Plus.. it’s always nice to give a bit more pagepool to the
 >     clients than
 >      > the default.. I would prefer to start with 4 GB.
 >
 >     we'll do also that and we'll let you know!
 >
 >
 > Could you show your mmlsconfig? Likely you should set maxMBpS to
 > indicate what kind of throughput a client can do (affects GPFS
 > readahead/writebehind).  Would typically also increase
workerThreads on
 > your NSD servers.

At this moment this is the output of mmlsconfig

# mmlsconfig
Configuration data for cluster GPFSEXP.portici.enea.it:
---
clusterName GPFSEXP.portici.enea.it
clusterId 13274694257874519577
autoload no
dmapiFileHandleSize 32
minReleaseLevel 5.0.4.0
ccrEnabled yes
cipherList AUTHONLY
verbsRdma enable
verbsPorts qib0/1
[cresco-gpfq7,cresco-gpfq8]
verbsPorts qib0/2
[common]
pagepool 4G
adminMode central

File systems in cluster GPFSEXP.portici.enea.it:

/dev/vsd_gexp2
/dev/vsd_gexp3


 >
 >
 > 1 MB blocksize is a bit bad for your 9+p+q RAID with 256 KB strip
size.
 > When you write one GPFS block, less than a half RAID stripe is
written,
 > which means you  need to read back some data to calculate new
parities.
 > I would prefer 4 MB block size, and maybe also change to 8+p+q so
that
 > one GPFS is a multiple of a full 2 MB stripe.
 >
 >
 >     -jf

we have now added another file system based on 2 NSD on RAID6 8+p+q,
keeping the 1MB block size just not to change too many things at the
same time, but no substantial change in very low readout performances,
that are still of the order

Re: [gpfsug-discuss] very low read performance in simple spectrum scale/gpfs cluster with a storage-server SAN

2020-06-11 Thread Jan-Frode Myklebust
On Thu, Jun 11, 2020 at 9:53 AM Giovanni Bracco 
wrote:

>
> >
> > You could potentially still do SRP from QDR nodes, and via NSD for your
> > omnipath nodes. Going via NSD seems like a bit pointless indirection.
>
> not really: both clusters, the 400 OPA nodes and the 300 QDR nodes share
> the same data lake in Spectrum Scale/GPFS so the NSD servers support the
> flexibility of the setup.
>

Maybe there's something I don't understand, but couldn't you use the
NSD-servers to serve to your
OPA nodes, and then SRP directly for your 300 QDR-nodes??


> At this moment this is the output of mmlsconfig
>
> # mmlsconfig
> Configuration data for cluster GPFSEXP.portici.enea.it:
> ---
> clusterName GPFSEXP.portici.enea.it
> clusterId 13274694257874519577
> autoload no
> dmapiFileHandleSize 32
> minReleaseLevel 5.0.4.0
> ccrEnabled yes
> cipherList AUTHONLY
> verbsRdma enable
> verbsPorts qib0/1
> [cresco-gpfq7,cresco-gpfq8]
> verbsPorts qib0/2
> [common]
> pagepool 4G
> adminMode central
>
> File systems in cluster GPFSEXP.portici.enea.it:
> 
> /dev/vsd_gexp2
> /dev/vsd_gexp3
>
>

So, trivially close to default config.. assume the same for the client
cluster.

I would correct MaxMBpS -- put it at something reasonable, enable
verbsRdmaSend=yes and
ignorePrefetchLUNCount=yes.



>
> >
> >
> > 1 MB blocksize is a bit bad for your 9+p+q RAID with 256 KB strip size.
> > When you write one GPFS block, less than a half RAID stripe is written,
> > which means you  need to read back some data to calculate new parities.
> > I would prefer 4 MB block size, and maybe also change to 8+p+q so that
> > one GPFS is a multiple of a full 2 MB stripe.
> >
> >
> > -jf
>
> we have now added another file system based on 2 NSD on RAID6 8+p+q,
> keeping the 1MB block size just not to change too many things at the
> same time, but no substantial change in very low readout performances,
> that are still of the order of 50 MB/s while write performance are 1000MB/s
>
> Any other suggestion is welcomed!
>
>

Maybe rule out the storage, and check if you get proper throughput from
nsdperf?

Maybe also benchmark using "gpfsperf" instead of "lmdd", and show your full
settings -- so that
we see that the benchmark is sane :-)



  -jf
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] very low read performance in simple spectrum scale/gpfs cluster with a storage-server SAN

2020-06-11 Thread Jonathan Buzzard

On 11/06/2020 08:53, Giovanni Bracco wrote:

[SNIP]

not really: both clusters, the 400 OPA nodes and the 300 QDR nodes share 
the same data lake in Spectrum Scale/GPFS so the NSD servers support the 
flexibility of the setup.


NSD servers make use of a IB SAN fabric (Mellanox FDR switch) where at 
the moment 3 different generations of DDN storages are connected, 
9900/QDR 7700/FDR and 7990/EDR. The idea was to be able to add some less 
expensive storage, to be used when performance is not the first priority.




Ring up Lenovo and get a pricing on some DSS-G storage :-)

They can be configured with OPA and Infiniband (though I am not sure if 
both at the same time) and are only slightly more expensive than the 
traditional DIY Lego brick approach.


JAB.

--
Jonathan A. Buzzard Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] very low read performance in simple spectrum scale/gpfs cluster with a storage-server SAN

2020-06-11 Thread Luis Bolinches
On that RAID 6 what is the logical RAID block size? 128K, 256K, other?
--
Ystävällisin terveisin / Kind regards / Saludos cordiales / Salutations / Salutacions
Luis Bolinches
Consultant IT Specialist
IBM Spectrum Scale development
ESS & client adoption teams
Mobile Phone: +358503112585
 
https://www.youracclaim.com/user/luis-bolinches
 
Ab IBM Finland Oy
Laajalahdentie 23
00330 Helsinki
Uusimaa - Finland

"If you always give you will always have" --  Anonymous
 
 
 
- Original message -
From: Giovanni Bracco 
Sent by: gpfsug-discuss-boun...@spectrumscale.org
To: Jan-Frode Myklebust , gpfsug main discussion list 
Cc: Agostino Funel 
Subject: [EXTERNAL] Re: [gpfsug-discuss] very low read performance in simple spectrum scale/gpfs cluster with a storage-server SAN
Date: Thu, Jun 11, 2020 10:53

Comments and updates in the text:
 
Ellei edellä ole toisin mainittu: / Unless stated otherwise above:
Oy IBM Finland Ab
PL 265, 00101 Helsinki, Finland
Business ID, Y-tunnus: 0195876-3 
Registered in Finland


___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] very low read performance in simple spectrum scale/gpfs cluster with a storage-server SAN

2020-06-11 Thread Giovanni Bracco

Comments and updates in the text:

On 05/06/20 19:02, Jan-Frode Myklebust wrote:
On Fri, 5 Jun 2020 at 15:53, Giovanni Bracco  wrote:


answer in the text

On 05/06/20 14:58, Jan-Frode Myklebust wrote:
 >
 > Could maybe be interesting to drop the NSD servers, and let all
nodes
 > access the storage via srp ?

no we can not: the production clusters fabric is a mix of a QDR based
cluster and a OPA based cluster and NSD nodes provide the service to
both.


You could potentially still do SRP from QDR nodes, and via NSD for your 
omnipath nodes. Going via NSD seems like a bit pointless indirection.


not really: both clusters, the 400 OPA nodes and the 300 QDR nodes share 
the same data lake in Spectrum Scale/GPFS so the NSD servers support the 
flexibility of the setup.


NSD servers make use of a IB SAN fabric (Mellanox FDR switch) where at 
the moment 3 different generations of DDN storages are connected, 
9900/QDR 7700/FDR and 7990/EDR. The idea was to be able to add some less 
expensive storage, to be used when performance is not the first priority.






 >
 > Maybe turn off readahead, since it can cause performance degradation
 > when GPFS reads 1 MB blocks scattered on the NSDs, so that
read-ahead
 > always reads too much. This might be the cause of the slow read
seen —
 > maybe you’ll also overflow it if reading from both NSD-servers at
the
 > same time?

I have switched the readahead off and this produced a small (~10%)
increase of performances when reading from a NSD server, but no change
in the bad behaviour for the GPFS clients


 >
 >
 >      > Plus.. it’s always nice to give a bit more pagepool to the
clients than
 > the default.. I would prefer to start with 4 GB.

we'll do also that and we'll let you know!


Could you show your mmlsconfig? Likely you should set maxMBpS to 
indicate what kind of throughput a client can do (affects GPFS 
readahead/writebehind).  Would typically also increase workerThreads on 
your NSD servers.


At this moment this is the output of mmlsconfig

# mmlsconfig
Configuration data for cluster GPFSEXP.portici.enea.it:
---
clusterName GPFSEXP.portici.enea.it
clusterId 13274694257874519577
autoload no
dmapiFileHandleSize 32
minReleaseLevel 5.0.4.0
ccrEnabled yes
cipherList AUTHONLY
verbsRdma enable
verbsPorts qib0/1
[cresco-gpfq7,cresco-gpfq8]
verbsPorts qib0/2
[common]
pagepool 4G
adminMode central

File systems in cluster GPFSEXP.portici.enea.it:

/dev/vsd_gexp2
/dev/vsd_gexp3





1 MB blocksize is a bit bad for your 9+p+q RAID with 256 KB strip size. 
When you write one GPFS block, less than a half RAID stripe is written, 
which means you  need to read back some data to calculate new parities. 
I would prefer 4 MB block size, and maybe also change to 8+p+q so that 
one GPFS block is a multiple of a full 2 MB stripe.



    -jf


we have now added another file system based on 2 NSDs on RAID6 8+p+q, 
keeping the 1 MB block size just not to change too many things at the 
same time, but there is no substantial change in the very low readout 
performance, which is still of the order of 50 MB/s while write performance is 1000 MB/s


Any other suggestion is welcomed!

Giovanni



--
Giovanni Bracco
phone  +39 351 8804788
E-mail  giovanni.bra...@enea.it
WWW http://www.afs.enea.it/bracco
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] very low read performance in simple spectrum scale/gpfs cluster with a storage-server SAN

2020-06-05 Thread Jan-Frode Myklebust
On Fri, 5 Jun 2020 at 15:53, Giovanni Bracco wrote:

> answer in the text
>
> On 05/06/20 14:58, Jan-Frode Myklebust wrote:
> >
> > Could maybe be interesting to drop the NSD servers, and let all nodes
> > access the storage via srp ?
>
> no we can not: the production clusters fabric is a mix of a QDR based
> cluster and a OPA based cluster and NSD nodes provide the service to both.
>

You could potentially still do SRP from QDR nodes, and via NSD for your
omnipath nodes. Going via NSD seems like a bit pointless indirection.



> >
> > Maybe turn off readahead, since it can cause performance degradation
> > when GPFS reads 1 MB blocks scattered on the NSDs, so that read-ahead
> > always reads too much. This might be the cause of the slow read seen —
> > maybe you’ll also overflow it if reading from both NSD-servers at the
> > same time?
>
> I have switched the readahead off and this produced a small (~10%)
> increase of performances when reading from a NSD server, but no change
> in the bad behaviour for the GPFS clients


> >
> >
> > Plus.. it’s always nice to give a bit more pagepool to hhe clients than
> > the default.. I would prefer to start with 4 GB.
>
> we'll do also that and we'll let you know!


Could you show your mmlsconfig? Likely you should set maxMBpS to indicate
what kind of throughput a client can do (affects GPFS
readahead/writebehind).  Would typically also increase workerThreads on
your NSD servers.


1 MB blocksize is a bit bad for your 9+p+q RAID with 256 KB strip size.
When you write one GPFS block, less than half a RAID stripe is written,
which means you need to read back some data to calculate new parities. I
would prefer 4 MB block size, and maybe also change to 8+p+q so that one
GPFS block is a multiple of a full 2 MB stripe.
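
To spell the arithmetic out: with 9 data disks and a 256 KB strip, a full
stripe is 9 x 256 KB = 2304 KB, so a 1 MB GPFS block covers less than half a
stripe and each block write turns into a read-modify-write of parity; with 8
data disks the full stripe is 8 x 256 KB = 2048 KB = 2 MB, and a 4 MB block
maps onto exactly two full stripes. The block size is fixed at file system
creation, so trying it means creating a new file system, roughly (device and
stanza file names here are placeholders):

mmcrfs gexp4 -F nsd_stanza.txt -B 4M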


   -jf
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] very low read performance in simple spectrum scale/gpfs cluster with a storage-server SAN

2020-06-05 Thread Giovanni Bracco

answer in the text

On 05/06/20 14:58, Jan-Frode Myklebust wrote:


Could maybe be interesting to drop the NSD servers, and let all nodes 
access the storage via srp ?


no we can not: the production clusters fabric is a mix of a QDR based 
cluster and a OPA based cluster and NSD nodes provide the service to both.




Maybe turn off readahead, since it can cause performance degradation 
when GPFS reads 1 MB blocks scattered on the NSDs, so that read-ahead 
always reads too much. This might be the cause of the slow read seen — 
maybe you’ll also overflow it if reading from both NSD-servers at the 
same time?


I have switched the readahead off and this produced a small (~10%) 
increase in performance when reading from an NSD server, but no change 
in the bad behaviour for the GPFS clients
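
Besides the controller's own ReadAhead policy, one layer that is quick to
check on the NSD servers is the Linux block-device read-ahead on the
multipath devices, e.g. (device names are examples):

blockdev --getra /dev/mapper/mpatha
blockdev --setra 0 /dev/mapper/mpatha   # value in 512-byte sectors; 0 disables it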





Plus.. it’s always nice to give a bit more pagepool to the clients than 
the default.. I would prefer to start with 4 GB.


we'll do also that and we'll let you know!

Giovanni





   -jf

On Fri, 5 Jun 2020 at 14:22, Giovanni Bracco  wrote:


In our lab we have received two storage-servers, Supermicro
SSG-6049P-E1CR24L, 24 HD each (9TB SAS3), with Avago 3108 RAID
controller (2 GB cache) and before putting them in production for other
purposes we have setup a small GPFS test cluster to verify if they can
be used as storage (our gpfs production cluster has the licenses based
on the NSD sockets, so it would be interesting to expand the storage
size just by adding storage-servers in a infiniband based SAN, without
changing the number of NSD servers)

The test cluster consists of:

1) two NSD servers (IBM x3550M2) with a dual port IB QDR TrueScale
each.
2) a Mellanox FDR switch used as a SAN switch
3) a Truescale QDR switch as GPFS cluster switch
4) two GPFS clients (Supermicro AMD nodes) one port QDR each.

All the nodes run CentOS 7.7.

On each storage-server a RAID 6 volume of 11 disk, 80 TB, has been
configured and it is exported via infiniband as an iSCSI target so that
both appear as devices accessed by the srp_daemon on the NSD servers,
where multipath (not really necessary in this case) has been configured
for these two LIO-ORG devices.

GPFS version 5.0.4-0 has been installed and the RDMA has been properly
configured

Two NSD disk have been created and a GPFS file system has been
configured.

Very simple tests have been performed using lmdd serial write/read.

1) storage-server local performance: before configuring the RAID6
volume
as NSD disk, a local xfs file system was created and lmdd write/read
performance for 100 GB file was verified to be about 1 GB/s

2) once the GPFS cluster has been created write/read test have been
performed directly from one of the NSD server at a time:

write performance 2 GB/s, read performance 1 GB/s for 100 GB file

By checking with iostat, it was observed that the I/O in this case
involved only the NSD server where the test was performed, so when
writing, the double of base performances was obtained,  while in
reading
the same performance as on a local file system, this seems correct.
Values are stable when the test is repeated.

3) when the same test is performed from the GPFS clients the lmdd
result
for a 100 GB file are:

write - 900 MB/s and stable, not too bad but half of what is seen from
the NSD servers.

read - 30 MB/s to 300 MB/s: very low and unstable values

No tuning of any kind in all the configuration of the involved system,
only default values.

Any suggestion to explain the very bad  read performance from a GPFS
client?

Giovanni

here are the configuration of the virtual drive on the storage-server
and the file system configuration in GPFS


Virtual drive
==

Virtual Drive: 2 (Target Id: 2)
Name                :
RAID Level          : Primary-6, Secondary-0, RAID Level Qualifier-3
Size                : 81.856 TB
Sector Size         : 512
Is VD emulated      : Yes
Parity Size         : 18.190 TB
State               : Optimal
Strip Size          : 256 KB
Number Of Drives    : 11
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if
Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if
Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disabled


GPFS file system from mmlsfs


mmlsfs vsd_gexp2
flag                value                    description
------------------- ------------------------ -----------------------------------
   -f                 8192                     Minimum fragment (subblock) size in bytes
   -i                 4096                     Inode 

Re: [gpfsug-discuss] very low read performance in simple spectrum scale/gpfs cluster with a storage-server SAN

2020-06-05 Thread Jan-Frode Myklebust
Could maybe be interesting to drop the NSD servers, and let all nodes
access the storage via srp ?

Maybe turn off readahead, since it can cause performance degradation when
GPFS reads 1 MB blocks scattered on the NSDs, so that read-ahead always
reads too much. This might be the cause of the slow read seen — maybe
you’ll also overflow it if reading from both NSD-servers at the same time?


Plus.. it’s always nice to give a bit more pagepool to the clients than the
default.. I would prefer to start with 4 GB.



  -jf

On Fri, 5 Jun 2020 at 14:22, Giovanni Bracco wrote:

> In our lab we have received two storage-servers, Supermicro
> SSG-6049P-E1CR24L, 24 HD each (9TB SAS3), with Avago 3108 RAID
> controller (2 GB cache) and before putting them in production for other
> purposes we have setup a small GPFS test cluster to verify if they can
> be used as storage (our gpfs production cluster has the licenses based
> on the NSD sockets, so it would be interesting to expand the storage
> size just by adding storage-servers in a infiniband based SAN, without
> changing the number of NSD servers)
>
> The test cluster consists of:
>
> 1) two NSD servers (IBM x3550M2) with a dual port IB QDR TrueScale each.
> 2) a Mellanox FDR switch used as a SAN switch
> 3) a Truescale QDR switch as GPFS cluster switch
> 4) two GPFS clients (Supermicro AMD nodes) one port QDR each.
>
> All the nodes run CentOS 7.7.
>
> On each storage-server a RAID 6 volume of 11 disk, 80 TB, has been
> configured and it is exported via infiniband as an iSCSI target so that
> both appear as devices accessed by the srp_daemon on the NSD servers,
> where multipath (not really necessary in this case) has been configured
> for these two LIO-ORG devices.
>
> GPFS version 5.0.4-0 has been installed and the RDMA has been properly
> configured
>
> Two NSD disk have been created and a GPFS file system has been configured.
>
> Very simple tests have been performed using lmdd serial write/read.
>
> 1) storage-server local performance: before configuring the RAID6 volume
> as NSD disk, a local xfs file system was created and lmdd write/read
> performance for 100 GB file was verified to be about 1 GB/s
>
> 2) once the GPFS cluster has been created write/read test have been
> performed directly from one of the NSD server at a time:
>
> write performance 2 GB/s, read performance 1 GB/s for 100 GB file
>
> By checking with iostat, it was observed that the I/O in this case
> involved only the NSD server where the test was performed, so when
> writing, the double of base performances was obtained,  while in reading
> the same performance as on a local file system, this seems correct.
> Values are stable when the test is repeated.
>
> 3) when the same test is performed from the GPFS clients the lmdd result
> for a 100 GB file are:
>
> write - 900 MB/s and stable, not too bad but half of what is seen from
> the NSD servers.
>
> read - 30 MB/s to 300 MB/s: very low and unstable values
>
> No tuning of any kind in all the configuration of the involved system,
> only default values.
>
> Any suggestion to explain the very bad  read performance from a GPFS
> client?
>
> Giovanni
>
> here are the configuration of the virtual drive on the storage-server
> and the file system configuration in GPFS
>
>
> Virtual drive
> ==
>
> Virtual Drive: 2 (Target Id: 2)
> Name:
> RAID Level  : Primary-6, Secondary-0, RAID Level Qualifier-3
> Size: 81.856 TB
> Sector Size : 512
> Is VD emulated  : Yes
> Parity Size : 18.190 TB
> State   : Optimal
> Strip Size  : 256 KB
> Number Of Drives: 11
> Span Depth  : 1
> Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if
> Bad BBU
> Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if
> Bad BBU
> Default Access Policy: Read/Write
> Current Access Policy: Read/Write
> Disk Cache Policy   : Disabled
>
>
> GPFS file system from mmlsfs
> 
>
> mmlsfs vsd_gexp2
> flag                value                    description
> ------------------- ------------------------ -----------------------------------
>   -f                 8192                     Minimum fragment (subblock) size in bytes
>   -i                 4096                     Inode size in bytes
>   -I                 32768                    Indirect block size in bytes
>   -m                 1                        Default number of metadata replicas
>   -M                 2                        Maximum number of metadata replicas
>   -r                 1                        Default number of data replicas
>   -R                 2                        Maximum number of data replicas
>   -j                 cluster                  Block allocation type
>   -D                 nfs4                     File locking semantics in effect
>   -k                 all                      ACL semantics in effect