Re: [gpfsug-discuss] pagepool shrink doesn't release all memory

2018-02-22 Thread Aaron Knister
This is also interesting (although I don't know what it really means). 
Looking at pmap run against mmfsd I can see what happens after each step:


# baseline
7fffe4639000  59164K  0K  0K  0K  0K ---p [anon]
7fffd837e000  61960K  0K  0K  0K  0K ---p [anon]
0200 1048576K 1048576K 1048576K 1048576K  0K rwxp [anon]
Total:   1613580K 1191020K 1189650K 1171836K  0K

# tschpool 64G
7fffe4639000  59164K  0K  0K  0K  0K ---p [anon]
7fffd837e000  61960K  0K  0K  0K  0K ---p [anon]
0200 67108864K 67108864K 67108864K 67108864K  0K rwxp [anon]
Total:   67706636K 67284108K 67282625K 67264920K  0K

# tschpool 1G
7fffe4639000  59164K  0K  0K  0K  0K ---p [anon]
7fffd837e000  61960K  0K  0K  0K  0K ---p [anon]
02000140 139264K 139264K 139264K 139264K  0K rwxp [anon]
020fc940 897024K 897024K 897024K 897024K  0K rwxp [anon]
020009c0 66052096K  0K  0K  0K  0K rwxp [anon]
Total:   67706636K 1223820K 1222451K 1204632K  0K

Even though mmfsd still has that 64G chunk mapped, none of it is 
*used*. I wonder why Linux seems to be accounting it as allocated.
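
A rough way to capture the same numbers at each step (a sketch only; the tschpool 
invocation mirrors the steps above, and the path and the 5-second settle are 
assumptions):

#!/bin/bash
# Record mmfsd memory accounting after each pagepool resize step.
PID=$(pgrep -x mmfsd | head -n 1)
for size in 64G 1G 32G 1G 32G; do
    /usr/lpp/mmfs/bin/tschpool "$size"      # resize the pagepool, as in the steps above
    sleep 5                                 # give mmfsd a moment to settle
    echo "== pagepool=$size =="
    ps -o vsz=,rss= -p "$PID"               # virtual vs resident size of mmfsd
    pmap -x "$PID" | tail -n 1              # pmap totals for the same process
    free -g | awk 'NR==2 {print "used (GB): " $3}'
done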


-Aaron

On 2/22/18 10:17 PM, Aaron Knister wrote:
I've been exploring the idea for a while of writing a SLURM SPANK plugin 
to allow users to dynamically change the pagepool size on a node. Every 
now and then we have some users who would benefit significantly from a 
much larger pagepool on compute nodes but by default keep it on the 
smaller side to make as much physmem available as possible to batch work.


In testing, though, it seems as though reducing the pagepool doesn't 
quite release all of the memory. I don't really understand it because 
I've never before seen memory that was previously resident become 
un-resident but still maintain the virtual memory allocation.


Here's what I mean. Let's take a node with 128G and a 1G pagepool.

If I do the following to simulate what might happen as various jobs 
tweak the pagepool:


- tschpool 64G
- tschpool 1G
- tschpool 32G
- tschpool 1G
- tschpool 32G

I end up with this:

mmfsd thinks there's 32G resident but 64G virt
# ps -o vsz,rss,comm -p 24397
    VSZ   RSS COMMAND
67589400 33723236 mmfsd

However, Linux thinks there's ~100G used:

# free -g
             total       used       free     shared    buffers     cached
Mem:           125        100         25          0          0          0
-/+ buffers/cache:          98         26
Swap:            7          0          7

I can jump back and forth between 1G and 32G *after* allocating 64G 
pagepool and the overall amount of memory in use doesn't balloon but I 
can't seem to shed that original 64G.


I don't understand what's going on... :) Any ideas? This is with Scale 
4.2.3.6.


-Aaron



--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] pagepool shrink doesn't release all memory

2018-02-22 Thread Aaron Knister
I've been exploring the idea for a while of writing a SLURM SPANK plugin 
to allow users to dynamically change the pagepool size on a node. Every 
now and then we have some users who would benefit significantly from a 
much larger pagepool on compute nodes but by default keep it on the 
smaller side to make as much physmem available as possible to batch work.
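
As a stopgap while the SPANK plugin idea takes shape, the per-job resize could also 
be driven from a job prolog/epilog pair, roughly like this (a hedged sketch, not the 
plugin itself; the --comment convention and paths are placeholders, and it assumes 
the pagepool change takes effect with mmchconfig -i):

#!/bin/bash
# Hypothetical SLURM prolog sketch: enlarge the pagepool for jobs that request it.
REQUEST=${SLURM_JOB_COMMENT:-}              # e.g. the user submits with --comment=pagepool=64G
if [[ "$REQUEST" == pagepool=* ]]; then
    SIZE=${REQUEST#pagepool=}
    /usr/lpp/mmfs/bin/mmchconfig pagepool="$SIZE" -i -N "$(hostname -s)"   # assumes node names are short hostnames
fi
# A matching epilog would set pagepool back to the small default (e.g. 1G) when the job ends.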


In testing, though, it seems as though reducing the pagepool doesn't 
quite release all of the memory. I don't really understand it because 
I've never before seen memory that was previously resident become 
un-resident but still maintain the virtual memory allocation.


Here's what I mean. Let's take a node with 128G and a 1G pagepool.

If I do the following to simulate what might happen as various jobs 
tweak the pagepool:


- tschpool 64G
- tschpool 1G
- tschpool 32G
- tschpool 1G
- tschpool 32G

I end up with this:

mmfsd thinks there's 32G resident but 64G virt
# ps -o vsz,rss,comm -p 24397
   VSZ   RSS COMMAND
67589400 33723236 mmfsd

However, Linux thinks there's ~100G used:

# free -g
             total       used       free     shared    buffers     cached
Mem:           125        100         25          0          0          0
-/+ buffers/cache:          98         26
Swap:            7          0          7

I can jump back and forth between 1G and 32G *after* allocating 64G 
pagepool and the overall amount of memory in use doesn't balloon but I 
can't seem to shed that original 64G.


I don't understand what's going on... :) Any ideas? This is with Scale 
4.2.3.6.


-Aaron

--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] GPFS and Flash/SSD Storage tiered storage

2018-02-22 Thread valleru
Thanks, I will try the file heat feature, but I am really not sure whether it would 
work, since the code can access cold files too, not just recently accessed/hot 
files.

With respect to LROC, let me explain the use case:

As a first step, the code reads headers (a small region of data) from thousands of 
files - for example about 30,000 files, each roughly 300MB to 500MB in size.
After that first step, it uses those headers to mmap/seek across various regions of a 
set of files in parallel.
Since this is all small IO and reading over the network directly from the GPFS disks 
was really slow, our idea was to use AFM, which I believe fetches all of a file's data 
into flash/SSD once the initial few blocks of the file are read.
But again, AFM does not seem to solve the problem, so I want to know whether LROC 
behaves the same way as AFM - prefetching all of the file's data in full blocks, using 
all the worker threads, once a few blocks of the file have been read.
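
One way to check whether LROC is actually being exercised by this workload is to 
snapshot its statistics around the header-reading step (a minimal sketch; the path 
assumes a standard installation):

# Snapshot LROC statistics before and after the header-reading step, then compare.
/usr/lpp/mmfs/bin/mmdiag --lroc > /tmp/lroc.before
# ... run the header-reading step of the workload here ...
/usr/lpp/mmfs/bin/mmdiag --lroc > /tmp/lroc.after
diff /tmp/lroc.before /tmp/lroc.after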

Thanks,
Lohit

On Feb 22, 2018, 4:52 PM -0500, IBM Spectrum Scale , wrote:
> My apologies for not being more clear on the flash storage pool.  I meant 
> that this would be just another GPFS storage pool in the same cluster, so no 
> separate AFM cache cluster.  You would then use the file heat feature to 
> ensure more frequently accessed files are migrated to that all flash storage 
> pool.
>
> As for LROC could you please clarify what you mean by a few headers/stubs of 
> the file?  In reading the LROC documentation and the LROC variables available 
> in the mmchconfig command I think you might want to take a look a the 
> lrocDataStubFileSize variable since it seems to apply to your situation.
>
> Regards, The Spectrum Scale (GPFS) team
>
> --
> If you feel that your question can benefit other users of  Spectrum Scale 
> (GPFS), then please post it to the public IBM developerWorks Forum at 
> https://www.ibm.com/developerworks/community/forums/html/forum?id=----0479.
>
> If your query concerns a potential software error in Spectrum Scale (GPFS) 
> and you have an IBM software maintenance contract please contact  
> 1-800-237-5511 in the United States or your local IBM Service Center in other 
> countries.
>
> The forum is informally monitored as time permits and should not be used for 
> priority messages to the Spectrum Scale (GPFS) team.
>
>
>
> From:        vall...@cbio.mskcc.org
> To:        gpfsug main discussion list 
> Cc:        gpfsug-discuss-boun...@spectrumscale.org
> Date:        02/22/2018 04:21 PM
> Subject:        Re: [gpfsug-discuss] GPFS and Flash/SSD Storage tiered storage
> Sent by:        gpfsug-discuss-boun...@spectrumscale.org
>
>
>
> Thank you.
>
> I am sorry if I was not clear, but the metadata pool is all on SSDs in the 
> GPFS clusters that we use. It is just the data pool that is on near-line 
> rotating disks.
> I understand that AFM might not be able to solve the issue, and I will try 
> and see if file heat works for migrating the files to the flash tier.
> You mentioned an all flash storage pool for heavily used files - do you mean 
> a different GPFS cluster with just flash storage, and manually copying the 
> files to flash storage whenever needed?
> The IO performance I am talking about is predominantly for reads. So are you 
> saying that LROC can work the way I want it to, that is, prefetch whole files 
> into the LROC cache after only a few headers/stubs of data are read from 
> those files?
> I thought LROC only keeps the blocks of data that are fetched from the 
> disk, and will not prefetch the whole file if a stub of data is read.
> Please do let me know if I have understood it wrong.
>
> On Feb 22, 2018, 4:08 PM -0500, IBM Spectrum Scale , wrote:
> I do not think AFM is intended to solve the problem you are trying to solve.  
> If I understand your scenario correctly you state that you are placing 
> metadata on NL-SAS storage.  If that is true that would not be wise 
> especially if you are going to do many metadata operations.  I suspect your 
> performance issues are partially due to the fact that metadata is being 
> stored on NL-SAS storage.  You stated that you did not think the file heat 
> feature would do what you intended but have you tried to use it to see if it 
> could solve your problem?  I would think having metadata on SSD/flash storage 
> combined with an all flash storage pool for your heavily used files would 
> perform well.  If you expect IO usage will be such that there will be far 
> more reads than writes then LROC should be beneficial to your overall 
> performance.
>
> Regards, The Spectrum Scale (GPFS) team
>
> 

Re: [gpfsug-discuss] GPFS, MMAP and Pagepool

2018-02-22 Thread valleru
Thanks a lot Sven.
I was trying out all the scenarios that Ray mentioned with respect to LROC and an 
all-flash GPFS cluster, and nothing seemed to be effective.

As of now, we are deploying a new test cluster on GPFS 5.0 and it would be good 
to know the respective features that could be enabled and see if it improves 
anything.

On the other side, I have seen various cases in my past 6 years with GPFS 
where different tools frequently use mmap. This dates back to 2013 
(http://www.spectrumscale.org/pipermail/gpfsug-discuss/2013-May/000253.html), when 
one of my colleagues asked the same question. At that time it was a homegrown 
application that was using mmap, along with a few other genomic pipelines.
A year ago, we had an issue with mmap and a lot of threads where GPFS would just 
hang without any traces or logs; that was fixed recently. It was related to 
relion:
https://sbgrid.org/software/titles/relion

The issue we are seeing now is with ML/DL workloads, and involves external tools 
such as openslide (http://openslide.org/) and pytorch (http://pytorch.org/), with 
the field of application being deep learning on thousands of image patches.

The IO is really slow when read from hard disk, and so I was trying out other 
options such as LROC and a flash cluster/AFM cluster. But everything has a 
limitation, as Ray mentioned.

Thanks,
Lohit

On Feb 22, 2018, 3:59 PM -0500, Sven Oehme , wrote:
> Hi Lohit,
>
> i am working with ray on a mmap performance improvement right now, which most 
> likely has the same root cause as yours , see -->  
> http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html
> the thread above is silent after a couple of back and forth, but ray and i 
> have active communication in the background and will repost as soon as there 
> is something new to share.
> i am happy to look at this issue after we finish with ray's workload if there 
> is something missing, but first let's finish his, get you try the same fix 
> and see if there is something missing.
>
> btw. if people would share their use of MMAP , what applications they use 
> (home grown, just use lmdb which uses mmap under the cover, etc) please let 
> me know so i get a better picture on how wide the usage is with GPFS. i know 
> a lot of the ML/DL workloads are using it, but i would like to know what else 
> is out there i might not think about. feel free to drop me a personal note, i 
> might not reply to it right away, but eventually.
>
> thx. sven
>
>
> > On Thu, Feb 22, 2018 at 12:33 PM  wrote:
> > > Hi all,
> > >
> > > I wanted to know, how does mmap interact with GPFS pagepool with respect 
> > > to filesystem block-size?
> > > Does the efficiency depend on the mmap read size and the block-size of 
> > > the filesystem even if all the data is cached in pagepool?
> > >
> > > GPFS 4.2.3.2 and CentOS7.
> > >
> > > Here is what i observed:
> > >
> > > I was testing a user script that uses mmap to read from 100M to 500MB 
> > > files.
> > >
> > > The above files are stored on 3 different filesystems.
> > >
> > > Compute nodes - 10G pagepool and 5G seqdiscardthreshold.
> > >
> > > 1. 4M block size GPFS filesystem, with separate metadata and data. Data 
> > > on Near line and metadata on SSDs
> > > 2. 1M block size GPFS filesystem as an AFM cache cluster, "with all the 
> > > required files fully cached" from the above GPFS cluster as home. Data 
> > > and Metadata together on SSDs
> > > 3. 16M block size GPFS filesystem, with separate metadata and data. Data 
> > > on Near line and metadata on SSDs
> > >
> > > When i run the script first time for “each" filesystem:
> > > I see that GPFS reads from the files, and caches into the pagepool as it 
> > > reads, from mmdiag --iohist
> > >
> > > When i run the second time, i see that there are no IO requests from the 
> > > compute node to GPFS NSD servers, which is expected since all the data 
> > > from the 3 filesystems is cached.
> > >
> > > However - the time taken for the script to run for the files in the 3 
> > > different filesystems is different - although i know that they are just 
> > > "mmapping"/reading from pagepool/cache and not from disk.
> > >
> > > Here is the difference in time, for IO just from pagepool:
> > >
> > > 20s 4M block size
> > > 15s 1M block size
> > > 40s 16M block size.
> > >
> > > Why do i see a difference when trying to mmap reads from different 
> > > block-size filesystems, although i see that the IO requests are not 
> > > hitting disks and just the pagepool?
> > >
> > > I am willing to share the strace output and mmdiag outputs if needed.
> > >
> > > Thanks,
> > > Lohit
> > >
> > > ___
> > > gpfsug-discuss mailing list
> > > gpfsug-discuss at spectrumscale.org
> > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at 

Re: [gpfsug-discuss] GPFS and Flash/SSD Storage tiered storage

2018-02-22 Thread IBM Spectrum Scale
My apologies for not being more clear on the flash storage pool.  I meant 
that this would be just another GPFS storage pool in the same cluster, so 
no separate AFM cache cluster.  You would then use the file heat feature 
to ensure more frequently accessed files are migrated to that all flash 
storage pool.
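
As an illustration of that approach, a file-heat-driven migration could look roughly 
like this (a hedged sketch; the pool names 'nlsas' and 'flash', the filesystem name 
'gpfs0', and the tuning values are placeholders, not taken from this thread):

#!/bin/bash
# Enable file heat tracking, then migrate the hottest files into the flash pool.
mmchconfig fileHeatPeriodMinutes=1440,fileHeatLossPercent=10

cat > /tmp/heat.pol <<'EOF'
/* Move the hottest files from the NL-SAS pool into the flash pool,
   filling the flash pool to at most 90% */
RULE 'hot2flash' MIGRATE FROM POOL 'nlsas' WEIGHT(FILE_HEAT) TO POOL 'flash' LIMIT(90)
EOF

mmapplypolicy gpfs0 -P /tmp/heat.pol -I yes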

As for LROC could you please clarify what you mean by a few headers/stubs 
of the file?  In reading the LROC documentation and the LROC variables 
available in the mmchconfig command I think you might want to take a look 
a the lrocDataStubFileSize variable since it seems to apply to your 
situation.

Regards, The Spectrum Scale (GPFS) team

--
If you feel that your question can benefit other users of  Spectrum Scale 
(GPFS), then please post it to the public IBM developerWorks Forum at 
https://www.ibm.com/developerworks/community/forums/html/forum?id=----0479
. 

If your query concerns a potential software error in Spectrum Scale (GPFS) 
and you have an IBM software maintenance contract please contact 
1-800-237-5511 in the United States or your local IBM Service Center in 
other countries. 

The forum is informally monitored as time permits and should not be used 
for priority messages to the Spectrum Scale (GPFS) team.



From:   vall...@cbio.mskcc.org
To: gpfsug main discussion list 
Cc: gpfsug-discuss-boun...@spectrumscale.org
Date:   02/22/2018 04:21 PM
Subject:Re: [gpfsug-discuss] GPFS and Flash/SSD Storage tiered 
storage
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Thank you. 

I am sorry if I was not clear, but the metadata pool is all on SSDs in the 
GPFS clusters that we use. It is just the data pool that is on near-line 
rotating disks.
I understand that AFM might not be able to solve the issue, and I will try 
and see if file heat works for migrating the files to the flash tier.
You mentioned an all flash storage pool for heavily used files - do you 
mean a different GPFS cluster with just flash storage, and manually copying 
the files to flash storage whenever needed?
The IO performance I am talking about is predominantly for reads. So are you 
saying that LROC can work the way I want it to, that is, prefetch whole 
files into the LROC cache after only a few headers/stubs of data are read 
from those files?
I thought LROC only keeps the blocks of data that are fetched from the 
disk, and will not prefetch the whole file if a stub of data is read.
Please do let me know if I have understood it wrong.

On Feb 22, 2018, 4:08 PM -0500, IBM Spectrum Scale , 
wrote:
I do not think AFM is intended to solve the problem you are trying to 
solve.  If I understand your scenario correctly you state that you are 
placing metadata on NL-SAS storage.  If that is true that would not be 
wise especially if you are going to do many metadata operations.  I 
suspect your performance issues are partially due to the fact that 
metadata is being stored on NL-SAS storage.  You stated that you did not 
think the file heat feature would do what you intended but have you tried 
to use it to see if it could solve your problem?  I would think having 
metadata on SSD/flash storage combined with an all flash storage pool for 
your heavily used files would perform well.  If you expect IO usage will 
be such that there will be far more reads than writes then LROC should be 
beneficial to your overall performance.

Regards, The Spectrum Scale (GPFS) team

--
If you feel that your question can benefit other users of  Spectrum Scale 
(GPFS), then please post it to the public IBM developerWorks Forum at 
https://www.ibm.com/developerworks/community/forums/html/forum?id=----0479
.

If your query concerns a potential software error in Spectrum Scale (GPFS) 
and you have an IBM software maintenance contract please contact 
1-800-237-5511 in the United States or your local IBM Service Center in 
other countries.

The forum is informally monitored as time permits and should not be used 
for priority messages to the Spectrum Scale (GPFS) team.



From:    vall...@cbio.mskcc.org
To:      gpfsug main discussion list
Date:    02/22/2018 03:11 PM
Subject: [gpfsug-discuss] GPFS and Flash/SSD Storage tiered storage
Sent by: gpfsug-discuss-boun...@spectrumscale.org



Hi All,

I am trying to figure out a GPFS tiering architecture with flash storage 
in front end and near line storage as backend, for Supercomputing

The Backend storage will be a GPFS storage on near line of about 8-10PB. 
The backend storage will/can be tuned to give out large streaming 
bandwidth and enough metadata disks to make the stat of all these files 
fast 

Re: [gpfsug-discuss] GPFS and Flash/SSD Storage tiered storage

2018-02-22 Thread valleru
Thank you.

I am sorry if I was not clear, but the metadata pool is all on SSDs in the GPFS 
clusters that we use. It is just the data pool that is on near-line rotating 
disks.
I understand that AFM might not be able to solve the issue, and I will try and 
see if file heat works for migrating the files to the flash tier.
You mentioned an all flash storage pool for heavily used files - do you mean a 
different GPFS cluster with just flash storage, and manually copying the files 
to flash storage whenever needed?
The IO performance I am talking about is predominantly for reads. So are you 
saying that LROC can work the way I want it to, that is, prefetch whole files 
into the LROC cache after only a few headers/stubs of data are read from those 
files?
I thought LROC only keeps the blocks of data that are fetched from the disk, 
and will not prefetch the whole file if a stub of data is read.
Please do let me know if I have understood it wrong.

On Feb 22, 2018, 4:08 PM -0500, IBM Spectrum Scale , wrote:
> I do not think AFM is intended to solve the problem you are trying to solve.  
> If I understand your scenario correctly you state that you are placing 
> metadata on NL-SAS storage.  If that is true that would not be wise 
> especially if you are going to do many metadata operations.  I suspect your 
> performance issues are partially due to the fact that metadata is being 
> stored on NL-SAS storage.  You stated that you did not think the file heat 
> feature would do what you intended but have you tried to use it to see if it 
> could solve your problem?  I would think having metadata on SSD/flash storage 
> combined with a all flash storage pool for your heavily used files would 
> perform well.  If you expect IO usage will be such that there will be far 
> more reads than writes then LROC should be beneficial to your overall 
> performance.
>
> Regards, The Spectrum Scale (GPFS) team
>
> --
> If you feel that your question can benefit other users of  Spectrum Scale 
> (GPFS), then please post it to the public IBM developerWorks Forum at 
> https://www.ibm.com/developerworks/community/forums/html/forum?id=----0479.
>
> If your query concerns a potential software error in Spectrum Scale (GPFS) 
> and you have an IBM software maintenance contract please contact  
> 1-800-237-5511 in the United States or your local IBM Service Center in other 
> countries.
>
> The forum is informally monitored as time permits and should not be used for 
> priority messages to the Spectrum Scale (GPFS) team.
>
>
>
> From:        vall...@cbio.mskcc.org
> To:        gpfsug main discussion list 
> Date:        02/22/2018 03:11 PM
> Subject:        [gpfsug-discuss] GPFS and Flash/SSD Storage tiered storage
> Sent by:        gpfsug-discuss-boun...@spectrumscale.org
>
>
>
> Hi All,
>
> I am trying to figure out a GPFS tiering architecture with flash storage in 
> front end and near line storage as backend, for Supercomputing
>
> The Backend storage will be a GPFS storage on near line of about 8-10PB. The 
> backend storage will/can be tuned to give out large streaming bandwidth and 
> enough metadata disks to make the stat of all these files fast enough.
>
> I was thinking if it would be possible to use a GPFS flash cluster or GPFS 
> SSD cluster in front end that uses AFM and acts as a cache cluster with the 
> backend GPFS cluster.
>
> At the end of this .. the workflow that i am targeting is where:
>
>
> “
> If the compute nodes read headers of thousands of large files ranging from 
> 100MB to 1GB, the AFM cluster should be able to bring up enough threads to 
> bring up all of the files from the backend to the faster SSD/Flash GPFS 
> cluster.
> The working set might be about 100T, at a time which i want to be on a 
> faster/low latency tier, and the rest of the files to be in slower tier until 
> they are read by the compute nodes.
> “
>
>
> The reason I do not want to use GPFS policies to achieve the above is that I am not 
> sure policies can be written in a way that moves files from the slower tier to the 
> faster tier depending on how the jobs interact with the files.
> I know that policies can be written based on heat and size/format, but I don't think 
> those policies work in the way described above.
>
> I did try the above architecture, where an SSD GPFS cluster acts as an AFM 
> cache cluster in front of the near line storage. However, the AFM cluster was 
> really slow: it took a few hours to copy the files from the near line storage 
> to the AFM cache cluster.
> I am not sure whether AFM is not designed to work this way, or whether AFM is 
> not tuned to work as fast as it should.
>
> I have tried LROC too, but it does not behave the same way as i guess AFM 
> works.
>
> Has anyone tried or know if GPFS supports an architecture - where 

Re: [gpfsug-discuss] GPFS and Flash/SSD Storage tiered storage

2018-02-22 Thread IBM Spectrum Scale
I do not think AFM is intended to solve the problem you are trying to 
solve.  If I understand your scenario correctly you state that you are 
placing metadata on NL-SAS storage.  If that is true that would not be 
wise especially if you are going to do many metadata operations.  I 
suspect your performance issues are partially due to the fact that 
metadata is being stored on NL-SAS storage.  You stated that you did not 
think the file heat feature would do what you intended but have you tried 
to use it to see if it could solve your problem?  I would think having 
metadata on SSD/flash storage combined with an all flash storage pool for 
your heavily used files would perform well.  If you expect IO usage will 
be such that there will be far more reads than writes then LROC should be 
beneficial to your overall performance.
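
A quick way to confirm where metadata actually lives and how the pools are laid out 
(a small sketch; 'gpfs0' is a placeholder filesystem name):

mmlsdisk gpfs0 -L    # per-disk view: storage pool and whether each disk holds metadata, data, or both
mmdf gpfs0           # capacity and usage per pool, including the system (metadata) pool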

Regards, The Spectrum Scale (GPFS) team

--
If you feel that your question can benefit other users of  Spectrum Scale 
(GPFS), then please post it to the public IBM developerWorks Forum at 
https://www.ibm.com/developerworks/community/forums/html/forum?id=----0479
. 

If your query concerns a potential software error in Spectrum Scale (GPFS) 
and you have an IBM software maintenance contract please contact 
1-800-237-5511 in the United States or your local IBM Service Center in 
other countries. 

The forum is informally monitored as time permits and should not be used 
for priority messages to the Spectrum Scale (GPFS) team.



From:   vall...@cbio.mskcc.org
To: gpfsug main discussion list 
Date:   02/22/2018 03:11 PM
Subject:[gpfsug-discuss] GPFS and Flash/SSD Storage tiered storage
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Hi All, 

I am trying to figure out a GPFS tiering architecture with flash storage 
in front end and near line storage as backend, for Supercomputing

The Backend storage will be a GPFS storage on near line of about 8-10PB. 
The backend storage will/can be tuned to give out large streaming 
bandwidth and enough metadata disks to make the stat of all these files 
fast enough.

I was thinking if it would be possible to use a GPFS flash cluster or GPFS 
SSD cluster in front end that uses AFM and acts as a cache cluster with 
the backend GPFS cluster.

At the end of this .. the workflow that i am targeting is where:


“ 
If the compute nodes read headers of thousands of large files ranging from 
100MB to 1GB, the AFM cluster should be able to bring up enough threads to 
bring up all of the files from the backend to the faster SSD/Flash GPFS 
cluster. 
The working set might be about 100T, at a time which i want to be on a 
faster/low latency tier, and the rest of the files to be in slower tier 
until they are read by the compute nodes.
“


The reason I do not want to use GPFS policies to achieve the above is that I 
am not sure policies can be written in a way that moves files from the slower 
tier to the faster tier depending on how the jobs interact with the files.
I know that policies can be written based on heat and size/format, but I don't 
think those policies work in the way described above.

I did try the above architecture, where an SSD GPFS cluster acts as an AFM 
cache cluster in front of the near line storage. However, the AFM cluster was 
really slow: it took a few hours to copy the files from the near line storage 
to the AFM cache cluster.
I am not sure whether AFM is not designed to work this way, or whether AFM is 
not tuned to work as fast as it should.

I have tried LROC too, but it does not behave the same way as i guess AFM 
works.

Has anyone tried, or does anyone know, whether GPFS supports an architecture 
where the fast tier can bring up thousands of threads and copy the files 
almost instantly/asynchronously from the slow tier whenever the jobs on the 
compute nodes read a few blocks from these files?
I understand that with respect to hardware - the AFM cluster should be 
really fast, as well as the network between the AFM cluster and the 
backend cluster.

Please do also let me know, if the above workflow can be done using GPFS 
policies and be as fast as it is needed to be.

Regards,
Lohit

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss=DwICAg=jf_iaSHvJObTbx-siA1ZOg=IbxtjdkPAM2Sbon4Lbbi4w=kMYZhGPhwadAbNHucw79NJgyYAJAMgxyFZKEW-kMeqk=AT1gb89TzzE7nt58h8DYyhYkybvBY8mbXvdPjtaRRpU=





___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] GPFS, MMAP and Pagepool

2018-02-22 Thread Sven Oehme
Hi Lohit,

i am working with ray on a mmap performance improvement right now, which
most likely has the same root cause as yours , see -->
http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html
the thread above is silent after a couple of back and forth, but ray and i
have active communication in the background and will repost as soon as
there is something new to share.
i am happy to look at this issue after we finish with ray's workload if
there is something missing, but first let's finish his, get you try the
same fix and see if there is something missing.

btw. if people would share their use of MMAP , what applications they use
(home grown, just use lmdb which uses mmap under the cover, etc) please let
me know so i get a better picture on how wide the usage is with GPFS. i
know a lot of the ML/DL workloads are using it, but i would like to know
what else is out there i might not think about. feel free to drop me a
personal note, i might not reply to it right away, but eventually.

thx. sven


On Thu, Feb 22, 2018 at 12:33 PM  wrote:

> Hi all,
>
> I wanted to know, how does mmap interact with GPFS pagepool with respect
> to filesystem block-size?
> Does the efficiency depend on the mmap read size and the block-size of the
> filesystem even if all the data is cached in pagepool?
>
> GPFS 4.2.3.2 and CentOS7.
>
> Here is what i observed:
>
> I was testing a user script that uses mmap to read from 100M to 500MB
> files.
>
> The above files are stored on 3 different filesystems.
>
> Compute nodes - 10G pagepool and 5G seqdiscardthreshold.
>
> 1. 4M block size GPFS filesystem, with separate metadata and data. Data on
> Near line and metadata on SSDs
> 2. 1M block size GPFS filesystem as an AFM cache cluster, "with all the
> required files fully cached" from the above GPFS cluster as home. Data and
> Metadata together on SSDs
> 3. 16M block size GPFS filesystem, with separate metadata and data. Data
> on Near line and metadata on SSDs
>
> When i run the script first time for “each" filesystem:
> I see that GPFS reads from the files, and caches into the pagepool as it
> reads, from mmdiag --iohist
>
> When i run the second time, i see that there are no IO requests from the
> compute node to GPFS NSD servers, which is expected since all the data from
> the 3 filesystems is cached.
>
> However - the time taken for the script to run for the files in the 3
> different filesystems is different - although i know that they are just
> "mmapping"/reading from pagepool/cache and not from disk.
>
> Here is the difference in time, for IO just from pagepool:
>
> 20s 4M block size
> 15s 1M block size
> 40s 16M block size.
>
> Why do i see a difference when trying to mmap reads from different
> block-size filesystems, although i see that the IO requests are not hitting
> disks and just the pagepool?
>
> I am willing to share the strace output and mmdiag outputs if needed.
>
> Thanks,
> Lohit
>
> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] GPFS, MMAP and Pagepool

2018-02-22 Thread valleru
Hi all,

I wanted to know, how does mmap interact with GPFS pagepool with respect to 
filesystem block-size?
Does the efficiency depend on the mmap read size and the block-size of the 
filesystem even if all the data is cached in pagepool?

GPFS 4.2.3.2 and CentOS7.

Here is what i observed:

I was testing a user script that uses mmap to read from 100M to 500MB files.

The above files are stored on 3 different filesystems.

Compute nodes - 10G pagepool and 5G seqdiscardthreshold.

1. 4M block size GPFS filesystem, with separate metadata and data. Data on Near 
line and metadata on SSDs
2. 1M block size GPFS filesystem as an AFM cache cluster, "with all the required 
files fully cached" from the above GPFS cluster as home. Data and Metadata 
together on SSDs
3. 16M block size GPFS filesystem, with separate metadata and data. Data on 
Near line and metadata on SSDs

When i run the script first time for “each" filesystem:
I see that GPFS reads from the files, and caches into the pagepool as it reads, 
from mmdiag --iohist

When i run the second time, i see that there are no IO requests from the 
compute node to GPFS NSD servers, which is expected since all the data from the 
3 filesystems is cached.

However - the time taken for the script to run for the files in the 3 different 
filesystems is different - although i know that they are just 
"mmapping"/reading from pagepool/cache and not from disk.

Here is the difference in time, for IO just from pagepool:

20s 4M block size
15s 1M block size
40s 16M block size.
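
For anyone reproducing this comparison, the cached (second) runs can be timed and 
checked against the I/O history roughly like this (a sketch; the script name and 
mount points are placeholders for the actual workload and filesystems):

#!/bin/bash
# Time the mmap-reading workload on each filesystem and confirm no new disk I/O is issued.
for fs in /gpfs_4m /gpfs_1m /gpfs_16m; do
    echo "== $fs (cached run) =="
    /usr/bin/time -p ./mmap_read_test "$fs"/testfiles/*.dat   # the user's mmap-based script
    /usr/lpp/mmfs/bin/mmdiag --iohist | tail -n 20            # recent I/O history; should show no new disk reads
done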

Why do i see a difference when trying to mmap reads from different block-size 
filesystems, although i see that the IO requests are not hitting disks and just 
the pagepool?

I am willing to share the strace output and mmdiag outputs if needed.

Thanks,
Lohit

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] GPFS and Flash/SSD Storage tiered storage

2018-02-22 Thread valleru
Hi All,

I am trying to figure out a GPFS tiering architecture with flash storage in 
front end and near line storage as backend, for Supercomputing

The Backend storage will be a GPFS storage on near line of about 8-10PB. The 
backend storage will/can be tuned to give out large streaming bandwidth and 
enough metadata disks to make the stat of all these files fast enough.

I was thinking if it would be possible to use a GPFS flash cluster or GPFS SSD 
cluster in front end that uses AFM and acts as a cache cluster with the backend 
GPFS cluster.

At the end of this .. the workflow that i am targeting is where:


“
If the compute nodes read headers of thousands of large files ranging from 
100MB to 1GB, the AFM cluster should be able to bring up enough threads to 
bring up all of the files from the backend to the faster SSD/Flash GPFS cluster.
The working set might be about 100T, at a time which i want to be on a 
faster/low latency tier, and the rest of the files to be in slower tier until 
they are read by the compute nodes.
“


The reason I do not want to use GPFS policies to achieve the above is that I am not 
sure policies can be written in a way that moves files from the slower tier to the 
faster tier depending on how the jobs interact with the files.
I know that policies can be written based on heat and size/format, but I don't think 
those policies work in the way described above.

I did try the above architecture, where an SSD GPFS cluster acts as an AFM cache 
cluster in front of the near line storage. However, the AFM cluster was really slow: 
it took a few hours to copy the files from the near line storage to the AFM cache 
cluster.
I am not sure whether AFM is not designed to work this way, or whether AFM is not 
tuned to work as fast as it should.
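
For completeness, when the working set is known up front, the fetch into the AFM cache 
fileset can also be driven explicitly rather than waiting for on-demand reads, along 
these lines (a hedged sketch; the filesystem name, fileset name, and file pattern are 
placeholders):

#!/bin/bash
# Build a list of the files the upcoming jobs will touch and ask AFM to prefetch them.
LIST=/tmp/working_set.list
find /gpfs_cache/project1 -type f -name '*.img' > "$LIST"      # files the jobs will read
mmafmctl gpfs_cache prefetch -j project1 --list-file "$LIST"   # queue the prefetch on the gateway nodes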

I have tried LROC too, but it does not behave the same way as i guess AFM works.

Has anyone tried, or does anyone know, whether GPFS supports an architecture where 
the fast tier can bring up thousands of threads and copy the files almost 
instantly/asynchronously from the slow tier whenever the jobs on the compute nodes 
read a few blocks from these files?
I understand that with respect to hardware - the AFM cluster should be really 
fast, as well as the network between the AFM cluster and the backend cluster.

Please do also let me know, if the above workflow can be done using GPFS 
policies and be as fast as it is needed to be.

Regards,
Lohit


___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] mmfind - Use mmfind ... -xargs

2018-02-22 Thread Marc A Kaplan
More recent versions of mmfind support an -xargs option...  Run mmfind 
--help and see:

   -xargs [-L maxlines] [-I rplstr] COMMAND

  Similar to find ... | xargs [-L x] [-I r] COMMAND

  but COMMAND executions may run in parallel. This is preferred
  to -exec. With -xargs mmfind will run the COMMANDs in phase subject to
  mmapplypolicy options -m, -B, -N. Must be the last option to mmfind

This gives you the fully parallelized power of mmapplypolicy without 
having to write SQL rules nor scripts.
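
For example (a hedged illustration; the path and command are placeholders):

# Compress every regular file under the directory, with the executions parallelized
# by mmapplypolicy's helper processes rather than a single serial xargs.
mmfind /gpfs0/scratch/old -type f -xargs gzip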



From:   John Hearns 
To: gpfsug main discussion list 
Date:   02/21/2018 11:00 PM
Subject:[gpfsug-discuss] mmfind - a ps.
Sent by:gpfsug-discuss-boun...@spectrumscale.org



PS. Here is how to get mmfind to run some operation on the files it finds.
(I installed mmfind in /usr/local/bin)
 
I find this very hacky, though I suppose it is idiomatic bash
 
#!/bin/bash
 
while read filename
do
   echo -n   $filename " "
done <<< "`/usr/local/bin/mmfind  /hpc/bscratch -type f`"
-- The information contained in this communication and any attachments is 
confidential and may be privileged, and is for the sole use of the 
intended recipient(s). Any unauthorized review, use, disclosure or 
distribution is prohibited. Unless explicitly stated otherwise in the body 
of this communication or the attachment thereto (if any), the information 
is provided on an AS-IS basis without any express or implied warranties or 
liabilities. To the extent you are relying on this information, you are 
doing so at your own risk. If you are not the intended recipient, please 
notify the sender immediately by replying to this message and destroy all 
copies of this message and any attachments. Neither the sender nor the 
company/group of companies he or she represents shall be liable for the 
proper and complete transmission of the information contained in this 
communication, or for any delay in its receipt. 
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss=DwICAg=jf_iaSHvJObTbx-siA1ZOg=cvpnBBH0j41aQy0RPiG2xRL_M8mTc1izuQD3_PmtjZ8=vbcae5NoH6gMQCovOqRVJVgj9jJ2USmq47GHxVn6En8=F_GqjJRzSzubUSXpcjysWCwCjhVKO9YrbUdzjusY0SY=





___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] mmfind -ls

2018-02-22 Thread Marc A Kaplan
Leaving aside the -exec option, and whether you choose classic find or 
mmfind,

why not just use the -ls option - same output, less overhead...

mmfind pathname -type f -ls 




From:   John Hearns 
To: gpfsug main discussion list 
Cc: "gpfsug-discuss-boun...@spectrumscale.org" 

Date:   02/22/2018 04:03 AM
Subject:Re: [gpfsug-discuss] mmfind will not exec
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Stupid me. The space between the {}  and \; is significant.
/usr/local/bin/mmfind /hpc/bscratch -type f -exec /bin/ls {} \;
 
Still would be nice to have the documentation clarified please.
  
 
 
 
 
From: gpfsug-discuss-boun...@spectrumscale.org [
mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of IBM Spectrum 
Scale
Sent: Thursday, February 22, 2018 2:26 AM
To: gpfsug main discussion list 
Cc: gpfsug-discuss-boun...@spectrumscale.org
Subject: Re: [gpfsug-discuss] mmfind will not exec
 
Looking at the mmfind.README it indicates that it only supports the format 
you used with the semi-colon.  Did you capture any output of the problem?

Regards, The Spectrum Scale (GPFS) team

--
If you feel that your question can benefit other users of  Spectrum Scale 
(GPFS), then please post it to the public IBM developerWorks Forum at 
https://www.ibm.com/developerworks/community/forums/html/forum?id=----0479
. 

If your query concerns a potential software error in Spectrum Scale (GPFS) 
and you have an IBM software maintenance contract please contact 
1-800-237-5511 in the United States or your local IBM Service Center in 
other countries. 

The forum is informally monitored as time permits and should not be used 
for priority messages to the Spectrum Scale (GPFS) team.



From:    John Hearns
To:      gpfsug main discussion list
Date:    02/21/2018 06:45 PM
Subject: [gpfsug-discuss] mmfind will not exec
Sent by: gpfsug-discuss-boun...@spectrumscale.org




I would dearly like to use mmfind in a project I am working on  (version 
4.2.3.4 at the moment)
 
mmfind /hpc/bscratch  -type f  works fine
 
mmfind /hpc/bscratch  -type f -exec /bin/ls {}\ ;  crashes and burns
 
I know there are supposed to be problems with exec and mmfind, and this is 
sample software shipped without warranty etc.
But why let me waste hours on this when it won’t work?
There is even an example in the README for mmfind 
 
./mmfind /encFS -type f -exec /bin/readMyFile {} \;
But in the help for mmfind:
-exec COMMANDs are terminated by a standalone ';' or by the string '{} +’
 
So which is it? The normal find version {} \; or {} +
 
-- The information contained in this communication and any attachments is 
confidential and may be privileged, and is for the sole use of the 
intended recipient(s). Any unauthorized review, use, disclosure or 
distribution is prohibited. Unless explicitly stated otherwise in the body 
of this communication or the attachment thereto (if any), the information 
is provided on an AS-IS basis without any express or implied warranties or 
liabilities. To the extent you are relying on this information, you are 
doing so at your own risk. If you are not the intended recipient, please 
notify the sender immediately by replying to this message and destroy all 
copies of this message and any attachments. Neither the sender nor the 
company/group of companies he or she represents shall be liable for the 
proper and complete transmission of the information contained in this 
communication, or for any delay in its receipt. 
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss=DwICAg=jf_iaSHvJObTbx-siA1ZOg=IbxtjdkPAM2Sbon4Lbbi4w=OC7XNZeulP0vmS8Fq-RJuun5wOqFPootm0QHxBXUfKg=LUvpk53AaNcHSGQgDgH8FAiOOsH1H0OPOV9MFGMIi9E=
 
-- The information contained in this communication and any attachments is 
confidential and may be privileged, and is for the sole use of the 
intended recipient(s). Any unauthorized review, use, disclosure or 
distribution is prohibited. Unless explicitly stated otherwise in the body 
of this communication or the attachment thereto (if any), the information 
is provided on an AS-IS basis without any express or implied warranties or 
liabilities. To the extent you are relying on this information, you are 
doing so at your own risk. If you are not the intended recipient, please 
notify the sender immediately by replying to this message and destroy all 
copies of this message and any attachments. Neither the sender nor the 
company/group of companies he 

Re: [gpfsug-discuss] mmfind will not exec

2018-02-22 Thread John Hearns
Stupid me. The space between the {}  and \; is significant.
/usr/local/bin/mmfind /hpc/bscratch -type f -exec /bin/ls {} \;

Still would be nice to have the documentation clarified please.





From: gpfsug-discuss-boun...@spectrumscale.org 
[mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of IBM Spectrum 
Scale
Sent: Thursday, February 22, 2018 2:26 AM
To: gpfsug main discussion list 
Cc: gpfsug-discuss-boun...@spectrumscale.org
Subject: Re: [gpfsug-discuss] mmfind will not exec

Looking at the mmfind.README it indicates that it only supports the format you 
used with the semi-colon.  Did you capture any output of the problem?

Regards, The Spectrum Scale (GPFS) team

--
If you feel that your question can benefit other users of  Spectrum Scale 
(GPFS), then please post it to the public IBM developerWorks Forum at 
https://www.ibm.com/developerworks/community/forums/html/forum?id=----0479.

If your query concerns a potential software error in Spectrum Scale (GPFS) and 
you have an IBM software maintenance contract please contact  1-800-237-5511 in 
the United States or your local IBM Service Center in other countries.

The forum is informally monitored as time permits and should not be used for 
priority messages to the Spectrum Scale (GPFS) team.



From:    John Hearns
To:      gpfsug main discussion list
Date:    02/21/2018 06:45 PM
Subject: [gpfsug-discuss] mmfind will not exec
Sent by: gpfsug-discuss-boun...@spectrumscale.org




I would dearly like to use mmfind in a project I am working on  (version 
4.2.3.4 at the moment)

mmfind /hpc/bscratch  -type f  works fine

mmfind /hpc/bscratch  -type f -exec /bin/ls {}\ ;  crashes and burns

I know there are supposed to be problems with exec and mmfind, and this is 
sample software shipped without warranty etc.
But why let me waste hours on this when it won’t work?
There is even an example in the README for mmfind

./mmfind /encFS -type f -exec /bin/readMyFile {} \;
But in the help for mmfind:
-exec COMMANDs are terminated by a standalone ';' or by the string '{} +’

So which is it? The normal find version {} \; or {} +


-- The information contained in this communication and any attachments is 
confidential and may be privileged, and is for the sole use of the intended 
recipient(s). Any unauthorized review, use, disclosure or distribution is 
prohibited. Unless explicitly stated otherwise in the body of this 
communication or the attachment thereto (if any), the information is provided 
on an AS-IS basis without any express or implied warranties or liabilities. To 
the extent you are relying on this information, you are doing so at your own 
risk. If you are not the intended recipient, please notify the sender 
immediately by replying to this message and destroy all copies of this message 
and any attachments. Neither the sender nor the company/group of companies he 
or she represents shall be liable for the proper and complete transmission of 
the information contained in this communication, or for any delay in its 
receipt. ___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss=DwICAg=jf_iaSHvJObTbx-siA1ZOg=IbxtjdkPAM2Sbon4Lbbi4w=OC7XNZeulP0vmS8Fq-RJuun5wOqFPootm0QHxBXUfKg=LUvpk53AaNcHSGQgDgH8FAiOOsH1H0OPOV9MFGMIi9E=



-- The information contained in this communication and any attachments is 
confidential and may be privileged, and is for the sole use of the intended 
recipient(s). Any unauthorized review, use, disclosure or distribution is 
prohibited. Unless explicitly stated otherwise in the body of this 
communication or the attachment