Re: [gpfsug-discuss] Blocksize

2016-09-28 Thread Greg.Lehmann
Are there any presentations available online that provide diagrams of the 
directory/file creation process and modifications, in terms of how the 
blocks/inodes and indirect blocks etc. are used? I would guess there are a few 
different cases that would need to be shown.

This is the sort of thing that would be great in a decent textbook on GPFS 
(which doesn't exist, as far as I am aware).

Cheers,

Greg

From: gpfsug-discuss-boun...@spectrumscale.org 
[mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of Marc A Kaplan
Sent: Thursday, 29 September 2016 1:23 AM
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] Blocksize

OKAY, I'll say it again.  Inodes are PACKED into a single inode file.  So a 4KB 
inode takes 4KB, REGARDLESS of metadata blocksize.  There is no wasted space.

(Of course if you have metadata replication = 2, then yes, double that.  And 
yes, there is overhead for indirect blocks (indices), allocation maps, etc, etc.)

And your choice is not just 512 or 4096.  Maybe 1KB or 2KB is a good choice for 
your data distribution, to optimize packing of data and/or directories into 
inodes...

Hmmm... I don't know why the doc leaves out 2048, perhaps a typo...

mmcrfs x2K -i 2048

[root@n2 charts]# mmlsfs x2K -i
flag                value                    description
------------------- ------------------------ -----------------------------------
 -i                 2048                     Inode size in bytes

Works for me!
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Blocksize - file size distribution

2016-09-28 Thread Edward Wahl
On Wed, 28 Sep 2016 10:34:05 -0400
Marc A Kaplan  wrote:

> Consider using samples/ilm/mmfind (or mmapplypolicy with a LIST ...
> SHOW rule) to gather the stats much faster.  Should be minutes, not
> hours.
> 


I'll second the policy engine.  It runs like a beast if you tune it a
little for nodes and threads.

It only takes a couple of minutes to collect info on over a hundred
million files.  Want to show where the data is now by pool and sort it by
age with queries?  Here's a quick hacked-up example; you could sort the
mess on the front end fairly quickly (use fileset, pool, etc. as your
storage needs dictate):

RULE '2yrold_files' LIST  '2yrold_filelist.txt'

SHOW (varchar(file_size) || '  ' || varchar(USER_ID) || '  ' || 
varchar(POOL_NAME))
WHERE DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME) >= 730 AND 
DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME) < 1095

Don't forget to run the engine with -I defer for this kind of
list/show policy.
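For illustration, here is a rough sketch of the kind of invocation Ed is
describing -- the file system name, policy file, node names, and paths are
invented, and the tuning options worth using vary by release, so check the
mmapplypolicy man page before copying it:

  mmapplypolicy gpfs0 -P /tmp/2yrold.pol -I defer \
      -f /gpfs0/tmp/lists -g /gpfs0/tmp \
      -N nsdnode1,nsdnode2 -m 8

-I defer produces the candidate lists without acting on them, -N and -m
spread the scan across helper nodes and threads, and -f/-g put the output
lists and scratch files somewhere all of those nodes can reach.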

Ed





-- 

Ed Wahl
Ohio Supercomputer Center
614-292-9302
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Blocksize

2016-09-28 Thread Marc A Kaplan
OKAY, I'll say it again.  Inodes are PACKED into a single inode file.  So 
a 4KB inode takes 4KB, REGARDLESS of metadata blocksize.  There is no 
wasted space.

(Of course if you have metadata replication = 2, then yes, double that. 
And yes, there is overhead for indirect blocks (indices), allocation maps, 
etc, etc.)

And your choice is not just 512 or 4096.  Maybe 1KB or 2KB is a good 
choice for your data distribution, to optimize packing of data and/or 
directories into inodes...

Hmmm... I don't know why the doc leaves out 2048, perhaps a typo...

mmcrfs x2K -i 2048

[root@n2 charts]# mmlsfs x2K -i
flag                value                    description
------------------- ------------------------ -----------------------------------
 -i                 2048                     Inode size in bytes

Works for me!
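For context, the mmcrfs line above is abbreviated -- a real invocation also
needs the NSD stanza file, and usually the block size options as well.  A
rough sketch with made-up names (verify the options against your release):

  mmcrfs x2K -F /tmp/x2K.stanzas -i 2048 -B 1M --metadata-block-size 1M -m 2 -r 1
  mmlsfs x2K

Here -i sets the inode size, -B the data block size, --metadata-block-size
the block size used by a metadataOnly system pool, and -m/-r the default
metadata and data replication.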

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Blocksize - file size distribution

2016-09-28 Thread Marc A Kaplan
Consider using samples/ilm/mmfind (or mmapplypolicy with a LIST ... SHOW 
rule) to gather the stats much faster.  Should be minutes, not hours.



___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Blocksize - file size distribution

2016-09-28 Thread Oesterlin, Robert
/usr/lpp/mmfs/samples/debugtools/filehist

Look at the README in that directory.
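If you go the policy-scan route from the other replies instead, a crude
histogram can be hacked together from the size column of whatever list the
scan produces.  A sketch, assuming you have already extracted one file size
in bytes per line into sizes.txt (list file formats vary, so adapt the
extraction to your own SHOW clause):

  awk '{ if ($1 < 1024) b="under 1K";
         else if ($1 < 1048576) b="1K-1M";
         else if ($1 < 1073741824) b="1M-1G";
         else b="over 1G";
         count[b]++ }
       END { for (x in count) print x, count[x] }' sizes.txt

filehist does a much more thorough job; this is only a quick sanity check.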


Bob Oesterlin
Sr Storage Engineer, Nuance HPC Grid


From:  on behalf of 
"greg.lehm...@csiro.au" 
Reply-To: gpfsug main discussion list 
Date: Wednesday, September 28, 2016 at 2:40 AM
To: "gpfsug-discuss@spectrumscale.org" 
Subject: [EXTERNAL] Re: [gpfsug-discuss] Blocksize

I am wondering what people use to produce a file size distribution report for 
their filesystems. Has everyone rolled their own, or is there some go-to app to 
use?

Cheers,

Greg

From: gpfsug-discuss-boun...@spectrumscale.org 
[mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of Buterbaugh, 
Kevin L
Sent: Wednesday, 28 September 2016 7:21 AM
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] Blocksize

Hi All,

Again, my thanks to all who responded to my last post.  Let me begin by stating 
something I unintentionally omitted in my last post … we already use SSDs for 
our metadata.

Which leads me to yet another question … of my three filesystems, two (/home 
and /scratch) are much older (created in 2010) and therefore currently have a 
512 byte inode size.  /data is newer and has a 4K inode size.  Now if I combine 
/scratch and /data into one filesystem with a 4K inode size, the amount of 
space used by all the inodes coming over from /scratch is going to grow by a 
factor of eight unless I’m horribly confused.  And I would assume I need to 
count the amount of space taken up by allocated inodes, not just used inodes.

Therefore … how much space my metadata takes up just grew significantly in 
importance since:  1) metadata is on very expensive enterprise class, vendor 
certified SSDs, 2) we use RAID 1 mirrors of those SSDs, and 3) we have metadata 
replication set to two.  Some of the information presented by Sven and Yuri 
seems to contradict each other in regards to how much space inodes take up … or 
I’m misunderstanding one or both of them!  Leaving aside replication, if I use 
a 256K block size for my metadata and I use 4K inodes, are those inodes going 
to take up 4K each or are they going to take up 8K each (1/32nd of a 256K 
block)?
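To put rough numbers on that growth (the file count is invented purely for
illustration): with 100 million allocated inodes,

  100,000,000 x 512 bytes      ~  48 GiB of inode space
  100,000,000 x 4 KiB          ~ 381 GiB of inode space   (the factor of eight)
  x 2 for metadata replication ~ 763 GiB on the SSDs

and since inodes are packed into the inode file (per Marc's note elsewhere
in the thread), each 4K inode costs 4 KiB whether the metadata block size is
256K or 1M.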

By the way, I do have a file size / file age spreadsheet for each of my 
filesystems (which I would be willing to share with interested parties) and 
while I was not surprised to learn that I have over 10 million sub-1K files on 
/home, I was a bit surprised to find that I have almost as many sub-1K files on 
/scratch (and a few million more on /data).  So there’s a huge potential win in 
having those files in the inode on SSD as opposed to on spinning disk, but 
there’s also a huge potential $$$ cost.

Thanks again … I hope others are gaining useful information from this thread.  
I sure am!

Kevin

On Sep 27, 2016, at 1:26 PM, Yuri L Volobuev <volob...@us.ibm.com> wrote:

> 1) Let’s assume that our overarching goal in configuring the block
> size for metadata is performance from the user perspective … i.e.
> how fast is an “ls -l” on my directory?  Space savings aren’t
> important, and how long policy scans or other “administrative” type
> tasks take is not nearly as important as that directory listing.
> Does that change the recommended metadata block size?

The performance challenges for the "ls -l" scenario are quite different from 
the policy scan scenario, so the same rules do not necessarily apply.

During "ls -l" the code has to read inodes one by one (there's some prefetching 
going on, to take the edge off for the actual 'ls' thread, but prefetching is 
still one inode at a time).  Metadata block size doesn't really come into the 
picture in this case, but inode size can be important -- depending on the 
storage performance characteristics.  Does the storage you use for metadata 
exhibit a meaningfully different latency for 4K random reads vs 512 byte random 
reads?  In my personal experience, on any modern storage device the difference 
is non-existent; in fact many devices (like all flash-based storage) use 4K 
native physical block size, and merely emulate 512 byte "sectors", so there's 
no way to read less than 4K.  So from the inode read latency point of view 4K 
vs 512B is most likely a wash, but then 4K inodes can help improve performance 
of other operations, e.g. readdir of a small directory which fits entirely into 
the inode.  If you use xattrs (e.g. as a side effect of using HSM), 4K inodes 
definitely help, by allowing xattrs to be stored in the inode.

Policy scans read inodes in full blocks, and there both metadata block size 
and inode size matter.  Larger blocks could improve the inode read performance, 
while larger inodes mean that the same number of blocks hold fewer inodes and 
thus more blocks need to be read.  So the policy inode scan performance is 
benefited by larger metadata block size and smaller inodes.  However, policy 
scans also have to perform a directory traversal step, and that step tends to 
dominate the runtime of the overall run, and using larger inodes actually helps 
to speed up traversal of smaller directories.

Re: [gpfsug-discuss] Blocksize

2016-09-28 Thread Greg.Lehmann
I am wondering what people use to produce a file size distribution report for 
their filesystems. Has everyone rolled their own, or is there some go-to app to 
use?

Cheers,

Greg

From: gpfsug-discuss-boun...@spectrumscale.org 
[mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of Buterbaugh, 
Kevin L
Sent: Wednesday, 28 September 2016 7:21 AM
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] Blocksize

Hi All,

Again, my thanks to all who responded to my last post.  Let me begin by stating 
something I unintentionally omitted in my last post … we already use SSDs for 
our metadata.

Which leads me to yet another question … of my three filesystems, two (/home 
and /scratch) are much older (created in 2010) and therefore currently have a 
512 byte inode size.  /data is newer and has a 4K inode size.  Now if I combine 
/scratch and /data into one filesystem with a 4K inode size, the amount of 
space used by all the inodes coming over from /scratch is going to grow by a 
factor of eight unless I’m horribly confused.  And I would assume I need to 
count the amount of space taken up by allocated inodes, not just used inodes.

Therefore … how much space my metadata takes up just grew significantly in 
importance since:  1) metadata is on very expensive enterprise class, vendor 
certified SSDs, 2) we use RAID 1 mirrors of those SSDs, and 3) we have metadata 
replication set to two.  Some of the information presented by Sven and Yuri 
seems to contradict each other in regards to how much space inodes take up … or 
I’m misunderstanding one or both of them!  Leaving aside replication, if I use 
a 256K block size for my metadata and I use 4K inodes, are those inodes going 
to take up 4K each or are they going to take up 8K each (1/32nd of a 256K 
block)?

By the way, I do have a file size / file age spreadsheet for each of my 
filesystems (which I would be willing to share with interested parties) and 
while I was not surprised to learn that I have over 10 million sub-1K files on 
/home, I was a bit surprised to find that I have almost as many sub-1K files on 
/scratch (and a few million more on /data).  So there’s a huge potential win in 
having those files in the inode on SSD as opposed to on spinning disk, but 
there’s also a huge potential $$$ cost.

Thanks again … I hope others are gaining useful information from this thread.  
I sure am!

Kevin

On Sep 27, 2016, at 1:26 PM, Yuri L Volobuev <volob...@us.ibm.com> wrote:

> 1) Let’s assume that our overarching goal in configuring the block
> size for metadata is performance from the user perspective … i.e.
> how fast is an “ls -l” on my directory?  Space savings aren’t
> important, and how long policy scans or other “administrative” type
> tasks take is not nearly as important as that directory listing.
> Does that change the recommended metadata block size?

The performance challenges for the "ls -l" scenario are quite different from 
the policy scan scenario, so the same rules do not necessarily apply.

During "ls -l" the code has to read inodes one by one (there's some prefetching 
going on, to take the edge off for the actual 'ls' thread, but prefetching is 
still one inode at a time).  Metadata block size doesn't really come into the 
picture in this case, but inode size can be important -- depending on the 
storage performance characteristics.  Does the storage you use for metadata 
exhibit a meaningfully different latency for 4K random reads vs 512 byte random 
reads?  In my personal experience, on any modern storage device the difference 
is non-existent; in fact many devices (like all flash-based storage) use 4K 
native physical block size, and merely emulate 512 byte "sectors", so there's 
no way to read less than 4K.  So from the inode read latency point of view 4K 
vs 512B is most likely a wash, but then 4K inodes can help improve performance 
of other operations, e.g. readdir of a small directory which fits entirely into 
the inode.  If you use xattrs (e.g. as a side effect of using HSM), 4K inodes 
definitely help, by allowing xattrs to be stored in the inode.

Policy scans read inodes in full blocks, and there both metadata block size 
and inode size matter.  Larger blocks could improve the inode read performance, 
while larger inodes mean that the same number of blocks hold fewer inodes and 
thus more blocks need to be read.  So the policy inode scan performance is 
benefited by larger metadata block size and smaller inodes.  However, policy 
scans also have to perform a directory traversal step, and that step tends to 
dominate the runtime of the overall run, and using larger inodes actually helps 
to speed up traversal of smaller directories.  So whether larger inodes help or 
hurt the policy scan performance depends, yet again, on your file system 
composition.  Overall, I believe that with all angles considered, larger inodes 
help wit

Re: [gpfsug-discuss] Blocksize

2016-09-27 Thread Buterbaugh, Kevin L
> …/scratch or /data.  /home has tons of small
> files - so small that a 64K block size is currently used.  /scratch
> and /data have a mixture, but a 1 MB block size is the “sweet spot” there.
>
> If you could “start all over” with the same hardware being the only
> restriction, would you:
>
> a) merge /scratch and /data into one filesystem but keep /home
> separate since the LUN sizes are so very different, or
> b) merge all three into one filesystem and use storage pools so that
> /home is just a separate pool within the one filesystem?  And if you
> chose this option would you assign different block sizes to the pools?

It's not possible to have different block sizes for different data pools.  We 
are very aware that many people would like to be able to do just that, but this 
is counter to where the product is going.  Supporting different block sizes for 
different pools is actually pretty hard: it's tricky to describe a large file 
that has some blocks in poolA and some in poolB where poolB has a different 
block size (perhaps during a migration) with the existing inode/indirect block 
format where each disk address pointer addresses a block of fixed size.  With 
some effort, and some changes to how block addressing works, it would be 
possible to implement the support for this.  However, as I mentioned in another 
post in this thread, we don't really want to glorify manual block size 
selection any further, we want to move away from it, by addressing the reasons 
that drive different block size selection today (like disk space utilization 
and performance).

I'd recommend calculating a file size distribution histogram for your file 
systems.  You may, for example, discover that 80% of the small files you have 
in /home would fit into 4K inodes, and then the storage space efficiency gains 
for the remaining 20% don't justify the complexity of managing an extra file 
system with a small block size.  We don't recommend using block sizes smaller 
than 256K, because smaller block size is not good for disk space allocation 
code efficiency.  It's a quadratic dependency: with a smaller block size, one 
block worth of the block allocation map covers that much less disk space, 
because each bit in the map covers fewer disk sectors, and fewer bits fit into 
a block.  This means having to create a lot more block allocation map segments 
than what is needed for an ample level of parallelism.  This hurts performance 
of many block allocation-related operations.

I don't see a reason for /scratch and /data to be separate file systems, aside 
from perhaps failure domain considerations.

yuri


> On Sep 26, 2016, at 2:29 PM, Yuri L Volobuev <volob...@us.ibm.com> wrote:
>
> I would put the net summary this way: in GPFS, the "Goldilocks zone"
> for metadata block size is 256K - 1M. If one plans to create a new
> file system using GPFS V4.2+, 1M is a sound choice.
>
> In an ideal world, block size choice shouldn't really be a choice.
> It's a low-level implementation detail that one day should go the
> way of the manual ignition timing adjustment -- something that used
> to be necessary in the olden days, and something that select
> enthusiasts like to tweak to this day, but something that's
> irrelevant for the overwhelming majority of the folks who just want
> the engine to run. There's work being done in that general direction
> in GPFS, but we aren't there yet.
>
> yuri
>
> Stephen Ulmer ---09/26/2016 12:02:25 PM---Now I’ve got
> another question… which I’ll let bake for a while. Okay, to (poorly) summarize:
>
> From: Stephen Ulmer <ul...@ulmer.org>
> To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>,
> Date: 09/26/2016 12:02 PM
> Subject: Re: [gpfsug-discuss] Blocksize
> Sent by: gpfsug-discuss-boun...@spectrumscale.org
>
>
>
> Now I’ve got another question… which I’ll let bake for a while.
>
> Okay, to (poorly) summarize:
> There are items OTHER THAN INODES stored as metadata in GPFS.
> These items have a VARIETY OF SIZES, but are packed in such a way
> that we should just not worry about wasted space unless we pick a
> LARGE metadata block size — or if we don’t pick a “reasonable”
> metadata block size after picking a “large” file system block size
> that applies to both.
> Performance is hard, and the gain from calculating exactly the best
> metadata block size is much smaller than performance gains attained
> through code optimization.
> If we were to try and calculate the appropriate metadata block size
> we would likely be wrong anyway, since none of us get our data at
> the idealized physics shop that sells massless rulers and
> frictionless pulleys.

Re: [gpfsug-discuss] Blocksize

2016-09-27 Thread Yuri L Volobuev
…smaller block size is not good for disk space allocation code efficiency.  It's a quadratic dependency:
with a smaller block size, one block worth of the block allocation map
covers that much less disk space, because each bit in the map covers fewer
disk sectors, and fewer bits fit into a block.  This means having to create
a lot more block allocation map segments than what is needed for an ample
level of parallelism.  This hurts performance of many block
allocation-related operations.

I don't see a reason for /scratch and /data to be separate file systems,
aside from perhaps failure domain considerations.

yuri


> On Sep 26, 2016, at 2:29 PM, Yuri L Volobuev  wrote:
>
> I would put the net summary this way: in GPFS, the "Goldilocks zone"
> for metadata block size is 256K - 1M. If one plans to create a new
> file system using GPFS V4.2+, 1M is a sound choice.
>
> In an ideal world, block size choice shouldn't really be a choice.
> It's a low-level implementation detail that one day should go the
> way of the manual ignition timing adjustment -- something that used
> to be necessary in the olden days, and something that select
> enthusiasts like to tweak to this day, but something that's
> irrelevant for the overwhelming majority of the folks who just want
> the engine to run. There's work being done in that general direction
> in GPFS, but we aren't there yet.
>
> yuri
>
> Stephen Ulmer ---09/26/2016 12:02:25 PM---Now I’ve got
> another question… which I’ll let bake for a while. Okay, to (poorly)
summarize:
>
> From: Stephen Ulmer 
> To: gpfsug main discussion list ,
> Date: 09/26/2016 12:02 PM
> Subject: Re: [gpfsug-discuss] Blocksize
> Sent by: gpfsug-discuss-boun...@spectrumscale.org
>
>
>
> Now I’ve got another question… which I’ll let bake for a while.
>
> Okay, to (poorly) summarize:
> There are items OTHER THAN INODES stored as metadata in GPFS.
> These items have a VARIETY OF SIZES, but are packed in such a way
> that we should just not worry about wasted space unless we pick a
> LARGE metadata block size — or if we don’t pick a “reasonable”
> metadata block size after picking a “large” file system block size
> that applies to both.
> Performance is hard, and the gain from calculating exactly the best
> metadata block size is much smaller than performance gains attained
> through code optimization.
> If we were to try and calculate the appropriate metadata block size
> we would likely be wrong anyway, since none of us get our data at
> the idealized physics shop that sells massless rulers and
> frictionless pulleys.
> We should probably all use a metadata block size around 1MB. Nobody
> has said this outright, but it’s been the example as the “good” size
> at least three times in this thread.
> Under no circumstances should we do what many of us would have done
> and pick 128K, which made sense based on all of our previous
> education that is no longer applicable.
>
> Did I miss anything? :)
>
> Liberty,
>
> --
> Stephen
>

> On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev  wrote:
> It's important to understand the differences between different
> metadata types, in particular where it comes to space allocation.
>
> System metadata files (inode file, inode and block allocation maps,
> ACL file, fileset metadata file, EA file in older versions) are
> allocated at well-defined moments (file system format, new storage
> pool creation in the case of block allocation map, etc), and those
> contain multiple records packed into a single block. From the block
> allocator point of view, the individual metadata record size is
> invisible, only larger blocks get actually allocated, and space
> usage efficiency generally isn't an issue.
>
> For user metadata (indirect blocks, directory blocks, EA overflow
> blocks) the situation is different. Those get allocated as the need
> arises, generally one at a time. So the size of an individual
> metadata structure matters, a lot. The smallest unit of allocation
> in GPFS is a subblock (1/32nd of a block). If an IB or a directory
> block is smaller than a subblock, the unused space in the subblock
> is wasted. So if one chooses to use, say, 16 MiB block size for
> metadata, the smallest unit of space that can be allocated is 512
> KiB. If one chooses 1 MiB block size, the smallest allocation unit
> is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with
> any reasonable data block size); directory blocks used to be limited
> to 32 KiB, but in the current code can be as large as 256 KiB. As
> one can observe, using 16 MiB metadata block size would lead to a
> considerable amount of wasted space for IBs and large directories
> (small directories can live in inodes). On the other hand, with 1
> MiB block si

Re: [gpfsug-discuss] Blocksize

2016-09-27 Thread Kevin D Johnson
…code optimization.

If we were to try and calculate the appropriate metadata block size we would likely be wrong anyway, since none of us get our data at the idealized physics shop that sells massless rulers and frictionless pulleys.

We should probably all use a metadata block size around 1MB. Nobody has said this outright, but it’s been the example as the “good” size at least three times in this thread.

Under no circumstances should we do what many of us would have done and pick 128K, which made sense based on all of our previous education that is no longer applicable.

Did I miss anything? :)

Liberty,

--
Stephen

On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev <volob...@us.ibm.com> wrote:

It's important to understand the differences between different metadata types, in particular where it comes to space allocation.

System metadata files (inode file, inode and block allocation maps, ACL file, fileset metadata file, EA file in older versions) are allocated at well-defined moments (file system format, new storage pool creation in the case of block allocation map, etc), and those contain multiple records packed into a single block. From the block allocator point of view, the individual metadata record size is invisible, only larger blocks get actually allocated, and space usage efficiency generally isn't an issue.

For user metadata (indirect blocks, directory blocks, EA overflow blocks) the situation is different. Those get allocated as the need arises, generally one at a time. So the size of an individual metadata structure matters, a lot. The smallest unit of allocation in GPFS is a subblock (1/32nd of a block). If an IB or a directory block is smaller than a subblock, the unused space in the subblock is wasted. So if one chooses to use, say, 16 MiB block size for metadata, the smallest unit of space that can be allocated is 512 KiB. If one chooses 1 MiB block size, the smallest allocation unit is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with any reasonable data block size); directory blocks used to be limited to 32 KiB, but in the current code can be as large as 256 KiB. As one can observe, using 16 MiB metadata block size would lead to a considerable amount of wasted space for IBs and large directories (small directories can live in inodes). On the other hand, with 1 MiB block size, there'll be no wasted metadata space. Does any of this actually make a practical difference? That depends on the file system composition, namely the number of IBs (which is a function of the number of large files) and larger directories. Calculating this scientifically can be pretty involved, and really should be the job of a tool that ought to exist, but doesn't (yet). A more practical approach is doing a ballpark estimate using local file counts and typical fractions of large files and directories, using statistics available from published papers.

The performance implications of a given metadata block size choice is a subject of nearly infinite depth, and this question ultimately can only be answered by doing experiments with a specific workload on specific hardware. The metadata space utilization efficiency is something that can be answered conclusively though.

yuri

"Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very int

From: "Buterbaugh, Kevin L" <kevin.buterba...@vanderbilt.edu>
To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>,
Date: 09/24/2016 07:19 AM
Subject: Re: [gpfsug-discuss] Blocksize
Sent by: gpfsug-discuss-boun...@spectrumscale.org

Hi Sven,

I am confused by your statement that the metadata block size should be 1 MB and am very interested in learning the rationale behind this as I am currently looking at all aspects of our current GPFS configuration and the possibility of making major changes.

If you have a filesystem with only metadataOnly disks in the system pool and the default size of an inode is 4K (which we would do, since we have recently discovered that even on our scratch filesystem we have a bazillion files that are 4K or smaller and could therefore have their data stored in the inode, right?), then why would you set the metadata block size to anything larger than 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block size for metadata wouldn't you be wasting a massive amount of space?

What am I mis

Re: [gpfsug-discuss] Blocksize

2016-09-27 Thread Alex Chekholko


On 09/27/2016 10:02 AM, Buterbaugh, Kevin L wrote:

1) Let’s assume that our overarching goal in configuring the block size
for metadata is performance from the user perspective … i.e. how fast is
an “ls -l” on my directory?  Space savings aren’t important, and how
long policy scans or other “administrative” type tasks take is not
nearly as important as that directory listing.  Does that change the
recommended metadata block size?


You need to put your metadata on SSDs.  Make your SSDs the only members 
in your 'system' pool and put your other devices into another pool, and 
make that pool 'dataOnly'.  If your SSDs are large enough to also hold 
some data, that's great; I typically do a migration policy to copy files 
smaller than filesystem block size (or definitely smaller than sub-block 
size) to the SSDs.  Also, files smaller than 4k will usually fit into 
the inode (if you are using the 4k inode size).
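A minimal sketch of the kind of migration rule Alex describes, assuming
the SSD ('system') pool also holds data (dataAndMetadata) and a 1 MiB data
block size -- the pool names and threshold are placeholders to adapt:

  RULE 'smallToSSD' MIGRATE FROM POOL 'data' TO POOL 'system' WHERE FILE_SIZE < 1048576

run through mmapplypolicy like the other policy examples in this thread.
(Per Marc's later reply, files whose data already fits in the inode aren't
actually moved by such a rule; only their preferred pool assignment changes.)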


I have a system where the SSDs are regularly doing 6-7k IOPS for 
metadata stuff.  If those same 7k IOPS were spread out over the slow 
data LUNs... which only have like 100 IOPS per 8+2P LUN...  I'd be 
consuming 700 disks just for metadata IOPS.


--
Alex Chekholko ch...@stanford.edu

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Blocksize, yea, inode size!

2016-09-27 Thread Marc A Kaplan
Inode size will be a crucial choice in the scenario you describe.

Consider the conflict: a large inode can hold a complete file or a 
complete directory.
But the bigger the inode size, the fewer inodes fit in any given block 
-- so when you have to read several inodes, that means more IO, and it is 
less likely that the inodes you want are in the same block.
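For a 1 MiB metadata block (the size used in the other examples here), that
trade-off works out to 1 MiB / 512 B = 2048 small inodes per block versus
1 MiB / 4 KiB = 256 large inodes per block -- roughly eight times as many
blocks to touch when reading widely scattered inodes, in exchange for each
inode holding much more (small file data, extended attributes, small
directories) before anything spills out of it.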



___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Blocksize

2016-09-27 Thread Buterbaugh, Kevin L
Yuri / Sven / anyone else who wants to jump in,

First off, thank you very much for your answers.  I’d like to follow up with a 
couple of more questions.

1) Let’s assume that our overarching goal in configuring the block size for 
metadata is performance from the user perspective … i.e. how fast is an “ls -l” 
on my directory?  Space savings aren’t important, and how long policy scans or 
other “administrative” type tasks take is not nearly as important as that 
directory listing.  Does that change the recommended metadata block size?

2)  Let’s assume we have 3 filesystems, /home, /scratch (traditional HPC use 
for those two) and /data (project space).  Our storage arrays are 24-bay units 
with two 8+2P RAID 6 LUNs, one RAID 1 mirror, and two hot spare drives.  The 
RAID 1 mirrors are for /home, the RAID 6 LUNs are for /scratch or /data.  /home 
has tons of small files - so small that a 64K block size is currently used.  
/scratch and /data have a mixture, but a 1 MB block size is the “sweet spot” 
there.

If you could “start all over” with the same hardware being the only 
restriction, would you:

a) merge /scratch and /data into one filesystem but keep /home separate since 
the LUN sizes are so very different, or
b) merge all three into one filesystem and use storage pools so that /home is 
just a separate pool within the one filesystem?  And if you chose this option 
would you assign different block sizes to the pools?

Again, I’m asking these questions because I may have the opportunity to 
effectively “start all over” and want to make sure I’m doing things as 
optimally as possible.  Thanks…

Kevin

On Sep 26, 2016, at 2:29 PM, Yuri L Volobuev <volob...@us.ibm.com> wrote:


I would put the net summary this way: in GPFS, the "Goldilocks zone" for 
metadata block size is 256K - 1M. If one plans to create a new file system 
using GPFS V4.2+, 1M is a sound choice.

In an ideal world, block size choice shouldn't really be a choice. It's a 
low-level implementation detail that one day should go the way of the manual 
ignition timing adjustment -- something that used to be necessary in the olden 
days, and something that select enthusiasts like to tweak to this day, but 
something that's irrelevant for the overwhelming majority of the folks who just 
want the engine to run. There's work being done in that general direction in 
GPFS, but we aren't there yet.

yuri

Stephen Ulmer ---09/26/2016 12:02:25 PM---Now I’ve got another 
question… which I’ll let bake for a while. Okay, to (poorly) summarize:

From: Stephen Ulmer <ul...@ulmer.org>
To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>,
Date: 09/26/2016 12:02 PM
Subject: Re: [gpfsug-discuss] Blocksize
Sent by: gpfsug-discuss-boun...@spectrumscale.org





Now I’ve got another question… which I’ll let bake for a while.

Okay, to (poorly) summarize:

 *   There are items OTHER THAN INODES stored as metadata in GPFS.
 *   These items have a VARIETY OF SIZES, but are packed in such a way that 
we should just not worry about wasted space unless we pick a LARGE metadata 
block size — or if we don’t pick a “reasonable” metadata block size after 
picking a “large” file system block size that applies to both.
 *   Performance is hard, and the gain from calculating exactly the best 
metadata block size is much smaller than performance gains attained through 
code optimization.
 *   If we were to try and calculate the appropriate metadata block size we 
would likely be wrong anyway, since none of us get our data at the idealized 
physics shop that sells massless rulers and frictionless pulleys.
 *   We should probably all use a metadata block size around 1MB. Nobody 
has said this outright, but it’s been the example as the “good” size at least 
three times in this thread.
 *   Under no circumstances should we do what many of us would have done 
and pick 128K, which made sense based on all of our previous education that is 
no longer applicable.

Did I miss anything? :)

Liberty,

--
Stephen



On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev <volob...@us.ibm.com> wrote:

It's important to understand the differences between different metadata types, 
in particular where it comes to space allocation.

System metadata files (inode file, inode and block allocation maps, ACL file, 
fileset metadata file, EA file in older versions) are allocated at well-defined 
moments (file system format, new storage pool creation in the case of block 
allocation map, etc), and those contain multiple records packed into a single 
block. From the block allocator point of view, the individual metadata record 
size is invisible, only larger blocks get actually allocated, and space usage 
efficiency generally isn't an issue.

For user metadata (indirect blocks, d

Re: [gpfsug-discuss] Blocksize

2016-09-26 Thread Yuri L Volobuev
I would put the net summary this way: in GPFS, the "Goldilocks zone" for
metadata block size is 256K - 1M.  If one plans to create a new file system
using GPFS V4.2+, 1M is a sound choice.

In an ideal world, block size choice shouldn't really be a choice.  It's a
low-level implementation detail that one day should go the way of the
manual ignition timing adjustment -- something that used to be necessary in
the olden days, and something that select enthusiasts like to tweak to this
day, but something that's irrelevant for the overwhelming majority of the
folks who just want the engine to run.  There's work being done in that
general direction in GPFS, but we aren't there yet.

yuri



From:   Stephen Ulmer 
To: gpfsug main discussion list ,
Date:   09/26/2016 12:02 PM
Subject:    Re: [gpfsug-discuss] Blocksize
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Now I’ve got another question… which I’ll let bake for a while.

Okay, to (poorly) summarize:
  There are items OTHER THAN INODES stored as metadata in GPFS.
  These items have a VARIETY OF SIZES, but are packed in such a way
  that we should just not worry about wasted space unless we pick a
  LARGE metadata block size — or if we don’t pick a “reasonable”
  metadata block size after picking a “large” file system block size
  that applies to both.
  Performance is hard, and the gain from calculating exactly the best
  metadata block size is much smaller than performance gains attained
  through code optimization.
  If we were to try and calculate the appropriate metadata block size
  we would likely be wrong anyway, since none of us get our data at the
  idealized physics shop that sells massless rulers and frictionless
  pulleys.
  We should probably all use a metadata block size around 1MB. Nobody
  has said this outright, but it’s been the example as the “good” size
  at least three times in this thread.
  Under no circumstances should we do what many of us would have done
  and pick 128K, which made sense based on all of our previous
  education that is no longer applicable.

Did I miss anything? :)

Liberty,

--
Stephen



  On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev 
  wrote:



  It's important to understand the differences between different
  metadata types, in particular where it comes to space allocation.

  System metadata files (inode file, inode and block allocation maps,
  ACL file, fileset metadata file, EA file in older versions) are
  allocated at well-defined moments (file system format, new storage
  pool creation in the case of block allocation map, etc), and those
  contain multiple records packed into a single block. From the block
  allocator point of view, the individual metadata record size is
  invisible, only larger blocks get actually allocated, and space usage
  efficiency generally isn't an issue.

  For user metadata (indirect blocks, directory blocks, EA overflow
  blocks) the situation is different. Those get allocated as the need
  arises, generally one at a time. So the size of an individual
  metadata structure matters, a lot. The smallest unit of allocation in
  GPFS is a subblock (1/32nd of a block). If an IB or a directory block
  is smaller than a subblock, the unused space in the subblock is
  wasted. So if one chooses to use, say, 16 MiB block size for
  metadata, the smallest unit of space that can be allocated is 512
  KiB. If one chooses 1 MiB block size, the smallest allocation unit is
  32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with any
  reasonable data block size); directory blocks used to be limited to
  32 KiB, but in the current code can be as large as 256 KiB. As one
  can observe, using 16 MiB metadata block size would lead to a
  considerable amount of wasted space for IBs and large directories
  (small directories can live in inodes). On the other hand, with 1 MiB
  block size, there'll be no wasted metadata space. Does any of this
  actually make a practical difference? That depends on the file system
  composition, namely the number of IBs (which is a function of the
  number of large files) and larger directories. Calculating this
  scientifically can be pretty involved, and really should be the job
  of a tool that ought to exist, but doesn't (yet). A more practical
  approach is doing a ballpark estimate using local file counts and
  typical fractions of large files and directories, using statistics
  available from published papers.

  The performance implications of a given metadata block size choice is
  a subject of nearly infinite depth, and this question ultimately can
  only be answered by doing experiments with a specific workload on
   specific hardware. The metadata space utilization efficiency is something
   that can be answered conclusively though.

Re: [gpfsug-discuss] Blocksize

2016-09-26 Thread Stephen Ulmer
Now I’ve got another question… which I’ll let bake for a while.

Okay, to (poorly) summarize:
There are items OTHER THAN INODES stored as metadata in GPFS.
These items have a VARIETY OF SIZES, but are packed in such a way that we 
should just not worry about wasted space unless we pick a LARGE metadata block 
size — or if we don’t pick a “reasonable” metadata block size after picking a 
“large” file system block size that applies to both.
Performance is hard, and the gain from calculating exactly the best metadata 
block size is much smaller than performance gains attained through code 
optimization.
If we were to try and calculate the appropriate metadata block size we would 
likely be wrong anyway, since none of us get our data at the idealized physics 
shop that sells massless rulers and frictionless pulleys.
We should probably all use a metadata block size around 1MB. Nobody has said 
this outright, but it’s been the example as the “good” size at least three 
times in this thread.
Under no circumstances should we do what many of us would have done and pick 
128K, which made sense based on all of our previous education that is no longer 
applicable.

Did I miss anything? :)

Liberty,

-- 
Stephen



> On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev  wrote:
> 
> It's important to understand the differences between different metadata 
> types, in particular where it comes to space allocation.
> 
> System metadata files (inode file, inode and block allocation maps, ACL file, 
> fileset metadata file, EA file in older versions) are allocated at 
> well-defined moments (file system format, new storage pool creation in the 
> case of block allocation map, etc), and those contain multiple records packed 
> into a single block. From the block allocator point of view, the individual 
> metadata record size is invisible, only larger blocks get actually allocated, 
> and space usage efficiency generally isn't an issue.
> 
> For user metadata (indirect blocks, directory blocks, EA overflow blocks) the 
> situation is different. Those get allocated as the need arises, generally one 
> at a time. So the size of an individual metadata structure matters, a lot. 
> The smallest unit of allocation in GPFS is a subblock (1/32nd of a block). If 
> an IB or a directory block is smaller than a subblock, the unused space in 
> the subblock is wasted. So if one chooses to use, say, 16 MiB block size for 
> metadata, the smallest unit of space that can be allocated is 512 KiB. If one 
> chooses 1 MiB block size, the smallest allocation unit is 32 KiB. IBs are 
> generally 16 KiB or 32 KiB in size (32 KiB with any reasonable data block 
> size); directory blocks used to be limited to 32 KiB, but in the current code 
> can be as large as 256 KiB. As one can observe, using 16 MiB metadata block 
> size would lead to a considerable amount of wasted space for IBs and large 
> directories (small directories can live in inodes). On the other hand, with 1 
> MiB block size, there'll be no wasted metadata space. Does any of this 
> actually make a practical difference? That depends on the file system 
> composition, namely the number of IBs (which is a function of the number of 
> large files) and larger directories. Calculating this scientifically can be 
> pretty involved, and really should be the job of a tool that ought to exist, 
> but doesn't (yet). A more practical approach is doing a ballpark estimate 
> using local file counts and typical fractions of large files and directories, 
> using statistics available from published papers.
> 
> The performance implications of a given metadata block size choice is a 
> subject of nearly infinite depth, and this question ultimately can only be 
> answered by doing experiments with a specific workload on specific hardware. 
> The metadata space utilization efficiency is something that can be answered 
> conclusively though.
> 
> yuri
> 
> "Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi Sven, I am 
> confused by your statement that the metadata block size should be 1 MB and am 
> very int
> 
> From: "Buterbaugh, Kevin L" 
> To: gpfsug main discussion list , 
> Date: 09/24/2016 07:19 AM
> Subject: Re: [gpfsug-discuss] Blocksize
> Sent by: gpfsug-discuss-boun...@spectrumscale.org
> 
> 
> 
> 
> Hi Sven, 
> 
> I am confused by your statement that the metadata block size should be 1 MB 
> and am very interested in learning the rationale behind this as I am 
> currently looking at all aspects of our current GPFS configuration and the 
> possibility of making major changes.
> 
> If you have a filesystem with only metadataOnly disks in the system pool and 
> the default size of an inode is 4K (which we would do, since we have recently 
> discovered that eve

Re: [gpfsug-discuss] Blocksize

2016-09-26 Thread Yuri L Volobuev
It's important to understand the differences between different metadata
types, in particular where it comes to space allocation.

System metadata files (inode file, inode and block allocation maps, ACL
file, fileset metadata file, EA file in older versions) are allocated at
well-defined moments (file system format,  new storage pool creation in the
case of block allocation map, etc), and those contain multiple records
packed into a single block.  From the block allocator point of view, the
individual metadata record size is invisible, only larger blocks get
actually allocated, and space usage efficiency generally isn't an issue.

For user metadata (indirect blocks, directory blocks, EA overflow blocks)
the situation is different.  Those get allocated as the need arises,
generally one at a time.  So the size of an individual metadata structure
matters, a lot.  The smallest unit of allocation in GPFS is a subblock
(1/32nd of a block).  If an IB or a directory block is smaller than a
subblock, the unused space in the subblock is wasted.  So if one chooses to
use, say, 16 MiB block size for metadata, the smallest unit of space that
can be allocated is 512 KiB.  If one chooses 1 MiB block size, the smallest
allocation unit is 32 KiB.  IBs are generally 16 KiB or 32 KiB in size (32
KiB with any reasonable data block size); directory blocks used to be
limited to 32 KiB, but in the current code can be as large as 256 KiB.  As
one can observe, using 16 MiB metadata block size would lead to a
considerable amount of wasted space for IBs and large directories (small
directories can live in inodes).  On the other hand, with 1 MiB block size,
there'll be no wasted metadata space.  Does any of this actually make a
practical difference?  That depends on the file system composition, namely
the number of IBs (which is a function of the number of large files) and
larger directories.  Calculating this scientifically can be pretty
involved, and really should be the job of a tool that ought to exist, but
doesn't (yet).  A more practical approach is doing a ballpark estimate
using local file counts and typical fractions of large files and
directories, using statistics available from published papers.
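To make the subblock arithmetic concrete (these figures only restate what is
implied above), the smallest allocation unit is blocksize/32:

  256 KiB metadata block  ->    8 KiB subblock
    1 MiB metadata block  ->   32 KiB subblock
   16 MiB metadata block  ->  512 KiB subblock

so a 32 KiB indirect block exactly fills one subblock at 1 MiB, but wastes
480 KiB of a 512 KiB subblock at 16 MiB.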

The performance implications of a given metadata block size choice are a
subject of nearly infinite depth, and this question ultimately can only be
answered by doing experiments with a specific workload on specific
hardware.  The metadata space utilization efficiency is something that can
be answered conclusively though.

yuri



From:   "Buterbaugh, Kevin L" 
To: gpfsug main discussion list ,
Date:   09/24/2016 07:19 AM
Subject:Re: [gpfsug-discuss] Blocksize
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Hi Sven,

I am confused by your statement that the metadata block size should be 1 MB
and am very interested in learning the rationale behind this as I am
currently looking at all aspects of our current GPFS configuration and the
possibility of making major changes.

If you have a filesystem with only metadataOnly disks in the system pool
and the default size of an inode is 4K (which we would do, since we have
recently discovered that even on our scratch filesystem we have a bazillion
files that are 4K or smaller and could therefore have their data stored in
the inode, right?), then why would you set the metadata block size to
anything larger than 128K when a sub-block is 1/32nd of a block?  I.e.,
with a 1 MB block size for metadata wouldn’t you be wasting  a massive
amount of space?

What am I missing / confused about there?

Oh, and here’s a related question … let’s just say I have the above
configuration … my system pool is metadata only and is on SSD’s.  Then I
have two other dataOnly pools that are spinning disk.  One is for “regular”
access and the other is the “capacity” pool … i.e. a pool of slower storage
where we move files with large access times.  I have a policy that says
something like “move all files with an access time > 6 months to the
capacity pool.”  Of those bazillion files less than 4K in size that are
fitting in the inode currently, probably half a bazillion of them
would be subject to that rule.  Will they get moved to the spinning disk
capacity pool or will they stay in the inode??

Thanks!  This is a very timely and interesting discussion for me as well...

Kevin

  On Sep 23, 2016, at 4:35 PM, Sven Oehme  wrote:



  your metadata block size these days should be 1 MB and there are only
  very few workloads for which you should run with a filesystem
  blocksize below 1 MB. so if you don't know exactly what to pick, 1 MB
  is a good starting point.
  the general rule still applies that your filesystem blocksize
  (metadata or data pool) should match your raid controller (or GNR
  vdisk) stripe size of the particular pool.

  so if you use a 128k strip size (default in many midrange storage
  controllers

Re: [gpfsug-discuss] Blocksize

2016-09-25 Thread Sven Oehme
Well, it's not that easy and there is no perfect answer here, so let's start
with some data points that might help decide:

Inodes, directory blocks, and allocation maps for data as well as metadata
don't follow the same restrictions as data 'fragments' or subblocks, meaning
they are not bound to 1/32nd of the blocksize. They rather get organized in
blocks of calculated size, which can be very small (significantly smaller
than 1/32nd) or close to the maximum blocksize for a single object.
Therefore the space-waste concern doesn't really apply here.

Policy scans love larger blocks: the blocks will be randomly scattered
across the NSDs, so the larger contiguous reads of an inode scan will
perform significantly faster with larger metadata blocksizes than with
smaller ones (assuming this is disk; with SSDs it doesn't matter that much).

So for disk-based systems it is advantageous to use larger blocks; for
SSD-based systems it's less of an issue. On the other hand, you shouldn't
choose too large a block even for disk-based systems, as there is one catch
to all this: small updates to metadata typically end up writing the whole
metadata block (e.g. 256k for a directory block), which then needs to be
destaged and read back by another node changing the same block.

Hope this helps. Sven





On Sat, Sep 24, 2016 at 7:18 AM Buterbaugh, Kevin L <
kevin.buterba...@vanderbilt.edu> wrote:

> Hi Sven,
>
> I am confused by your statement that the metadata block size should be 1
> MB and am very interested in learning the rationale behind this as I am
> currently looking at all aspects of our current GPFS configuration and the
> possibility of making major changes.
>
> If you have a filesystem with only metadataOnly disks in the system pool
> and the default size of an inode is 4K (which we would do, since we have
> recently discovered that even on our scratch filesystem we have a bazillion
> files that are 4K or smaller and could therefore have their data stored in
> the inode, right?), then why would you set the metadata block size to
> anything larger than 128K when a sub-block is 1/32nd of a block?  I.e.,
> with a 1 MB block size for metadata wouldn’t you be wasting  a massive
> amount of space?
>
> What am I missing / confused about there?
>
> Oh, and here’s a related question … let’s just say I have the above
> configuration … my system pool is metadata only and is on SSD’s.  Then I
> have two other dataOnly pools that are spinning disk.  One is for “regular”
> access and the other is the “capacity” pool … i.e. a pool of slower storage
> where we move files with large access times.  I have a policy that says
> something like “move all files with an access time > 6 months to the
> capacity pool.”  Of those bazillion files less than 4K in size that are
> fitting in the inode currently, probably half a bazillion of them
> would be subject to that rule.  Will they get moved to the spinning disk
> capacity pool or will they stay in the inode??
>
> Thanks!  This is a very timely and interesting discussion for me as well...
>
> Kevin
>
> On Sep 23, 2016, at 4:35 PM, Sven Oehme  wrote:
>
> your metadata block size these days should be 1 MB and there are only very
> few workloads for which you should run with a filesystem blocksize below 1
> MB. so if you don't know exactly what to pick, 1 MB is a good starting
> point.
> the general rule still applies that your filesystem blocksize (metadata or
> data pool) should match your raid controller (or GNR vdisk) stripe size of
> the particular pool.
>
> so if you use a 128k strip size (default in many midrange storage
> controllers) in an 8+2p raid array, your stripe or track size is 1 MB and
> therefore the blocksize of this pool should be 1 MB. I see many customers
> in the field using 1MB or even smaller blocksize on RAID stripes of 2 MB or
> above and your performance will be significantly impacted by that.
>
> Sven
>
> --
> Sven Oehme
> Scalable Storage Research
> email: oeh...@us.ibm.com
> Phone: +1 (408) 824-8904
> IBM Almaden Research Lab
> --
>
> Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too
> pedantic, but I believe the subblock size is 1/32 of the block size
> (which strengt
>
>
>
> From: Stephen Ulmer 
> To: gpfsug main discussion list 
> Date: 09/23/2016 12:16 PM
> Subject: Re: [gpfsug-discuss] Blocksize
> Sent by: gpfsug-discuss-boun...@spectrumscale.org
>
> --
>
>
>
> Not to be too pedantic, but I believe the subblock size is 1/32 of the
> block size (which strengthens Luis’s arguments below).
>
> I thought the original question was NOT about inode size, but a

Re: [gpfsug-discuss] Blocksize - consider IO transfer efficiency above your other prejudices

2016-09-24 Thread Marc A Kaplan
(I can answer your basic questions, Sven has more experience with tuning 
very large file systems, so perhaps he will have more to say...)

1. Inodes are packed into the file of inodes. (There is one file of all 
the inodes in a filesystem). 

If you have metadata-blocksize 1MB you will have 256 4KB inodes per 
block.   Forget about sub-blocks when it comes to the file of inodes.

2. IF a file's data fits in its inode, then migrating that file from one 
pool to another just changes the preferred pool name in the inode.  No 
data movement.  Should the file later "grow" to require a data block, that 
data block will be allocated from whatever pool is named in the inode at 
that time.

See the email I posted earlier today.  Basically: FORGET what you thought 
you knew about optimal metadata blocksize (perhaps based on how you 
thought metadata was laid out on disk) and just stick to optimal IO 
transfer blocksizes. 

Yes, there may be contrived scenarios or even a few real-life special 
cases, but those would be few and far between. 
Try following the newer general, easier, rule and see how well it works.




From:   "Buterbaugh, Kevin L" 
To: gpfsug main discussion list 
Date:   09/24/2016 10:19 AM
Subject:    Re: [gpfsug-discuss] Blocksize
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Hi Sven, 

I am confused by your statement that the metadata block size should be 1 
MB and am very interested in learning the rationale behind this as I am 
currently looking at all aspects of our current GPFS configuration and the 
possibility of making major changes.

If you have a filesystem with only metadataOnly disks in the system pool 
and the default size of an inode is 4K (which we would do, since we have 
recently discovered that even on our scratch filesystem we have a 
bazillion files that are 4K or smaller and could therefore have their data 
stored in the inode, right?), then why would you set the metadata block 
size to anything larger than 128K when a sub-block is 1/32nd of a block? 
I.e., with a 1 MB block size for metadata wouldn’t you be wasting  a 
massive amount of space?

What am I missing / confused about there?

Oh, and here’s a related question … let’s just say I have the above 
configuration … my system pool is metadata only and is on SSD’s.  Then I 
have two other dataOnly pools that are spinning disk.  One is for 
“regular” access and the other is the “capacity” pool … i.e. a pool of 
slower storage where we move files with large access times.  I have a 
policy that says something like “move all files with an access time > 6 
months to the capacity pool.”  Of those bazillion files less than 4K in 
size that are fitting in the inode currently, probably half a bazillion 
of them would be subject to that rule.  Will they get moved to 
the spinning disk capacity pool or will they stay in the inode??

Thanks!  This is a very timely and interesting discussion for me as 
well...

Kevin

On Sep 23, 2016, at 4:35 PM, Sven Oehme  wrote:

your metadata block size these days should be 1 MB and there are only very 
few workloads for which you should run with a filesystem blocksize below 1 
MB. so if you don't know exactly what to pick, 1 MB is a good starting 
point. 
the general rule still applies that your filesystem blocksize (metadata or 
data pool) should match your raid controller (or GNR vdisk) stripe size of 
the particular pool.

so if you use a 128k strip size (default in many midrange storage 
controllers) in an 8+2p RAID array, your stripe or track size is 1 MB and 
therefore the blocksize of this pool should be 1 MB. I see many customers 
in the field using a 1MB or even smaller blocksize on RAID stripes of 2 MB 
or above, and your performance will be significantly impacted by that. 

Sven

--
Sven Oehme 
Scalable Storage Research 
email: oeh...@us.ibm.com 
Phone: +1 (408) 824-8904 
IBM Almaden Research Lab 
--

Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too 
pedantic, but I believe the the subblock size is 1/32 of the block size 
(which strengt

From: Stephen Ulmer 
To: gpfsug main discussion list 
Date: 09/23/2016 12:16 PM
Subject: Re: [gpfsug-discuss] Blocksize
Sent by: gpfsug-discuss-boun...@spectrumscale.org



Not to be too pedantic, but I believe the subblock size is 1/32 of the 
block size (which strengthens Luis’s arguments below).

I thought the original question was NOT about inode size, but about 
metadata block size. You can specify that the system pool have a different 
block size from the rest of the filesystem, providing that it ONLY holds 
metadata (—metadata-block-size option to mmcrfs).

So with 4K inodes (which should be used for all new filesystems without 
some counter-indication), I would think that we’d want to use a metadata 
block size of 4K*32=128K. This is independent of the regular block size, 
which you can calculate based on the workload if you’re lucky.

Re: [gpfsug-discuss] Blocksize and MetaData Blocksizes - FORGET the old advice

2016-09-24 Thread Marc A Kaplan
Metadata is inodes, directories, indirect blocks (indices). 

Spectrum Scale (GPFS) Version 4.1 introduced significant improvements to 
the data structures used to represent directories.
Larger inodes supporting data and extended attributes in the inode are 
other significant relatively recent improvements.

Now small directories are stored in the inode, while for large directories 
blocks can be bigger than 32KB, and any and all directory blocks that are 
smaller than
the metadata-blocksize, are allocated just like "fragments" - so 
directories are now space efficient.

SO MUCH SO, that THE OLD ADVICE, about using smallish blocksizes for 
metadata, GOES "OUT THE WINDOW".  Period. FORGET most of what you thought
you knew about "best" or "optimal" metadata-blocksize.

The new advice is, as Sven wrote: 

Use a blocksize that optimizes IO transfer efficiency and speed.
This is true for BOTH data and metadata.

Now, IF you have system pool set up as metadata only AND system pool is on 
devices that have a different "optimal" block size than your other pools,
THEN, it may make sense to use two different blocksizes, one for data and 
another for metadata.

For example, maybe you have massively striped RAID or RAID-LIKE (GSS or 
ESS)) storage for huge files - so maybe 8MB is a good blocksize for that.
But maybe you have your metadata on SSD devices and maybe 1MB is the 
"best" blocksize for that.


___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Blocksize

2016-09-24 Thread Buterbaugh, Kevin L
Hi Sven,

I am confused by your statement that the metadata block size should be 1 MB and 
am very interested in learning the rationale behind this as I am currently 
looking at all aspects of our current GPFS configuration and the possibility of 
making major changes.

If you have a filesystem with only metadataOnly disks in the system pool and 
the default size of an inode is 4K (which we would do, since we have recently 
discovered that even on our scratch filesystem we have a bazillion files that 
are 4K or smaller and could therefore have their data stored in the inode, 
right?), then why would you set the metadata block size to anything larger than 
128K when a sub-block is 1/32nd of a block?  I.e., with a 1 MB block size for 
metadata wouldn’t you be wasting  a massive amount of space?

What am I missing / confused about there?

Oh, and here’s a related question … let’s just say I have the above 
configuration … my system pool is metadata only and is on SSD’s.  Then I have 
two other dataOnly pools that are spinning disk.  One is for “regular” access 
and the other is the “capacity” pool … i.e. a pool of slower storage where we 
move files with large access times.  I have a policy that says something like 
“move all files with an access time > 6 months to the capacity pool.”  Of those 
bazillion files less than 4K in size that are fitting in the inode currently, 
probably half a bazillion of them would be subject to that rule.  Will 
they get moved to the spinning disk capacity pool or will they stay in the 
inode??

Thanks!  This is a very timely and interesting discussion for me as well...

Kevin

On Sep 23, 2016, at 4:35 PM, Sven Oehme 
mailto:oeh...@us.ibm.com>> wrote:


your metadata block size these days should be 1 MB and there are only very few 
workloads for which you should run with a filesystem blocksize below 1 MB. so 
if you don't know exactly what to pick, 1 MB is a good starting point.
the general rule still applies that your filesystem blocksize (metadata or data 
pool) should match your raid controller (or GNR vdisk) stripe size of the 
particular pool.

so if you use a 128k strip size (default in many midrange storage controllers) in 
an 8+2p RAID array, your stripe or track size is 1 MB and therefore the 
blocksize of this pool should be 1 MB. I see many customers in the field using 
a 1MB or even smaller blocksize on RAID stripes of 2 MB or above, and your 
performance will be significantly impacted by that.

Sven

--
Sven Oehme
Scalable Storage Research
email: oeh...@us.ibm.com<mailto:oeh...@us.ibm.com>
Phone: +1 (408) 824-8904
IBM Almaden Research Lab
--

Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too pedantic, 
but I believe the the subblock size is 1/32 of the block size (which strengt

From: Stephen Ulmer mailto:ul...@ulmer.org>>
To: gpfsug main discussion list 
mailto:gpfsug-discuss@spectrumscale.org>>
Date: 09/23/2016 12:16 PM
Subject: Re: [gpfsug-discuss] Blocksize
Sent by: 
gpfsug-discuss-boun...@spectrumscale.org<mailto:gpfsug-discuss-boun...@spectrumscale.org>





Not to be too pedantic, but I believe the subblock size is 1/32 of the 
block size (which strengthens Luis’s arguments below).

I thought the original question was NOT about inode size, but about 
metadata block size. You can specify that the system pool have a different 
block size from the rest of the filesystem, providing that it ONLY holds 
metadata (—metadata-block-size option to mmcrfs).

So with 4K inodes (which should be used for all new filesystems without some 
counter-indication), I would think that we’d want to use a metadata block size 
of 4K*32=128K. This is independent of the regular block size, which you can 
calculate based on the workload if you’re lucky.

There could be a great reason NOT to use 128K metadata block size, but I don’t 
know what it is. I’d be happy to be corrected about this if it’s out of whack.

--
Stephen



On Sep 22, 2016, at 3:37 PM, Luis Bolinches 
mailto:luis.bolinc...@fi.ibm.com>> wrote:

Hi

My 2 cents.

Leave at least 4K inodes, then you get massive improvement on small files (less 
than 3.5K minus whatever you use on xattr)

About blocksize for data, unless you have actual data that suggests that you 
will actually benefit from smaller than 1MB block, leave it there. GPFS uses 
subblocks where 1/16th of the BS can be allocated to different files, so the 
"waste" is much less than you think on 1MB and you get the throughput and far 
fewer structures than with many more, smaller data blocks.

No warranty at all but I try to do this when the BS talk comes in: (might need 
some clean up it could not be last note but you get the idea)

POSIX
find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out
GPFS
cd /usr/lpp/mmfs/samples/ilm
gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile
.

Re: [gpfsug-discuss] Blocksize

2016-09-23 Thread Luis Bolinches
Not pedantic but correct.

I flipped it there; it is 1/32.

--
Cheers

> On 23 Sep 2016, at 22.16, Stephen Ulmer  wrote:
> 
> Not to be too pedantic, but I believe the subblock size is 1/32 of the 
> block size (which strengthens Luis’s arguments below).
> 
> I thought the original question was NOT about inode size, but about 
> metadata block size. You can specify that the system pool have a different 
> block size from the rest of the filesystem, providing that it ONLY holds 
> metadata (—metadata-block-size option to mmcrfs).
> 
> So with 4K inodes (which should be used for all new filesystems without some 
> counter-indication), I would think that we’d want to use a metadata block 
> size of 4K*32=128K. This is independent of the regular block size, which you 
> can calculate based on the workload if you’re lucky.
> 
> There could be a great reason NOT to use 128K metadata block size, but I 
> don’t know what it is. I’d be happy to be corrected about this if it’s out of 
> whack.
> 
> -- 
> Stephen
> 
> 
> 
>> On Sep 22, 2016, at 3:37 PM, Luis Bolinches  
>> wrote:
>> 
>> Hi
>>  
>> My 2 cents.
>>  
>> Leave at least 4K inodes, then you get massive improvement on small files 
>> (less 3.5K minus whatever you use on xattr)
>>  
>> About blocksize for data, unless you have actual data that suggest that you 
>> will actually benefit from smaller than 1MB block, leave there. GPFS uses 
>> sublocks where 1/16th of the BS can be allocated to different files, so the 
>> "waste" is much less than you think on 1MB and you get the throughput and 
>> less structures of much more data blocks.
>>  
>> No warranty at all but I try to do this when the BS talk comes in: (might 
>> need some clean up it could not be last note but you get the idea)
>>  
>> POSIX
>> find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out
>> GPFS
>> cd /usr/lpp/mmfs/samples/ilm
>> gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile
>> ./mmfind /gpfs/shared -ls -type f > find_ls_files.out
>> CONVERT to CSV
>> 
>> POSIX
>> cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv
>> GPFS
>> cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv
>> LOAD in octave
>> 
>> FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ","));
>> Clean the second column (OPTIONAL as the next clean up will do the same)
>> 
>> FILESIZE(:,[2]) = [];
>> If we are on 4K aligment we need to clean the files that go to inodes 
>> (WELL not exactly ... extended attributes! so maybe use a lower number!)
>> 
>> FILESIZE(FILESIZE<=3584) =[];
>> If we are not we need to clean the 0 size files
>> 
>> FILESIZE(FILESIZE==0) =[];
>> Median
>> 
>> FILESIZEMEDIAN = int32 (median (FILESIZE))
>> Mean
>> 
>> FILESIZEMEAN = int32 (mean (FILESIZE))
>> Variance
>> 
>> int32 (var (FILESIZE))
>> iqr interquartile range, i.e., the difference between the upper and 
>> lower quartile, of the input data.
>> 
>> int32 (iqr (FILESIZE))
>> Standard deviation
>>  
>>  
>> For some FS with lots of files you might need a rather powerful machine to 
>> run the calculations on octave, I never hit anything could not manage on a 
>> 64GB RAM Power box. Most of the times it is enough with my laptop.
>>  
>>  
>> 
>> --
>> Ystävällisin terveisin / Kind regards / Saludos cordiales / Salutations
>> 
>> Luis Bolinches
>> Lab Services
>> http://www-03.ibm.com/systems/services/labservices/
>> 
>> IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland
>> Phone: +358 503112585
>> 
>> "If you continually give you will continually have." Anonymous
>>  
>>  
>> - Original message -
>> From: Stef Coene 
>> Sent by: gpfsug-discuss-boun...@spectrumscale.org
>> To: gpfsug main discussion list 
>> Cc:
>> Subject: Re: [gpfsug-discuss] Blocksize
>> Date: Thu, Sep 22, 2016 10:30 PM
>>  
>> On 09/22/2016 09:07 PM, J. Eric Wonderley wrote:
>> > It defaults to 4k:
>> > mmlsfs testbs8M -i
>> > flagvaluedescription
>> > --- 
>> > ---
>> >  -i 4096 Inode size in bytes
>> >
>> > I think you can make as small as 512b.   G

Re: [gpfsug-discuss] Blocksize

2016-09-23 Thread Brian Marshall
To keep this great chain going:

If my metadata is on FLASH, would having a smaller blocksize for the system
pool (metadata only) be helpful?

My filesystem blocksize is 8MB

On Fri, Sep 23, 2016 at 6:34 PM, Stef Coene  wrote:

> On 09/22/2016 08:36 PM, Stef Coene wrote:
>
>> Hi,
>>
>> Is it needed to specify a different blocksize for the system pool that
>> holds the metadata?
>>
>> IBM recommends a 1 MB blocksize for the file system.
>> But I wonder a smaller blocksize (256 KB or so) for metadata is a good
>> idea or not...
>>
> I have read the replies and at the end, this is what we will do:
> Since the back-end storage will be V5000 with a default stripe size of
> 256KB and we use 8 data disk in an array, this means that 256KB * 8 = 2M is
> the best choice for block size.
> So 2 MB block size for data is the best choice.
>
> Since the block size for metadata is not that important in the latest
> releases, we will also go for 2 MB block size for metadata.
>
> Inode size will be left at the default: 4 KB.
>
>
>
> Stef
> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Blocksize

2016-09-23 Thread Stef Coene

On 09/22/2016 08:36 PM, Stef Coene wrote:

Hi,

Is it needed to specify a different blocksize for the system pool that
holds the metadata?

IBM recommends a 1 MB blocksize for the file system.
But I wonder a smaller blocksize (256 KB or so) for metadata is a good
idea or not...

I have read the replies and at the end, this is what we will do:
Since the back-end storage will be V5000 with a default stripe size of 
256KB and we use 8 data disks in an array, this means that 256KB * 8 = 2 MB 
is the best choice for block size.

So 2 MB block size for data is the best choice.

Since the block size for metadata is not that important in the latest 
releases, we will also go for 2 MB block size for metadata.


Inode size will be left at the default: 4 KB.
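
A sketch of the resulting create command under those choices; the device name 
and the stanza file name below are placeholders, not Stef's actual setup:

mmcrfs gpfs0 -F v5000_nsd.stanza -B 2M -i 4096

With no separate --metadata-block-size option, metadata simply uses the same 
2 MB block size as data, and -i 4096 keeps the default inode size.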


Stef
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Blocksize

2016-09-23 Thread Sven Oehme
your metadata block size these days should be 1 MB and there are only very
few workloads for which you should run with a filesystem blocksize below 1
MB. so if you don't know exactly what to pick, 1 MB is a good starting
point.
the general rule still applies that your filesystem blocksize (metadata or
data pool) should match your raid controller (or GNR vdisk) stripe size of
the particular pool.

so if you use a 128k strip size (default in many midrange storage
controllers) in an 8+2p RAID array, your stripe or track size is 1 MB and
therefore the blocksize of this pool should be 1 MB. I see many customers
in the field using a 1MB or even smaller blocksize on RAID stripes of 2 MB or
above, and your performance will be significantly impacted by that.
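
A quick back-of-envelope check of that rule (the numbers below are just the
examples from this paragraph):

# full stripe = strip (segment) size per data disk * number of data disks
strip_kib=128; data_disks=8            # 8+2p RAID 6, 128 KiB controller strip
echo "$(( strip_kib * data_disks )) KiB full stripe"   # prints "1024 KiB full stripe" -> -B 1M

By the same arithmetic, a 2 MB controller stripe would call for -B 2M.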

Sven

--
Sven Oehme
Scalable Storage Research
email: oeh...@us.ibm.com
Phone: +1 (408) 824-8904
IBM Almaden Research Lab
--



From:   Stephen Ulmer 
To: gpfsug main discussion list 
Date:   09/23/2016 12:16 PM
Subject:        Re: [gpfsug-discuss] Blocksize
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Not to be too pedantic, but I believe the subblock size is 1/32 of the
block size (which strengthens Luis’s arguments below).

I thought the original question was NOT about inode size, but about
metadata block size. You can specify that the system pool have a different
block size from the rest of the filesystem, providing that it ONLY holds
metadata (—metadata-block-size option to mmcrfs).

So with 4K inodes (which should be used for all new filesystems without
some counter-indication), I would think that we’d want to use a metadata
block size of 4K*32=128K. This is independent of the regular block size,
which you can calculate based on the workload if you’re lucky.

There could be a great reason NOT to use 128K metadata block size, but I
don’t know what it is. I’d be happy to be corrected about this if it’s out
of whack.

--
Stephen



  On Sep 22, 2016, at 3:37 PM, Luis Bolinches <
  luis.bolinc...@fi.ibm.com> wrote:

  Hi

  My 2 cents.

  Leave at least 4K inodes, then you get massive improvement on small
  files (less 3.5K minus whatever you use on xattr)

  About blocksize for data, unless you have actual data that suggest
  that you will actually benefit from smaller than 1MB block, leave
  there. GPFS uses sublocks where 1/16th of the BS can be allocated to
  different files, so the "waste" is much less than you think on 1MB
  and you get the throughput and less structures of much more data
  blocks.

  No warranty at all but I try to do this when the BS talk comes in:
  (might need some clean up it could not be last note but you get the
  idea)

  POSIX
  find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out
  GPFS
  cd /usr/lpp/mmfs/samples/ilm
  gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile
  ./mmfind /gpfs/shared -ls -type f > find_ls_files.out
  CONVERT to CSV

  POSIX
  cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv
  GPFS
  cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv
  LOAD in octave

  FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ","));
  Clean the second column (OPTIONAL as the next clean up will do
  the same)

  FILESIZE(:,[2]) = [];
  If we are on 4K aligment we need to clean the files that go to
  inodes (WELL not exactly ... extended attributes! so maybe use a
  lower number!)

  FILESIZE(FILESIZE<=3584) =[];
  If we are not we need to clean the 0 size files

  FILESIZE(FILESIZE==0) =[];
  Median

  FILESIZEMEDIAN = int32 (median (FILESIZE))
  Mean

  FILESIZEMEAN = int32 (mean (FILESIZE))
  Variance

  int32 (var (FILESIZE))
  iqr interquartile range, i.e., the difference between the upper
  and lower quartile, of the input data.

  int32 (iqr (FILESIZE))
  Standard deviation


  For some FS with lots of files you might need a rather powerful
  machine to run the calculations on octave, I never hit anything could
  not manage on a 64GB RAM Power box. Most of the times it is enough
  with my laptop.



  --
  Ystävällisin terveisin / Kind regards / Saludos cordiales /
  Salutations

  Luis Bolinches
  Lab Services
  http://www-03.ibm.com/systems/services/labservices/

  IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland
  Phone: +358 503112585

  "If you continually give you will continually have." Anonymous


   - Original message -
   From: Stef Coene 
   Sent by: gpfsug-discuss-boun...@spectrumscale.org
   To: gpfsug main discussion list

Re: [gpfsug-discuss] Blocksize

2016-09-23 Thread Stephen Ulmer
Not to be too pedantic, but I believe the subblock size is 1/32 of the 
block size (which strengthens Luis’s arguments below).

I thought the original question was NOT about inode size, but about 
metadata block size. You can specify that the system pool have a different 
block size from the rest of the filesystem, providing that it ONLY holds 
metadata (—metadata-block-size option to mmcrfs).

So with 4K inodes (which should be used for all new filesystems without some 
counter-indication), I would think that we’d want to use a metadata block size 
of 4K*32=128K. This is independent of the regular block size, which you can 
calculate based on the workload if you’re lucky.

There could be a great reason NOT to use 128K metadata block size, but I don’t 
know what it is. I’d be happy to be corrected about this if it’s out of whack.

-- 
Stephen



> On Sep 22, 2016, at 3:37 PM, Luis Bolinches  wrote:
> 
> Hi
>  
> My 2 cents.
>  
> Leave at least 4K inodes, then you get massive improvement on small files 
> (less than 3.5K minus whatever you use on xattr)
>  
> About blocksize for data, unless you have actual data that suggests that you 
> will actually benefit from smaller than 1MB block, leave it there. GPFS uses 
> subblocks where 1/16th of the BS can be allocated to different files, so the 
> "waste" is much less than you think on 1MB and you get the throughput and far 
> fewer structures than with many more, smaller data blocks.
>  
> No warranty at all but I try to do this when the BS talk comes in: (might 
> need some clean up it could not be last note but you get the idea)
>  
> POSIX
> find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out
> GPFS
> cd /usr/lpp/mmfs/samples/ilm
> gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile
> ./mmfind /gpfs/shared -ls -type f > find_ls_files.out
> CONVERT to CSV
> 
> POSIX
> cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv
> GPFS
> cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv
> LOAD in octave
> 
> FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ","));
> Clean the second column (OPTIONAL as the next clean up will do the same)
> 
> FILESIZE(:,[2]) = [];
> If we are on 4K aligment we need to clean the files that go to inodes 
> (WELL not exactly ... extended attributes! so maybe use a lower number!)
> 
> FILESIZE(FILESIZE<=3584) =[];
> If we are not we need to clean the 0 size files
> 
> FILESIZE(FILESIZE==0) =[];
> Median
> 
> FILESIZEMEDIAN = int32 (median (FILESIZE))
> Mean
> 
> FILESIZEMEAN = int32 (mean (FILESIZE))
> Variance
> 
> int32 (var (FILESIZE))
> iqr interquartile range, i.e., the difference between the upper and lower 
> quartile, of the input data.
> 
> int32 (iqr (FILESIZE))
> Standard deviation
>  
>  
> For some FS with lots of files you might need a rather powerful machine to 
> run the calculations on octave, I never hit anything could not manage on a 
> 64GB RAM Power box. Most of the times it is enough with my laptop.
>  
>  
> 
> --
> Ystävällisin terveisin / Kind regards / Saludos cordiales / Salutations
> 
> Luis Bolinches
> Lab Services
> http://www-03.ibm.com/systems/services/labservices/
> 
> IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland
> Phone: +358 503112585
> 
> "If you continually give you will continually have." Anonymous
>  
>  
> - Original message -
> From: Stef Coene 
> Sent by: gpfsug-discuss-boun...@spectrumscale.org
> To: gpfsug main discussion list 
> Cc:
> Subject: Re: [gpfsug-discuss] Blocksize
> Date: Thu, Sep 22, 2016 10:30 PM
>  
> On 09/22/2016 09:07 PM, J. Eric Wonderley wrote:
> > It defaults to 4k:
> > mmlsfs testbs8M -i
> > flagvaluedescription
> > --- 
> > ---
> >  -i 4096 Inode size in bytes
> >
> > I think you can make as small as 512b.   Gpfs will store very small
> > files in the inode.
> >
> > Typically you want your average file size to be your blocksize and your
> > filesystem has one blocksize and one inodesize.
> 
> The files are not small, but around 20 MB on average.
> So I calculated with IBM that a 1 MB or 2 MB block size is best.
> 
> But I'm not sure if it's better to use a smaller block size for the
> metadata.
> 
> The file system is not that large (400 TB) and will hold backup data
> from CommVault.
> 
> 
> Stef
> ___

Re: [gpfsug-discuss] Blocksize

2016-09-22 Thread Yuri L Volobuev

The current (V4.2+) levels of code support bigger directory block sizes, so
it's no longer an issue with something like 1M metadata block size.  In
fact, there isn't a whole lot of difference between 256K and 1M metadata
block sizes, either would work fine.  There isn't really a downside in
selecting a different block size for metadata though.

Inode size (mmcrfs -i option) is orthogonal to the metadata block size
selection.  We do strongly recommend using 4K inodes for everyone.  There's 
the obvious downside of needing more metadata storage for inodes, but the
advantages are significant.
yuri



From:   Jan-Frode Myklebust 
To: gpfsug main discussion list ,
Date:   09/22/2016 12:25 PM
Subject:    Re: [gpfsug-discuss] Blocksize
Sent by:gpfsug-discuss-boun...@spectrumscale.org



https://www.ibm.com/developerworks/community/forums/html/topic?id=----14774266

"Use 256K. Anything smaller makes allocation blocks for the inode file
inefficient. Anything larger wastes space for directories. These are the
two largest consumers of metadata space." --dlmcnabb

A bit old, but I would assume it still applies.


  -jf


On Thu, Sep 22, 2016 at 8:36 PM, Stef Coene  wrote:
  Hi,

  Is it needed to specify a different blocksize for the system pool that
  holds the metadata?

  IBM recommends a 1 MB blocksize for the file system.
  But I wonder a smaller blocksize (256 KB or so) for metadata is a good
  idea or not...


  Stef
  ___
  gpfsug-discuss mailing list
  gpfsug-discuss at spectrumscale.org
  http://gpfsug.org/mailman/listinfo/gpfsug-discuss
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Blocksize and space and performance for Metadata, release 4.2.x

2016-09-22 Thread Marc A Kaplan
There have been a few changes over the years that may invalidate some of 
the old advice about metadata and the disk allocations made for it.
These have been phased in over the last few years, I am discussing the 
present situation for release 4.2.x

1) Inode size.  Used to be 512.  Now you can set the inodesize at mmcrfs 
time.  Defaults to 4096.

2) Data in inode.  If it fits, then the inode holds the data.  Since a 512 
byte inode still works, you can have more than 3.5KB of data in a 4KB 
inode.

3) Extended Attributes in Inode.  Again, if it fits...  Extended 
attributes used to be stored in a separate file of metadata.  So extended 
attributes performance is way better than the old days.

4) (small) Directories in Inode.  If it fits, the inode of a directory can 
hold the directory entries.  That gives you about 2x performance on 
directory reads, for smallish directories.

5) Big directory blocks.  Directories used to use a maximum of 32KB per 
block, potentially wasting a lot of space and yielding poor performance 
for large directories.
Now directory blocks are the lesser of metadata-blocksize and 256KB.

6) Big directories are shrinkable.  Used to be directories would grow in 
32KB chunks but never shrink.  Yup, even an almost(?) "empty" directory 
would remain the size the directory had to be at its lifetime maximum. 
That means just a few remaining entries could be "sprinkled" over many 
directory blocks.  (See also 5.)
But now directories autoshrink to avoid wasteful sparsity.  Last I looked, 
the implementation just stopped short of "pushing" tiny directories back 
into the inode. But a huge directory can be shrunk down to a single 
(meta)data block.   (See --compact in the docs.)

--marc of GPFS
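
If you want to see how much of your file population already lives entirely in 
the inode, one hedged way is a LIST rule keyed on allocated blocks: a file 
whose data fits in the inode should report zero allocated KB.  The rule name, 
policy file name, device and output prefix below are illustrative assumptions:

RULE 'in_inode' LIST 'in_inode_files'
     WHERE FILE_SIZE > 0 AND KB_ALLOCATED = 0

mmapplypolicy gpfs0 -P in_inode.pol -I defer -f /tmp/in_inode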

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Blocksize

2016-09-22 Thread Luis Bolinches
Hi
 
My 2 cents.
 
Leave at least 4K inodes, then you get massive improvement on small files (less than 3.5K minus whatever you use on xattr)
 
About blocksize for data, unless you have actual data that suggests that you will actually benefit from smaller than 1MB block, leave it there. GPFS uses subblocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and far fewer structures than with many more, smaller data blocks.
 
No warranty at all but I try to do this when the BS talk comes in: (might need some clean up it could not be last note but you get the idea)
 
POSIX
find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out
GPFS
cd /usr/lpp/mmfs/samples/ilm
gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile
./mmfind /gpfs/shared -ls -type f > find_ls_files.out
    CONVERT to CSV
POSIX
cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv
GPFS
cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv
    LOAD in octave
FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ","));
    Clean the second column (OPTIONAL as the next clean up will do the same)
FILESIZE(:,[2]) = [];
    If we are on 4K aligment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!)
FILESIZE(FILESIZE<=3584) =[];
    If we are not we need to clean the 0 size files
FILESIZE(FILESIZE==0) =[];
    Median
FILESIZEMEDIAN = int32 (median (FILESIZE))
    Mean
FILESIZEMEAN = int32 (mean (FILESIZE))
    Variance
int32 (var (FILESIZE))
    iqr interquartile range, i.e., the difference between the upper and lower quartile, of the input data.
int32 (iqr (FILESIZE))
    Standard deviation
 
 
For some FS with lots of files you might need a rather powerful machine to run the calculations on octave, I never hit anything could not manage on a 64GB RAM Power box. Most of the times it is enough with my laptop.
 
 
--
Ystävällisin terveisin / Kind regards / Saludos cordiales / Salutations

Luis Bolinches
Lab Services
http://www-03.ibm.com/systems/services/labservices/

IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland
Phone: +358 503112585

"If you continually give you will continually have." Anonymous
 
 
- Original message -
From: Stef Coene 
Sent by: gpfsug-discuss-boun...@spectrumscale.org
To: gpfsug main discussion list 
Cc:
Subject: Re: [gpfsug-discuss] Blocksize
Date: Thu, Sep 22, 2016 10:30 PM
 
On 09/22/2016 09:07 PM, J. Eric Wonderley wrote:
> It defaults to 4k:
> mmlsfs testbs8M -i
> flag                value                    description
> --- 
> ---
>  -i                 4096                     Inode size in bytes
>
> I think you can make as small as 512b.   Gpfs will store very small
> files in the inode.
>
> Typically you want your average file size to be your blocksize and your
> filesystem has one blocksize and one inodesize.

The files are not small, but around 20 MB on average.
So I calculated with IBM that a 1 MB or 2 MB block size is best.

But I'm not sure if it's better to use a smaller block size for the
metadata.

The file system is not that large (400 TB) and will hold backup data
from CommVault.

Stef
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
 
Ellei edellä ole toisin mainittu: / Unless stated otherwise above:
Oy IBM Finland Ab
PL 265, 00101 Helsinki, Finland
Business ID, Y-tunnus: 0195876-3 
Registered in Finland


___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Blocksize

2016-09-22 Thread Stef Coene

On 09/22/2016 09:07 PM, J. Eric Wonderley wrote:

It defaults to 4k:
mmlsfs testbs8M -i
flagvaluedescription
--- 
---
 -i 4096 Inode size in bytes

I think you can make as small as 512b.   Gpfs will store very small
files in the inode.

Typically you want your average file size to be your blocksize and your
filesystem has one blocksize and one inodesize.


The files are not small, but around 20 MB on average.
So I calculated with IBM that a 1 MB or 2 MB block size is best.

But I'm not sure if it's better to use a smaller block size for the 
metadata.


The file system is not that large (400 TB) and will hold backup data 
from CommVault.



Stef
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Blocksize

2016-09-22 Thread Jan-Frode Myklebust
https://www.ibm.com/developerworks/community/forums/html/topic?id=----14774266

"Use 256K. Anything smaller makes allocation blocks for the inode file
inefficient. Anything larger wastes space for directories. These are the
two largest consumers of metadata space." --dlmcnabb

A bit old, but I would assume it still applies.


  -jf


On Thu, Sep 22, 2016 at 8:36 PM, Stef Coene  wrote:

> Hi,
>
> Is it needed to specify a different blocksize for the system pool that
> holds the metadata?
>
> IBM recommends a 1 MB blocksize for the file system.
> But I wonder a smaller blocksize (256 KB or so) for metadata is a good
> idea or not...
>
>
> Stef
> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Blocksize

2016-09-22 Thread Wahl, Edward
This is a great idea.  However there are quite a few other things to consider:

-max file count? If you need, say, a couple of billion files, this will affect 
things (see the inode-limit sketch after this list).  
-wish to store small files in the system pool in late model SS/GPFS? 
-encryption?  No data will be stored in the system pool so large blocks for 
small file storage in system is pointless. 
-system pool replication?
-HDD vs SSD for system pool?
-xxD or array tuning recommendations from your vendor?
-streaming vs random IO? Do you have a single dedicated app that has 
performance like xxx?
-probably more I can't think of off the top of my head.  etc etc
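
For the first item, a hedged sketch of how the file count gets sized up front 
or raised later; the numbers, device name and stanza file are placeholders:

mmcrfs gpfs0 -F nsd.stanza -B 1M -i 4096 --inode-limit 2000000000:100000000
mmchfs gpfs0 --inode-limit 2000000000     # raise the ceiling on an existing filesystem

Keep in mind that with 4K inodes each allocated inode costs more metadata 
space, which feeds straight back into the system pool sizing questions above.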

Ed


From: gpfsug-discuss-boun...@spectrumscale.org 
[gpfsug-discuss-boun...@spectrumscale.org] on behalf of Stef Coene 
[stef.co...@docum.org]
Sent: Thursday, September 22, 2016 2:36 PM
To: gpfsug main discussion list
Subject: [gpfsug-discuss] Blocksize

Hi,

Is it needed to specify a different blocksize for the system pool that
holds the metadata?

IBM recommends a 1 MB blocksize for the file system.
But I wonder a smaller blocksize (256 KB or so) for metadata is a good
idea or not...


Stef
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Blocksize

2016-09-22 Thread J. Eric Wonderley
It defaults to 4k:
mmlsfs testbs8M -i
flagvaluedescription
--- 
---
 -i 4096 Inode size in bytes

I think you can make as small as 512b.   Gpfs will store very small files
in the inode.

Typically you want your average file size to be your blocksize and your
filesystem has one blocksize and one inodesize.


On Thu, Sep 22, 2016 at 2:36 PM, Stef Coene  wrote:

> Hi,
>
> Is it needed to specify a different blocksize for the system pool that
> holds the metadata?
>
> IBM recommends a 1 MB blocksize for the file system.
> But I wonder a smaller blocksize (256 KB or so) for metadata is a good
> idea or not...
>
>
> Stef
> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss