Re: [gpfsug-discuss] Blocksize
Are there any presentations available online that provide diagrams of the directory/file creation process and modifications, in terms of how the blocks, inodes, indirect blocks, etc. are used? I would guess there are a few different cases that would need to be shown. This is the sort of thing that would be great in a decent textbook on GPFS (which doesn't exist, as far as I am aware).

Cheers,
Greg

From: gpfsug-discuss-boun...@spectrumscale.org [mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of Marc A Kaplan
Sent: Thursday, 29 September 2016 1:23 AM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] Blocksize

> OKAY, I'll say it again. inodes are PACKED into a single inode file. So a 4KB inode takes 4KB, REGARDLESS of metadata blocksize. There is no wasted space. [...]
Re: [gpfsug-discuss] Blocksize - file size distribution
On Wed, 28 Sep 2016 10:34:05 -0400 Marc A Kaplan wrote:
> Consider using samples/ilm/mmfind (or mmapplypolicy with a LIST ...
> SHOW rule) to gather the stats much faster. Should be minutes, not
> hours.

I'll second the policy engine. It runs like a beast if you tune it a little for nodes and threads -- it only takes a couple of minutes to collect info on over a hundred million files. Want to show where the data is now by pool, sorted by age, with queries? Here's a quick hacked-up example; you could sort the mess on the front end fairly quickly (use fileset, pool, etc. as your storage needs dictate):

RULE '2yrold_files' LIST '2yrold_filelist.txt'
  SHOW (varchar(file_size) || ' ' || varchar(USER_ID) || ' ' || varchar(POOL_NAME))
  WHERE DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME) >= 730
    AND DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME) < 1095

Don't forget to run the engine with -I defer for this kind of list/show policy.

Ed
--
Ed Wahl
Ohio Supercomputer Center
614-292-9302
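[A minimal invocation sketch for Ed's rule, assuming it is saved as /tmp/age.pol and that /gpfs/fs0 is the file system to scan; the paths, node names, and exact list-file name here are illustrative, not from Ed's setup.]

# Run the scan in deferred mode: build the candidate lists, but do
# not execute anything against the matched files.
mmapplypolicy /gpfs/fs0 \
    -P /tmp/age.pol \
    -I defer \
    -f /tmp/scan \
    -N nsdnode1,nsdnode2 \
    -L 1

# With -I defer and -f, the LIST output lands in a file named after
# the rule's list name, e.g. /tmp/scan.list.2yrold_filelist.txt.
# Sort it by the first SHOW column (file size) on the front end:
sort -n -k4,4 /tmp/scan.list.2yrold_filelist.txt | tail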
Re: [gpfsug-discuss] Blocksize
OKAY, I'll say it again. inodes are PACKED into a single inode file. So a 4KB inode takes 4KB, REGARDLESS of metadata blocksize. There is no wasted space. (Of course if you have metadata replication = 2, then yes, double that. And yes, there is overhead for indirect blocks (indices), allocation maps, etc, etc.)

And your choice is not just 512 or 4096. Maybe 1KB or 2KB is a good choice for your data distribution, to optimize packing of data and/or directories into inodes... Hmmm... I don't know why the doc leaves out 2048, perhaps a typo...

mmcrfs x2K -i 2048

[root@n2 charts]# mmlsfs x2K -i
flag    value    description
----    -----    -------------------
-i      2048     Inode size in bytes

Works for me!
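[A quick back-of-envelope sketch of what inode packing means for sizing the metadata SSDs; the inode count is invented for illustration.]

# Space consumed by the inode file is simply inodes x inode size,
# independent of the metadata block size (then doubled here by
# metadata replication = 2):
awk 'BEGIN {
    inodes = 100e6            # allocated inodes (hypothetical)
    isize  = 4096             # -i 4096
    repl   = 2                # metadata replicas
    printf "inode file: %.1f GiB\n", inodes * isize * repl / 2^30
}'
# -> inode file: 762.9 GiB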
Re: [gpfsug-discuss] Blocksize - file size distribution
Consider using samples/ilm/mmfind (or mmapplypolicy with a LIST ... SHOW rule) to gather the stats much faster. Should be minutes, not hours.
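[As a concrete sketch of the mmapplypolicy route, a one-rule policy listing every file's size plus an awk pass to bucket the results into a power-of-two histogram; the paths are hypothetical, and the awk assumes deferred list lines of the form "inode gen snapid SIZE -- path" (adjust the field numbers if your release formats them differently).]

/* allsizes.pol -- emit one FILE_SIZE per file */
RULE 'allsizes' LIST 'sizes' SHOW (varchar(FILE_SIZE))

# Deferred scan, then bucket the sizes:
mmapplypolicy /gpfs/fs0 -P allsizes.pol -I defer -f /tmp/fs0
awk '$5 == "--" { sz = $4; b = 0
                  while (sz >= 1) { sz /= 2; b++ }
                  hist[b]++; if (b > max) max = b }
     END { for (b = 0; b <= max; b++)
             if (hist[b]) printf "< 2^%-2d bytes: %d files\n", b, hist[b] }' \
    /tmp/fs0.list.sizes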
Re: [gpfsug-discuss] Blocksize - file size distribution
/usr/lpp/mmfs/samples/debugtools/filehist

Look at the README in that directory.

Bob Oesterlin
Sr Storage Engineer, Nuance HPC Grid

From: on behalf of "greg.lehm...@csiro.au"
Reply-To: gpfsug main discussion list
Date: Wednesday, September 28, 2016 at 2:40 AM
To: "gpfsug-discuss@spectrumscale.org"
Subject: [EXTERNAL] Re: [gpfsug-discuss] Blocksize

> I am wondering what people use to produce a file size distribution report for their filesystems. Has everyone rolled their own, or is there some go-to app to use? [...]
Re: [gpfsug-discuss] Blocksize
I am wondering what people use to produce a file size distribution report for their filesystems. Has everyone rolled their own, or is there some go-to app to use?

Cheers,
Greg

From: gpfsug-discuss-boun...@spectrumscale.org [mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of Buterbaugh, Kevin L
Sent: Wednesday, 28 September 2016 7:21 AM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] Blocksize

Hi All,

Again, my thanks to all who responded to my last post. Let me begin by stating something I unintentionally omitted in my last post … we already use SSDs for our metadata.

Which leads me to yet another question … of my three filesystems, two (/home and /scratch) are much older (created in 2010) and therefore currently have a 512 byte inode size. /data is newer and has a 4K inode size. Now if I combine /scratch and /data into one filesystem with a 4K inode size, the amount of space used by all the inodes coming over from /scratch is going to grow by a factor of eight, unless I'm horribly confused. And I would assume I need to count the amount of space taken up by allocated inodes, not just used inodes. Therefore … how much space my metadata takes up just grew significantly in importance, since: 1) metadata is on very expensive enterprise class, vendor certified SSDs, 2) we use RAID 1 mirrors of those SSDs, and 3) we have metadata replication set to two.

Some of the information presented by Sven and Yuri seems to contradict each other in regards to how much space inodes take up … or I'm misunderstanding one or both of them! Leaving aside replication, if I use a 256K block size for my metadata and I use 4K inodes, are those inodes going to take up 4K each or are they going to take up 8K each (1/32nd of a 256K block)?

By the way, I do have a file size / file age spreadsheet for each of my filesystems (which I would be willing to share with interested parties) and while I was not surprised to learn that I have over 10 million sub-1K files on /home, I was a bit surprised to find that I have almost as many sub-1K files on /scratch (and a few million more on /data). So there's a huge potential win in having those files in the inode on SSD as opposed to on spinning disk, but there's also a huge potential $$$ cost.

Thanks again … I hope others are gaining useful information from this thread. I sure am!

Kevin

On Sep 27, 2016, at 1:26 PM, Yuri L Volobuev <volob...@us.ibm.com> wrote:

> 1) Let's assume that our overarching goal in configuring the block
> size for metadata is performance from the user perspective … i.e.
> how fast is an "ls -l" on my directory? Space savings aren't
> important, and how long policy scans or other "administrative" type
> tasks take is not nearly as important as that directory listing.
> Does that change the recommended metadata block size?

The performance challenges for the "ls -l" scenario are quite different from the policy scan scenario, so the same rules do not necessarily apply. During "ls -l" the code has to read inodes one by one (there's some prefetching going on, to take the edge off for the actual 'ls' thread, but prefetching is still one inode at a time). Metadata block size doesn't really come into the picture in this case, but inode size can be important -- depending on the storage performance characteristics. Does the storage you use for metadata exhibit a meaningfully different latency for 4K random reads vs 512 byte random reads? In my personal experience, on any modern storage device the difference is non-existent; in fact many devices (like all flash-based storage) use a 4K native physical block size, and merely emulate 512 byte "sectors", so there's no way to read less than 4K. So from the inode read latency point of view 4K vs 512B is most likely a wash, but then 4K inodes can help improve performance of other operations, e.g. readdir of a small directory which fits entirely into the inode. If you use xattrs (e.g. as a side effect of using HSM), 4K inodes definitely help, by allowing xattrs to be stored in the inode.

Policy scans read inodes in full blocks, and there both metadata block size and inode size matter. Larger blocks could improve the inode read performance, while larger inodes mean that the same number of blocks hold fewer inodes and thus more blocks need to be read. So the policy inode scan performance is benefited by larger metadata block size and smaller inodes. However, policy scans also have to perform a directory traversal step, and that step tends to dominate the runtime of the overall run, and using larger inodes actually helps to speed up traversal of smaller directories. So whether larger inodes help or hurt the policy scan performance depends, yet again, on your file system composition. Overall, I believe that with all angles considered, larger inodes help with [...]
Re: [gpfsug-discuss] Blocksize
> [...] The RAID 1 mirrors are for /home, the RAID 6 LUNs are for /scratch or /data. /home has tons of small
> files - so small that a 64K block size is currently used. /scratch
> and /data have a mixture, but a 1 MB block size is the "sweet spot" there.
>
> If you could "start all over" with the same hardware being the only
> restriction, would you:
>
> a) merge /scratch and /data into one filesystem but keep /home
> separate since the LUN sizes are so very different, or
> b) merge all three into one filesystem and use storage pools so that
> /home is just a separate pool within the one filesystem? And if you
> chose this option would you assign different block sizes to the pools?

It's not possible to have different block sizes for different data pools. We are very aware that many people would like to be able to do just that, but this is counter to where the product is going. Supporting different block sizes for different pools is actually pretty hard: it's tricky to describe a large file that has some blocks in poolA and some in poolB where poolB has a different block size (perhaps during a migration) with the existing inode/indirect block format, where each disk address pointer addresses a block of fixed size. With some effort, and some changes to how block addressing works, it would be possible to implement support for this. However, as I mentioned in another post in this thread, we don't really want to glorify manual block size selection any further; we want to move away from it, by addressing the reasons that drive different block size selection today (like disk space utilization and performance).

I'd recommend calculating a file size distribution histogram for your file systems. You may, for example, discover that 80% of the small files you have in /home would fit into 4K inodes, and then the storage space efficiency gains for the remaining 20% don't justify the complexity of managing an extra file system with a small block size.

We don't recommend using block sizes smaller than 256K, because a smaller block size is not good for disk space allocation code efficiency. It's a quadratic dependency: with a smaller block size, one block worth of the block allocation map covers that much less disk space, because each bit in the map covers fewer disk sectors, and fewer bits fit into a block. This means having to create a lot more block allocation map segments than what is needed for an ample level of parallelism. This hurts performance of many block allocation-related operations.

I don't see a reason for /scratch and /data to be separate file systems, aside from perhaps failure domain considerations.

yuri

> On Sep 26, 2016, at 2:29 PM, Yuri L Volobuev <volob...@us.ibm.com> wrote:
>
> I would put the net summary this way: in GPFS, the "Goldilocks zone"
> for metadata block size is 256K - 1M. If one plans to create a new
> file system using GPFS V4.2+, 1M is a sound choice. [...]
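[A back-of-envelope illustration of that quadratic dependency, assuming roughly one allocation-map bit per sub-block; the real on-disk map has more structure, so treat these as order-of-magnitude figures only.]

# Disk space covered by one block of the allocation map:
#   bits per map block = blocksize * 8
#   space per bit      = one sub-block = blocksize / 32
#   coverage           = blocksize^2 * 8 / 32 = blocksize^2 / 4
awk 'BEGIN {
    split("65536 262144 1048576", bs, " ")
    for (i = 1; i <= 3; i++) {
        b = bs[i]
        printf "%4d KiB block -> one map block covers ~%5.0f GiB\n",
               b / 1024, (b * b / 4) / 2^30
    }
}'
# ->   64 KiB block -> one map block covers ~    1 GiB
#     256 KiB block -> one map block covers ~   16 GiB
#    1024 KiB block -> one map block covers ~  256 GiB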
Re: [gpfsug-discuss] Blocksize
On 09/27/2016 10:02 AM, Buterbaugh, Kevin L wrote:
> 1) Let's assume that our overarching goal in configuring the block size for metadata is performance from the user perspective … i.e. how fast is an "ls -l" on my directory? Space savings aren't important, and how long policy scans or other "administrative" type tasks take is not nearly as important as that directory listing. Does that change the recommended metadata block size?

You need to put your metadata on SSDs. Make your SSDs the only members in your 'system' pool and put your other devices into another pool, and make that pool 'dataOnly'. If your SSDs are large enough to also hold some data, that's great; I typically do a migration policy to copy files smaller than the filesystem block size (or definitely smaller than the sub-block size) to the SSDs. Also, files smaller than 4k will usually fit into the inode (if you are using the 4k inode size).

I have a system where the SSDs are regularly doing 6-7k IOPS for metadata stuff. If those same 7k IOPS were spread out over the slow data LUNs... which only have like 100 IOPS per 8+2P LUN... I'd be consuming 700 disks just for metadata IOPS.

--
Alex Chekholko ch...@stanford.edu
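[A sketch of the kind of migration policy Alex describes, assuming the spinning-disk pool is named 'data', the SSD system pool is allowed to hold data, and a 32K cutoff (one sub-block of a 1M-block data pool); all of those specifics are illustrative, not from Alex's configuration.]

/* small2ssd.pol -- keep files smaller than one data-pool sub-block on SSD */
RULE 'small2ssd' MIGRATE FROM POOL 'data' TO POOL 'system'
  WHERE FILE_SIZE <= 32768   /* 1M block / 32 sub-blocks */

/* and push anything that has grown past the cutoff back to disk */
RULE 'big2disk' MIGRATE FROM POOL 'system' TO POOL 'data'
  WHERE FILE_SIZE > 32768

/* run with: mmapplypolicy /gpfs/fs0 -P small2ssd.pol */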
Re: [gpfsug-discuss] Blocksize, yea, inode size!
Inode size will be a crucial choice in the scenario you describe. Consider the conflict: a large inode can hold a complete file or a complete directory. But the bigger the inode size, the fewer that fit in any given block size -- so when you have to read several inodes ... more IO, and it is less likely that the inodes you want are in the same block.
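[The trade-off in numbers, as a quick sketch; this is straight division, since inodes are packed end-to-end in the inode file.]

# Inodes per 1 MiB metadata block at each supported inode size:
awk 'BEGIN {
    split("512 1024 2048 4096", sz, " ")
    for (i = 1; i <= 4; i++)
        printf "%4d B inodes -> %4d per 1 MiB block\n",
               sz[i], 2^20 / sz[i]
}'
# ->  512 B inodes -> 2048 per 1 MiB block
#    1024 B inodes -> 1024 per 1 MiB block
#    2048 B inodes ->  512 per 1 MiB block
#    4096 B inodes ->  256 per 1 MiB block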
Re: [gpfsug-discuss] Blocksize
Yuri / Sven / anyone else who wants to jump in,

First off, thank you very much for your answers. I'd like to follow up with a couple more questions.

1) Let's assume that our overarching goal in configuring the block size for metadata is performance from the user perspective … i.e. how fast is an "ls -l" on my directory? Space savings aren't important, and how long policy scans or other "administrative" type tasks take is not nearly as important as that directory listing. Does that change the recommended metadata block size?

2) Let's assume we have 3 filesystems, /home, /scratch (traditional HPC use for those two) and /data (project space). Our storage arrays are 24-bay units with two 8+2P RAID 6 LUNs, one RAID 1 mirror, and two hot spare drives. The RAID 1 mirrors are for /home, the RAID 6 LUNs are for /scratch or /data. /home has tons of small files - so small that a 64K block size is currently used. /scratch and /data have a mixture, but a 1 MB block size is the "sweet spot" there.

If you could "start all over" with the same hardware being the only restriction, would you:

a) merge /scratch and /data into one filesystem but keep /home separate since the LUN sizes are so very different, or
b) merge all three into one filesystem and use storage pools so that /home is just a separate pool within the one filesystem? And if you chose this option would you assign different block sizes to the pools?

Again, I'm asking these questions because I may have the opportunity to effectively "start all over" and want to make sure I'm doing things as optimally as possible.

Thanks…

Kevin

On Sep 26, 2016, at 2:29 PM, Yuri L Volobuev <volob...@us.ibm.com> wrote:

> I would put the net summary this way: in GPFS, the "Goldilocks zone" for metadata block size is 256K - 1M. If one plans to create a new file system using GPFS V4.2+, 1M is a sound choice. [...]
Re: [gpfsug-discuss] Blocksize
I would put the net summary this way: in GPFS, the "Goldilocks zone" for metadata block size is 256K - 1M. If one plans to create a new file system using GPFS V4.2+, 1M is a sound choice.

In an ideal world, block size choice shouldn't really be a choice. It's a low-level implementation detail that one day should go the way of the manual ignition timing adjustment -- something that used to be necessary in the olden days, and something that select enthusiasts like to tweak to this day, but something that's irrelevant for the overwhelming majority of the folks who just want the engine to run. There's work being done in that general direction in GPFS, but we aren't there yet.

yuri

From: Stephen Ulmer <ul...@ulmer.org>
To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
Date: 09/26/2016 12:02 PM
Subject: Re: [gpfsug-discuss] Blocksize
Sent by: gpfsug-discuss-boun...@spectrumscale.org

> Now I've got another question… which I'll let bake for a while.
>
> Okay, to (poorly) summarize: [...]
Re: [gpfsug-discuss] Blocksize
Now I've got another question… which I'll let bake for a while.

Okay, to (poorly) summarize:

* There are items OTHER THAN INODES stored as metadata in GPFS.
* These items have a VARIETY OF SIZES, but are packed in such a way that we should just not worry about wasted space unless we pick a LARGE metadata block size — or if we don't pick a "reasonable" metadata block size after picking a "large" file system block size that applies to both.
* Performance is hard, and the gain from calculating exactly the best metadata block size is much smaller than performance gains attained through code optimization.
* If we were to try and calculate the appropriate metadata block size we would likely be wrong anyway, since none of us get our data at the idealized physics shop that sells massless rulers and frictionless pulleys.
* We should probably all use a metadata block size around 1MB. Nobody has said this outright, but it's been the example as the "good" size at least three times in this thread.
* Under no circumstances should we do what many of us would have done and pick 128K, which made sense based on all of our previous education that is no longer applicable.

Did I miss anything? :)

Liberty,

--
Stephen

> On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev wrote:
>
> It's important to understand the differences between different metadata types, in particular where it comes to space allocation. [...]
Re: [gpfsug-discuss] Blocksize
It's important to understand the differences between different metadata types, in particular where it comes to space allocation.

System metadata files (inode file, inode and block allocation maps, ACL file, fileset metadata file, EA file in older versions) are allocated at well-defined moments (file system format, new storage pool creation in the case of the block allocation map, etc), and those contain multiple records packed into a single block. From the block allocator point of view, the individual metadata record size is invisible; only larger blocks get actually allocated, and space usage efficiency generally isn't an issue.

For user metadata (indirect blocks, directory blocks, EA overflow blocks) the situation is different. Those get allocated as the need arises, generally one at a time. So the size of an individual metadata structure matters, a lot. The smallest unit of allocation in GPFS is a subblock (1/32nd of a block). If an IB or a directory block is smaller than a subblock, the unused space in the subblock is wasted. So if one chooses to use, say, a 16 MiB block size for metadata, the smallest unit of space that can be allocated is 512 KiB. If one chooses a 1 MiB block size, the smallest allocation unit is 32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with any reasonable data block size); directory blocks used to be limited to 32 KiB, but in the current code can be as large as 256 KiB. As one can observe, using a 16 MiB metadata block size would lead to a considerable amount of wasted space for IBs and large directories (small directories can live in inodes). On the other hand, with a 1 MiB block size, there'll be no wasted metadata space.

Does any of this actually make a practical difference? That depends on the file system composition, namely the number of IBs (which is a function of the number of large files) and larger directories. Calculating this scientifically can be pretty involved, and really should be the job of a tool that ought to exist, but doesn't (yet). A more practical approach is doing a ballpark estimate using local file counts and typical fractions of large files and directories, using statistics available from published papers.

The performance implications of a given metadata block size choice is a subject of nearly infinite depth, and this question ultimately can only be answered by doing experiments with a specific workload on specific hardware. The metadata space utilization efficiency is something that can be answered conclusively though.

yuri

From: "Buterbaugh, Kevin L"
To: gpfsug main discussion list
Date: 09/24/2016 07:19 AM
Subject: Re: [gpfsug-discuss] Blocksize
Sent by: gpfsug-discuss-boun...@spectrumscale.org

> Hi Sven,
>
> I am confused by your statement that the metadata block size should be 1 MB and am very interested in learning the rationale behind this as I am currently looking at all aspects of our current GPFS configuration and the possibility of making major changes. [...]
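[To put Yuri's waste arithmetic in concrete terms, a sketch for a hypothetical file system with 10 million 32 KiB indirect blocks; the IB count is invented purely for illustration.]

# Wasted space per indirect block = sub-block size - IB size,
# where sub-block = metadata blocksize / 32:
awk 'BEGIN {
    ibs    = 10e6             # indirect blocks (hypothetical)
    ibsize = 32 * 1024        # 32 KiB per IB
    split("1048576 16777216", bs, " ")
    for (i = 1; i <= 2; i++) {
        sub_sz = bs[i] / 32
        waste  = (sub_sz > ibsize) ? sub_sz - ibsize : 0
        printf "%2d MiB metadata blocks: %6.1f GiB wasted on IBs\n",
               bs[i] / 2^20, ibs * waste / 2^30
    }
}'
# ->  1 MiB metadata blocks:    0.0 GiB wasted on IBs
#    16 MiB metadata blocks: 4577.6 GiB wasted on IBs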
Re: [gpfsug-discuss] Blocksize
Well, it's not that easy, and there is no perfect answer here. So let's start with some data points that might help decide:

Inodes, directory blocks, and allocation maps for data as well as metadata don't follow the same restrictions as data 'fragments' or subblocks, meaning they are not bound to the 1/32 of the blocksize. They rather get organized in calculated-size blocks, which can be very small (significantly smaller than 1/32nd) or close to the max of the blocksize for a single object. Therefore the space waste concern doesn't really apply here.

Policy scans love larger blocks, as the blocks will be randomly scattered across the NSDs, and therefore larger contiguous blocks for the inode scan will perform significantly faster on larger metadata blocksizes than on smaller (assuming this is disk; with SSDs this doesn't matter that much). So for disk-based systems it is advantageous to use larger blocks; for SSD-based systems it's less of an issue.

On the other hand, you shouldn't choose too-large blocks even for disk-drive-based systems, as there is one catch to all this: small updates on metadata typically end up writing the whole metadata block, e.g. 256k for a directory block, which now needs to be destaged and read back from another node changing the same block.

hope this helps. Sven

On Sat, Sep 24, 2016 at 7:18 AM Buterbaugh, Kevin L <kevin.buterba...@vanderbilt.edu> wrote:

> Hi Sven,
>
> I am confused by your statement that the metadata block size should be 1
> MB and am very interested in learning the rationale behind this as I am
> currently looking at all aspects of our current GPFS configuration and the
> possibility of making major changes. [...]
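[Sven's stripe-matching rule from his earlier post, worked through as a sketch; the file system name and stanza path are hypothetical.]

# 8+2P RAID 6 with a 128 KiB strip size:
#   full stripe = 8 data strips x 128 KiB = 1 MiB
# so the GPFS block size of the pool on that array should be 1 MiB:
mmcrfs gpfs0 -F /tmp/nsd.stanzas -B 1M -i 4096 -m 2 -M 2

# Verify the block size afterwards:
mmlsfs gpfs0 -B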
Re: [gpfsug-discuss] Blocksize - consider IO transfer efficiency above your other prejudices
(I can answer your basic questions; Sven has more experience with tuning very large file systems, so perhaps he will have more to say...)

1. Inodes are packed into the file of inodes. (There is one file of all the inodes in a filesystem.) If you have a metadata blocksize of 1MB you will have 256 4KB inodes per block. Forget about sub-blocks when it comes to the file of inodes.

2. IF a file's data fits in its inode, then migrating that file from one pool to another just changes the preferred pool name in the inode. No data movement. Should the file later "grow" to require a data block, that data block will be allocated from whatever pool is named in the inode at that time.

See the email I posted earlier today. Basically: FORGET what you thought you knew about optimal metadata blocksize (perhaps based on how you thought metadata was laid out on disk) and just stick to optimal IO transfer blocksizes. Yes, there may be contrived scenarios or even a few real live special cases, but those would be few and far between. Try following the newer, general, easier rule and see how well it works.

From: "Buterbaugh, Kevin L"
To: gpfsug main discussion list
Date: 09/24/2016 10:19 AM
Subject: Re: [gpfsug-discuss] Blocksize
Sent by: gpfsug-discuss-boun...@spectrumscale.org

> Hi Sven,
>
> I am confused by your statement that the metadata block size should be 1 MB and am very interested in learning the rationale behind this as I am currently looking at all aspects of our current GPFS configuration and the possibility of making major changes. [...]
Re: [gpfsug-discuss] Blocksize and MetaData Blocksizes - FORGET the old advice
Metadata is inodes, directories, and indirect blocks (indices). Spectrum Scale (GPFS) Version 4.1 introduced significant improvements to the data structures used to represent directories. Larger inodes supporting data and extended attributes in the inode are other significant, relatively recent improvements. Now small directories are stored in the inode, large directories can use blocks bigger than the old 32KB limit, and any and all directory blocks that are smaller than the metadata blocksize are allocated just like "fragments" - so directories are now space efficient.

SO MUCH SO, that THE OLD ADVICE, about using smallish blocksizes for metadata, GOES "OUT THE WINDOW". Period. FORGET most of what you thought you knew about "best" or "optimal" metadata-blocksize. The new advice is, as Sven wrote: Use a blocksize that optimizes IO transfer efficiency and speed. This is true for BOTH data and metadata.

Now, IF you have the system pool set up as metadata only AND the system pool is on devices that have a different "optimal" block size than your other pools, THEN it may make sense to use two different blocksizes, one for data and another for metadata. For example, maybe you have massively striped RAID or RAID-like (GSS or ESS) storage for huge files - so maybe 8MB is a good blocksize for that. But maybe you have your metadata on SSD devices and maybe 1MB is the "best" blocksize for that.
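A hedged sketch of the two-blocksize setup Marc describes, using the sizes from his example (the device name and stanza file are placeholders, and this assumes a system pool that holds ONLY metadata):

    # data pool with 8M blocks, metadata-only system pool with 1M blocks
    mmcrfs fs1 -F nsd.stanza -B 8M --metadata-block-size 1M -i 4096

    # with sub-blocks at 1/32 of a block, the smallest allocation unit is:
    #   data pool:     8M / 32 = 256K
    #   metadata pool: 1M / 32 = 32K

Directory blocks smaller than the metadata blocksize come out of those 32K fragments, which is why a large metadata blocksize no longer wastes directory space.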
Re: [gpfsug-discuss] Blocksize
Not pedantic but correct: I flipped it, it is 1/32.

--
Cheers

> On 23 Sep 2016, at 22.16, Stephen Ulmer wrote:
>
> Not to be too pedantic, but I believe the subblock size is 1/32 of the block size (which strengthens Luis's arguments below).
Re: [gpfsug-discuss] Blocksize
To keep this great chain going: If my metadata is on FLASH, would having a smaller blocksize for the system pool (metadata only) be helpful? My filesystem blocksize is 8MB.

On Fri, Sep 23, 2016 at 6:34 PM, Stef Coene wrote:
> On 09/22/2016 08:36 PM, Stef Coene wrote:
>> Hi,
>>
>> Is it needed to specify a different blocksize for the system pool that holds the metadata?
>>
>> IBM recommends a 1 MB blocksize for the file system.
>> But I wonder a smaller blocksize (256 KB or so) for metadata is a good idea or not...
>
> I have read the replies and at the end, this is what we will do:
> Since the back-end storage will be V5000 with a default stripe size of 256KB and we use 8 data disks in an array, this means that 256KB * 8 = 2MB is the best choice for block size.
> So 2 MB block size for data is the best choice.
>
> Since the block size for metadata is not that important in the latest releases, we will also go for 2 MB block size for metadata.
>
> Inode size will be left at the default: 4 KB.
>
> Stef
Re: [gpfsug-discuss] Blocksize
On 09/22/2016 08:36 PM, Stef Coene wrote:
> Hi,
>
> Is it needed to specify a different blocksize for the system pool that holds the metadata?
>
> IBM recommends a 1 MB blocksize for the file system.
> But I wonder a smaller blocksize (256 KB or so) for metadata is a good idea or not...

I have read the replies and at the end, this is what we will do:
Since the back-end storage will be V5000 with a default stripe size of 256KB and we use 8 data disks in an array, this means that 256KB * 8 = 2MB is the best choice for block size. So 2 MB block size for data is the best choice.

Since the block size for metadata is not that important in the latest releases, we will also go for 2 MB block size for metadata.

Inode size will be left at the default: 4 KB.

Stef
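Stef's arithmetic in shell form, plus a hedged sanity check after creation (the filesystem name and stanza file are placeholders):

    echo "$((256 * 8)) KiB"    # V5000: 256 KiB strip * 8 data disks = 2048 KiB = 2 MiB stripe

    mmcrfs fs1 -F nsd.stanza -B 2M -i 4096
    mmlsfs fs1 -B -i           # should report 2097152 (block size) and 4096 (inode size)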
Re: [gpfsug-discuss] Blocksize
your metadata block size these days should be 1 MB and there are only very few workloads for which you should run with a filesystem blocksize below 1 MB. so if you don't know exactly what to pick, 1 MB is a good starting point. the general rule still applies that your filesystem blocksize (metadata or data pool) should match your raid controller (or GNR vdisk) stripe size of the particular pool. so if you use a 128k strip size (default in many midrange storage controllers) in an 8+2p raid array, your stripe or track size is 1 MB and therefore the blocksize of this pool should be 1 MB. i see many customers in the field using 1MB or even smaller blocksizes on RAID stripes of 2 MB or above, and your performance will be significantly impacted by that.

Sven

--
Sven Oehme
Scalable Storage Research
email: oeh...@us.ibm.com
Phone: +1 (408) 824-8904
IBM Almaden Research Lab
--

From: Stephen Ulmer
To: gpfsug main discussion list
Date: 09/23/2016 12:16 PM
Subject: Re: [gpfsug-discuss] Blocksize
Sent by: gpfsug-discuss-boun...@spectrumscale.org

Not to be too pedantic, but I believe the subblock size is 1/32 of the block size (which strengthens Luis's arguments below). I thought the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (the --metadata-block-size option to mmcrfs). So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we'd want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you're lucky. There could be a great reason NOT to use 128K metadata block size, but I don't know what it is. I'd be happy to be corrected about this if it's out of whack.

--
Stephen
Re: [gpfsug-discuss] Blocksize
Not to be too pedantic, but I believe the subblock size is 1/32 of the block size (which strengthens Luis's arguments below).

I thought the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (the --metadata-block-size option to mmcrfs).

So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we'd want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you're lucky.

There could be a great reason NOT to use 128K metadata block size, but I don't know what it is. I'd be happy to be corrected about this if it's out of whack.

--
Stephen

> On Sep 22, 2016, at 3:37 PM, Luis Bolinches wrote:
>
> Hi
>
> My 2 cents.
>
> Leave at least 4K inodes, then you get a massive improvement on small files
> (anything less than ~3.5K, minus whatever you use on xattrs).
>
> About blocksize for data, unless you have actual data that suggests that you
> will actually benefit from blocks smaller than 1MB, leave it there. GPFS uses
> subblocks where 1/16th of the BS can be allocated to different files, so the
> "waste" is much less than you think on 1MB and you get the throughput and
> fewer structures for far fewer data blocks.
Re: [gpfsug-discuss] Blocksize
The current (V4.2+) levels of code support bigger directory block sizes, so that's no longer an issue with something like a 1M metadata block size. In fact, there isn't a whole lot of difference between 256K and 1M metadata block sizes; either would work fine. There isn't really a downside in selecting a different block size for metadata, though.

Inode size (the mmcrfs -i option) is orthogonal to the metadata block size selection. We do strongly recommend 4K inodes to everyone. There's the obvious downside of needing more metadata storage for inodes, but the advantages are significant.

yuri

From: Jan-Frode Myklebust
To: gpfsug main discussion list
Date: 09/22/2016 12:25 PM
Subject: Re: [gpfsug-discuss] Blocksize
Sent by: gpfsug-discuss-boun...@spectrumscale.org

https://www.ibm.com/developerworks/community/forums/html/topic?id=----14774266

"Use 256K. Anything smaller makes allocation blocks for the inode file inefficient. Anything larger wastes space for directories. These are the two largest consumers of metadata space." --dlmcnabb

A bit old, but I would assume it still applies.

-jf

On Thu, Sep 22, 2016 at 8:36 PM, Stef Coene wrote:
> Hi,
>
> Is it needed to specify a different blocksize for the system pool that holds the metadata?
>
> IBM recommends a 1 MB blocksize for the file system.
> But I wonder a smaller blocksize (256 KB or so) for metadata is a good idea or not...
>
> Stef
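To put Yuri's "more metadata storage for inodes" caveat in numbers, a back-of-the-envelope sketch (the file count and replication factor here are assumptions, not numbers from the thread):

    inodes=100000000     # allocated inodes, e.g. 100M files
    inode_size=4096      # bytes, as created with mmcrfs -i 4096
    replicas=2           # common metadata replication setting
    echo "$((inodes * inode_size * replicas / 2**30)) GiB of inode file"   # prints 762

Going from 512-byte to 4K inodes multiplies that figure by eight, which is the trade-off Yuri is pointing at.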
Re: [gpfsug-discuss] Blocksize and space and performance for Metadata, release 4.2.x
There have been a few changes over the years that may invalidate some of the old advice about metadata and the disk allocations for it. These have been phased in over the last few years; I am describing the present situation for release 4.2.x.

1) Inode size. Used to be 512. Now you can set the inode size at mmcrfs time. Defaults to 4096.

2) Data in inode. If it fits, then the inode holds the data. Since a 512-byte inode still works, you can have more than 3.5KB of data in a 4KB inode.

3) Extended attributes in inode. Again, if it fits... Extended attributes used to be stored in a separate file of metadata. So extended attribute performance is way better than in the old days.

4) (Small) directories in inode. If it fits, the inode of a directory can hold the directory entries. That gives you about 2x performance on directory reads, for smallish directories.

5) Big directory blocks. Directories used to use a maximum of 32KB per block, potentially wasting a lot of space and yielding poor performance for large directories. Now directory blocks are the lesser of metadata-blocksize and 256KB.

6) Big directories are shrinkable. Directories used to grow in 32KB chunks but never shrink. Yup, even an almost(?) "empty" directory would remain the size the directory had to be at its lifetime maximum. That means just a few remaining entries could be "sprinkled" over many directory blocks. (See also 5.) But now directories autoshrink to avoid wasteful sparsity. Last I looked, the implementation just stopped short of "pushing" tiny directories back into the inode. But a huge directory can be shrunk down to a single (meta)data block. (See --compact in the docs.)

--marc of GPFS
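Items 2 and 4 are easy to observe directly. A small experiment, under the assumptions that the filesystem was created with 4K inodes and that GPFS reports zero allocated blocks for in-inode data (the paths are made up):

    # a file small enough to live in its inode should allocate no data blocks
    head -c 2048 /dev/urandom > /gpfs/fs1/tinyfile
    stat -c '%s bytes, %b blocks' /gpfs/fs1/tinyfile   # expect: 2048 bytes, 0 blocks

    # a small directory should likewise report no separate blocks
    mkdir /gpfs/fs1/tinydir
    stat -c '%b blocks' /gpfs/fs1/tinydir

stat's %b counts 512-byte units actually allocated, so zero blocks for a 2048-byte file is the data-in-inode case from item 2.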
Re: [gpfsug-discuss] Blocksize
Hi

My 2 cents.

Leave at least 4K inodes, then you get a massive improvement on small files (anything less than ~3.5K, minus whatever you use on xattrs).

About blocksize for data, unless you have actual data that suggests that you will actually benefit from blocks smaller than 1MB, leave it there. GPFS uses subblocks where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think on 1MB and you get the throughput and fewer structures for far fewer data blocks.

No warranty at all, but I try to do this when the BS talk comes in (might need some clean up, it could not be the last note, but you get the idea):

POSIX
find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out
GPFS
cd /usr/lpp/mmfs/samples/ilm
gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile
./mmfind /gpfs/shared -ls -type f > find_ls_files.out

CONVERT to CSV

POSIX
cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv
GPFS
cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv

LOAD in octave

FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ","));

Clean the second column (OPTIONAL as the next clean up will do the same)

FILESIZE(:,[2]) = [];

If we are on 4K alignment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!)

FILESIZE(FILESIZE<=3584) = [];

If we are not we need to clean the 0 size files

FILESIZE(FILESIZE==0) = [];

Median

FILESIZEMEDIAN = int32 (median (FILESIZE))

Mean

FILESIZEMEAN = int32 (mean (FILESIZE))

Variance

int32 (var (FILESIZE))

iqr, the interquartile range, i.e., the difference between the upper and lower quartile of the input data.

int32 (iqr (FILESIZE))

Standard deviation

For some FS with lots of files you might need a rather powerful machine to run the calculations on octave; I never hit anything I could not manage on a 64GB RAM Power box. Most of the time it is enough with my laptop.

--
Ystävällisin terveisin / Kind regards / Saludos cordiales / Salutations

Luis Bolinches
Lab Services
http://www-03.ibm.com/systems/services/labservices/

IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland
Phone: +358 503112585

"If you continually give you will continually have." Anonymous

- Original message -
From: Stef Coene
Sent by: gpfsug-discuss-boun...@spectrumscale.org
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] Blocksize
Date: Thu, Sep 22, 2016 10:30 PM

On 09/22/2016 09:07 PM, J. Eric Wonderley wrote:
> It defaults to 4k:
> mmlsfs testbs8M -i
> flag                value                    description
> ------------------- ------------------------ -----------------------------------
>  -i                 4096                     Inode size in bytes
>
> I think you can make it as small as 512b. Gpfs will store very small
> files in the inode.
>
> Typically you want your average file size to be your blocksize and your
> filesystem has one blocksize and one inodesize.

The files are not small, but around 20 MB on average.
So I calculated with IBM that a 1 MB or 2 MB block size is best.

But I'm not sure if it's better to use a smaller block size for the metadata.

The file system is not that large (400 TB) and will hold backup data from CommVault.

Stef
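For a quick look without octave, a rough one-pass alternative in the same spirit (assumes GNU find for -printf; the mount point is a placeholder):

    find /gpfs/shared -type f -printf '%s\n' | sort -n | awk '
      { sizes[NR] = $1; sum += $1 }
      END {
        if (NR == 0) exit
        p90 = int(NR * 0.9); if (p90 < 1) p90 = 1
        printf "count:  %d\n", NR
        printf "mean:   %.0f\n", sum / NR
        printf "median: %d\n", sizes[int((NR + 1) / 2)]
        printf "p90:    %d\n", sizes[p90]
      }'

To mimic Luis's in-inode filter first, slot awk '$1 > 3584' into the pipe before the sort.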
Re: [gpfsug-discuss] Blocksize
On 09/22/2016 09:07 PM, J. Eric Wonderley wrote:
> It defaults to 4k:
> mmlsfs testbs8M -i
> flag                value                    description
> ------------------- ------------------------ -----------------------------------
>  -i                 4096                     Inode size in bytes
>
> I think you can make it as small as 512b. Gpfs will store very small files in the inode.
>
> Typically you want your average file size to be your blocksize and your filesystem has one blocksize and one inodesize.

The files are not small, but around 20 MB on average.
So I calculated with IBM that a 1 MB or 2 MB block size is best.

But I'm not sure if it's better to use a smaller block size for the metadata.

The file system is not that large (400 TB) and will hold backup data from CommVault.

Stef
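Working Stef's numbers through as a rough sketch (this assumes the 400 TB eventually fills with 20 MB files, which is of course an approximation):

    echo $(( (400 * 2**40) / (20 * 2**20) ))   # ~21 million files (prints 20971520)
    echo $(( 20971520 * 4096 / 2**30 ))        # inode file: ~80 GiB at 4K inodes (double it for replication)
    echo $(( 2 * 2**20 / 32 ))                 # 2M blocks waste at most one 64K sub-block per file

So metadata is a tiny fraction of the 400 TB here, which supports not over-thinking the metadata blocksize for this workload.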
Re: [gpfsug-discuss] Blocksize
https://www.ibm.com/developerworks/community/forums/html/topic?id=----14774266

"Use 256K. Anything smaller makes allocation blocks for the inode file inefficient. Anything larger wastes space for directories. These are the two largest consumers of metadata space." --dlmcnabb

A bit old, but I would assume it still applies.

-jf

On Thu, Sep 22, 2016 at 8:36 PM, Stef Coene wrote:
> Hi,
>
> Is it needed to specify a different blocksize for the system pool that holds the metadata?
>
> IBM recommends a 1 MB blocksize for the file system.
> But I wonder a smaller blocksize (256 KB or so) for metadata is a good idea or not...
>
> Stef
Re: [gpfsug-discuss] Blocksize
This is a great idea. However there are quite a few other things to consider:

- Max file count? If you need, say, a couple of billion files, this will affect things.
- Wish to store small files in the system pool in late-model SS/GPFS?
- Encryption? With encryption no file data will be stored in the system pool, so large blocks for small-file storage in system are pointless.
- System pool replication?
- HDD vs SSD for the system pool?
- xxD or array tuning recommendations from your vendor?
- Streaming vs random IO? Do you have a single dedicated app that has a performance profile like xxx?
- Probably more I can't think of off the top of my head, etc etc.

Ed

From: gpfsug-discuss-boun...@spectrumscale.org [gpfsug-discuss-boun...@spectrumscale.org] on behalf of Stef Coene [stef.co...@docum.org]
Sent: Thursday, September 22, 2016 2:36 PM
To: gpfsug main discussion list
Subject: [gpfsug-discuss] Blocksize

Hi,

Is it needed to specify a different blocksize for the system pool that holds the metadata?

IBM recommends a 1 MB blocksize for the file system.
But I wonder a smaller blocksize (256 KB or so) for metadata is a good idea or not...

Stef
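On Ed's first point, the file count ceiling is set independently of the blocksize; a hedged sketch of how that usually looks (the numbers and filesystem name are made up):

    # allow up to 2 billion inodes, preallocating 100 million
    mmchfs fs1 --inode-limit 2000000000:100000000

    # check allocated vs. in-use inodes
    mmdf fs1 -F

Each allocated inode occupies its full inode size on disk (times the metadata replication factor), so a couple of billion 4K inodes is on the order of 16 TB of metadata before the first byte of file data.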
Re: [gpfsug-discuss] Blocksize
It defaults to 4k:

mmlsfs testbs8M -i
flag                value                    description
------------------- ------------------------ -----------------------------------
 -i                 4096                     Inode size in bytes

I think you can make it as small as 512b. Gpfs will store very small files in the inode.

Typically you want your average file size to be your blocksize, and your filesystem has one blocksize and one inodesize.

On Thu, Sep 22, 2016 at 2:36 PM, Stef Coene wrote:
> Hi,
>
> Is it needed to specify a different blocksize for the system pool that holds the metadata?
>
> IBM recommends a 1 MB blocksize for the file system.
> But I wonder a smaller blocksize (256 KB or so) for metadata is a good idea or not...
>
> Stef
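Eric's average-file-size heuristic is easy to test on an existing filesystem; a rough sketch (assumes GNU coreutils df; the mount point is a placeholder, and dividing used bytes by used inodes is only an approximation):

    used_bytes=$(df -B1 --output=used /gpfs/fs1 | tail -1)
    used_inodes=$(df --output=iused /gpfs/fs1 | tail -1)
    echo "average file size: $((used_bytes / used_inodes)) bytes"

Directories and in-inode files pull this number down, so treat it as a starting point rather than a verdict.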