Re: [Cluster-devel] [PATCH v2] mkfs.gfs2: Scale down journal size for smaller devices
Hi Bob,

On 14/02/18 14:13, Bob Peterson wrote:
> Hi,
>
> Comments below.
>
> ----- Original Message -----
> | Currently the default behaviour when the journal size is not specified
> | is to use a default size of 128M, which means that mkfs.gfs2 can run out
> | of space while writing to a small device. The hard default also means
> | that some xfstests fail with gfs2 as they try to create small file
> | systems.
> |
> | This patch addresses these problems by setting sensible default journal
> | sizes depending on the size of the file system. Journal sizes specified
> | by the user are limited to half of the fs. As the minimum journal size
> | is 8MB that means we effectively get a hard minimum file system size of
> | 16MB (per journal).
> |
> | Signed-off-by: Andrew Price
> | ---
> |
> | v2: Andreas found that using 25% of the fs for journals was too large so this
> |     version separates the default journal size calculation from the check used
> |     for user-provided journal sizes, which allows for more sensible defaults.
> |     The default journal sizes for fs size ranges were taken from e2fsprogs.
> |
> |  gfs2/libgfs2/libgfs2.h |  2 ++
> |  gfs2/man/mkfs.gfs2.8   |  5 +++--
> |  gfs2/mkfs/main_mkfs.c  | 56 --
> |  tests/edit.at          |  2 +-
> |  tests/mkfs.at          | 10 +
> |  tests/testsuite.at     |  6 ++
> |  6 files changed, 76 insertions(+), 5 deletions(-)
(snip)
> | +	if (num_blocks < 8192*1024) /* 32 GB */
> | +		return (32768);     /* 128 MB */
> | +	if (num_blocks < 16384*1024)/* 64 GB */
> | +		return (65536);     /* 256 MB */
> | +	if (num_blocks < 32768*1024)/* 128 GB */
> | +		return (131072);    /* 512 MB */
> | +	return 262144;              /* 1 GB */
>
> Perhaps you can adjust the indentation on the comment so it's clear
> that the journal size is 1GB in this case, not the file system size?

The journal size comments are already aligned but I guess I could nudge
the "1 GB" over a little :)

> Here are some random thoughts on the matter:
>
> I'm not sure I like the default journal size going up so quickly at
> 32GB. In most cases, 128MB journals should be adequate.
> I'd like to see a much higher threshold that still uses 128MB journals.
> Unless there's a high level of metadata pressure, after a certain point,
> it's just wasted space. I'd rather see 128MB journals go up to file
> systems of 1TB, for example. I'm not sure it's ever worthwhile to use a
> 1GB journal, but I suppose with today's faster storage and faster
> machines, maybe it would be. Barry recently got some new super-fast
> storage; perhaps we should ask him to test some metadata-intense
> benchmark to see if we can ever push it to the point of waiting for
> journal writes. I'd use instrumentation to tell us whenever journal
> writes need to wait for journal space. Of course, a lot of that hinges
> on the bug I'm currently working on where we often artificially wait
> too long for journal space. (IOW, this is less of a concern when I get
> the bug fixed.)

Good points. It would be useful to see some performance numbers with
different journal/device sizes. For now, based on your comments, perhaps
we can do something like:

    fs size    jsize (at 4K blocks)
    < 512M     8M
    < 2G       16M
    < 8G       32M
    < 16G      64M
    < 1T       128M
    < 10T      512M
    >= 10T     1G

So we get the current default of 128M journals between 16G and 1T, and
we keep the lower values the same to cater for Andreas' test cases. Over
1T a gigabyte is not much wasted space so we might as well increase it
to the max.

> Also, don't forget that GFS2, unlike other file systems, requires a
> journal for each node, and that should also be factored into the
> calculations.

Yes, the changes added in sbd_init() that do a '/ opts->journals' take
the number of journals into account.

> Don't forget also that at a certain size, GFS2 journals can cross
> resource group boundaries,

For a while that's only been true for journals added with gfs2_jadd.
mkfs.gfs2 always creates single-extent journals.

> and therefore have multiple segments to manage.
> It may not be a big deal to carve out a 1GB journal when the file
> system is shiny and new, but after two years of use, the file system
> may be severely fragmented, so gfs2_jadd may add journals that are
> severely fragmented, especially if they're big. Adding a 128MB journal
> is less likely to get into fragmentation concerns than a 1GB journal.
> Writing to a fragmented journal then becomes a slow-down because the
> journal extent map needed to reference it becomes complex, and it's
> used for every journal block written.

All good points to consider. I haven't touched gfs2_jadd yet but perhaps
it would be better to leave it as-is in that case. That said, we should
be fallocate()ing new journals and fallocate() should be doing its best
to avoid fragmentation, although I accept it won't always
Re: [Cluster-devel] [PATCH v2] mkfs.gfs2: Scale down journal size for smaller devices
Hi,

Comments below.

----- Original Message -----
| Currently the default behaviour when the journal size is not specified
| is to use a default size of 128M, which means that mkfs.gfs2 can run out
| of space while writing to a small device. The hard default also means
| that some xfstests fail with gfs2 as they try to create small file
| systems.
|
| This patch addresses these problems by setting sensible default journal
| sizes depending on the size of the file system. Journal sizes specified
| by the user are limited to half of the fs. As the minimum journal size
| is 8MB that means we effectively get a hard minimum file system size of
| 16MB (per journal).
|
| Signed-off-by: Andrew Price
| ---
|
| v2: Andreas found that using 25% of the fs for journals was too large so this
|     version separates the default journal size calculation from the check used
|     for user-provided journal sizes, which allows for more sensible defaults.
|     The default journal sizes for fs size ranges were taken from e2fsprogs.
|
|  gfs2/libgfs2/libgfs2.h |  2 ++
|  gfs2/man/mkfs.gfs2.8   |  5 +++--
|  gfs2/mkfs/main_mkfs.c  | 56 --
|  tests/edit.at          |  2 +-
|  tests/mkfs.at          | 10 +
|  tests/testsuite.at     |  6 ++
|  6 files changed, 76 insertions(+), 5 deletions(-)
(snip)
| +	if (num_blocks < 8192*1024) /* 32 GB */
| +		return (32768);     /* 128 MB */
| +	if (num_blocks < 16384*1024)/* 64 GB */
| +		return (65536);     /* 256 MB */
| +	if (num_blocks < 32768*1024)/* 128 GB */
| +		return (131072);    /* 512 MB */
| +	return 262144;              /* 1 GB */

Perhaps you can adjust the indentation on the comment so it's clear that
the journal size is 1GB in this case, not the file system size?

Here are some random thoughts on the matter:

I'm not sure I like the default journal size going up so quickly at
32GB. In most cases, 128MB journals should be adequate. I'd like to see
a much higher threshold that still uses 128MB journals. Unless there's a
high level of metadata pressure, after a certain point, it's just wasted
space.
I'd rather see 128MB journals go up to file systems of 1TB, for example.
I'm not sure it's ever worthwhile to use a 1GB journal, but I suppose
with today's faster storage and faster machines, maybe it would be.
Barry recently got some new super-fast storage; perhaps we should ask
him to test some metadata-intense benchmark to see if we can ever push
it to the point of waiting for journal writes. I'd use instrumentation
to tell us whenever journal writes need to wait for journal space. Of
course, a lot of that hinges on the bug I'm currently working on where
we often artificially wait too long for journal space. (IOW, this is
less of a concern when I get the bug fixed.)

Also, don't forget that GFS2, unlike other file systems, requires a
journal for each node, and that should also be factored into the
calculations. So if you have a 1TB file system and it chooses a journal
size of 1GB, but it's a 16-node cluster, you're using 16GB of space for
the journals. That's maybe not a tragedy, but it's not likely to give
them any performance benefit either. Unless they need jdata, for
example, which is heavy on journal writes.

Don't forget also that at a certain size, GFS2 journals can cross
resource group boundaries, and therefore have multiple segments to
manage. It may not be a big deal to carve out a 1GB journal when the
file system is shiny and new, but after two years of use, the file
system may be severely fragmented, so gfs2_jadd may add journals that
are severely fragmented, especially if they're big. Adding a 128MB
journal is less likely to get into fragmentation concerns than a 1GB
journal. Writing to a fragmented journal then becomes a slow-down
because the journal extent map needed to reference it becomes complex,
and it's used for every journal block written.

Regards

Bob Peterson
Red Hat File Systems
[Cluster-devel] [PATCH v2] mkfs.gfs2: Scale down journal size for smaller devices
Currently the default behaviour when the journal size is not specified
is to use a default size of 128M, which means that mkfs.gfs2 can run out
of space while writing to a small device. The hard default also means
that some xfstests fail with gfs2 as they try to create small file
systems.

This patch addresses these problems by setting sensible default journal
sizes depending on the size of the file system. Journal sizes specified
by the user are limited to half of the fs. As the minimum journal size
is 8MB that means we effectively get a hard minimum file system size of
16MB (per journal).

Signed-off-by: Andrew Price
---

v2: Andreas found that using 25% of the fs for journals was too large so this
    version separates the default journal size calculation from the check used
    for user-provided journal sizes, which allows for more sensible defaults.
    The default journal sizes for fs size ranges were taken from e2fsprogs.

 gfs2/libgfs2/libgfs2.h |  2 ++
 gfs2/man/mkfs.gfs2.8   |  5 +++--
 gfs2/mkfs/main_mkfs.c  | 56 --
 tests/edit.at          |  2 +-
 tests/mkfs.at          | 10 +
 tests/testsuite.at     |  6 ++
 6 files changed, 76 insertions(+), 5 deletions(-)

diff --git a/gfs2/libgfs2/libgfs2.h b/gfs2/libgfs2/libgfs2.h
index 85ac74cb..15d2a9d1 100644
--- a/gfs2/libgfs2/libgfs2.h
+++ b/gfs2/libgfs2/libgfs2.h
@@ -319,6 +319,8 @@ struct metapath {
 
 #define GFS2_DEFAULT_BSIZE      (4096)
 #define GFS2_DEFAULT_JSIZE      (128)
+#define GFS2_MAX_JSIZE          (1024)
+#define GFS2_MIN_JSIZE          (8)
 #define GFS2_DEFAULT_RGSIZE     (256)
 #define GFS2_DEFAULT_UTSIZE     (1)
 #define GFS2_DEFAULT_QCSIZE     (1)
diff --git a/gfs2/man/mkfs.gfs2.8 b/gfs2/man/mkfs.gfs2.8
index 342a636d..35e355a5 100644
--- a/gfs2/man/mkfs.gfs2.8
+++ b/gfs2/man/mkfs.gfs2.8
@@ -32,8 +32,9 @@ Enable debugging output.
 Print out a help message describing the available options, then exit.
 .TP
 \fB-J\fP \fImegabytes\fR
-The size of each journal. The default journal size is 128 megabytes and the
-minimum size is 8 megabytes.
+The size of each journal. The minimum size is 8 megabytes and the maximum is
+1024. If this is not specified, a value based on a sensible proportion of the
+file system will be chosen.
 .TP
 \fB-j\fP \fIjournals\fR
 The number of journals for mkfs.gfs2 to create. At least one journal is
diff --git a/gfs2/mkfs/main_mkfs.c b/gfs2/mkfs/main_mkfs.c
index 54ff2db6..dda9dab3 100644
--- a/gfs2/mkfs/main_mkfs.c
+++ b/gfs2/mkfs/main_mkfs.c
@@ -552,7 +552,7 @@ static void opts_check(struct mkfs_opts *opts)
 	if (!opts->journals)
 		die( _("no journals specified\n"));
 
-	if (opts->jsize < 8 || opts->jsize > 1024)
+	if (opts->jsize < GFS2_MIN_JSIZE || opts->jsize > GFS2_MAX_JSIZE)
 		die( _("bad journal size\n"));
 
 	if (!opts->qcsize || opts->qcsize > 64)
@@ -575,6 +575,7 @@ static void print_results(struct gfs2_sb *sb, struct mkfs_opts *opts, uint64_t r
 	printf("%-27s%.2f %s (%"PRIu64" %s)\n", _("Filesystem size:"), (fssize / ((float)(1 << 30)) * sb->sb_bsize), _("GB"), fssize, _("blocks"));
 	printf("%-27s%u\n", _("Journals:"), opts->journals);
+	printf("%-27s%uMB\n", _("Journal size:"), opts->jsize);
 	printf("%-27s%"PRIu64"\n", _("Resource groups:"), rgrps);
 	printf("%-27s\"%s\"\n", _("Locking protocol:"), opts->lockproto);
 	printf("%-27s\"%s\"\n", _("Lock table:"), opts->locktable);
@@ -814,6 +815,38 @@ static int place_rgrps(struct gfs2_sbd *sdp, lgfs2_rgrps_t rgs, uint64_t *rgaddr
 	return 0;
 }
 
+/*
+ * Find a reasonable journal file size (in blocks) given the number of blocks
+ * in the filesystem. For very small filesystems, it is not reasonable to
+ * have a journal that fills more than half of the filesystem.
+ *
+ * n.b. comments assume 4k blocks
+ *
+ * This was copied and adapted from e2fsprogs.
+ */
+static int default_journal_size(unsigned bsize, uint64_t num_blocks)
+{
+	int min_blocks = (GFS2_MIN_JSIZE << 20) / bsize;
+
+	if (num_blocks < 2 * min_blocks)
+		return -1;
+	if (num_blocks < 32768)     /* 128 MB */
+		return min_blocks;  /* 8 MB */
+	if (num_blocks < 256*1024)  /* 1 GB */
+		return (4096);      /* 16 MB */
+	if (num_blocks < 512*1024)  /* 2 GB */
+		return (8192);      /* 32 MB */
+	if (num_blocks < 4096*1024) /* 16 GB */
+		return (16384);     /* 64 MB */
+	if (num_blocks < 8192*1024) /* 32 GB */
+		return (32768);     /* 128 MB */
+	if (num_blocks < 16384*1024)/* 64 GB */
+		return (65536);     /* 256 MB */
+	if (num_blocks < 32768*1024)/* 128 GB */
+		return (131072);    /* 512 MB */
+	return 262144;              /* 1 GB */
+}