Re: [Cluster-devel] [PATCH v2] mkfs.gfs2: Scale down journal size for smaller devices

2018-02-14 Thread Andrew Price

Hi Bob,

On 14/02/18 14:13, Bob Peterson wrote:

Hi,

Comments below.

- Original Message -
| Currently the default behaviour when the journal size is not specified
| is to use a default size of 128M, which means that mkfs.gfs2 can run out
| of space while writing to a small device. The hard default also means
| that some xfstests fail with gfs2 as they try to create small file
| systems.
|
| This patch addresses these problems by setting sensible default journal
| sizes depending on the size of the file system. Journal sizes specified
| by the user are limited to half of the fs. As the minimum journal size
| is 8MB, that means we effectively get a hard minimum file system size of
| 16MB (per journal).
|
| Signed-off-by: Andrew Price 
| ---
|
| v2: Andreas found that using 25% of the fs for journals was too large so
| this version separates the default journal size calculation from the
| check used for user-provided journal sizes, which allows for more
| sensible defaults. The default journal sizes for fs size ranges were
| taken from e2fsprogs.
|
|  gfs2/libgfs2/libgfs2.h |  2 ++
|  gfs2/man/mkfs.gfs2.8   |  5 +++--
|  gfs2/mkfs/main_mkfs.c  | 56
|  --
|  tests/edit.at  |  2 +-
|  tests/mkfs.at  | 10 +
|  tests/testsuite.at |  6 ++
|  6 files changed, 76 insertions(+), 5 deletions(-)
(snip)
| + if (num_blocks < 8192*1024) /* 32 GB */
| + return (32768); /* 128 MB */
| + if (num_blocks < 16384*1024)/* 64 GB */
| + return (65536); /* 256 MB */
| + if (num_blocks < 32768*1024)/* 128 GB */
| + return (131072);/* 512 MB */
| + return 262144;  /* 1 GB */

Perhaps you can adjust the indentation on the comment so it's clear
that the journal size is 1GB in this case, not the file system size?


The journal size comments are already aligned but I guess I could nudge 
the "1 GB" over a little :)



Here are some random thoughts on the matter:

I'm not sure I like the default journal size going up so quickly at
32GB. In most cases, 128MB journals should be adequate. I'd like to
see a much higher threshold that still uses 128MB journals.
Unless there's a high level of metadata pressure, after a certain point,
it's just wasted space.

I'd rather see 128MB journals go up to file systems of 1TB, for example.
I'm not sure it's ever worthwhile to use a 1GB journal, but I suppose
with today's faster storage and faster machines, maybe it would be.
Barry recently got some new super-fast storage; perhaps we should ask
him to test some metadata-intense benchmark to see if we can ever
push it to the point of waiting for journal writes. I'd use
instrumentation to tell us whenever journal writes need to wait for
journal space. Of course, a lot of that hinges on the bug I'm currently
working on where we often artificially wait too long for journal space.
(IOW, this is less of a concern when I get the bug fixed).


Good points. It would be useful to see some performance numbers with 
different journal/device sizes. For now, based on your comments, perhaps 
we can do something like


  fs size    jsize (at 4K blocks)
   < 512M      8M
   <   2G     16M
   <   8G     32M
   <  16G     64M
   <   1T    128M
   <  10T    512M
   >= 10T      1G

So we get the current default of 128M journals between 16G and 1T, and 
we keep the lower values the same to cater for Andreas' test cases. Over 
1T a gigabyte is not much wasted space so we might as well increase it 
to the max.
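
To make that concrete, here's a rough sketch of what that table could
look like in code (not the actual mkfs.gfs2 code; the function name is
made up). Everything is expressed in 4K blocks, and shifting a MB count
left by 8 converts it to 4K blocks:

#include <stdint.h>

/* Illustrative only: pick a default journal size (in 4K blocks)
 * from the total fs size (also in 4K blocks), per the table above. */
static uint64_t default_journal_blocks(uint64_t num_blocks)
{
        if (num_blocks < (512ULL << 8))      /* <  512 MB */
                return 8 << 8;               /*     8 MB  */
        if (num_blocks < (2048ULL << 8))     /* <    2 GB */
                return 16 << 8;              /*    16 MB  */
        if (num_blocks < (8192ULL << 8))     /* <    8 GB */
                return 32 << 8;              /*    32 MB  */
        if (num_blocks < (16384ULL << 8))    /* <   16 GB */
                return 64 << 8;              /*    64 MB  */
        if (num_blocks < (1048576ULL << 8))  /* <    1 TB */
                return 128 << 8;             /*   128 MB  */
        if (num_blocks < (10485760ULL << 8)) /* <   10 TB */
                return 512 << 8;             /*   512 MB  */
        return 1024 << 8;                    /*     1 GB  */
}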



Also, don't forget that GFS2, unlike other file systems, requires a
journal for each node, and that should also be factored into the
calculations.


Yes, the changes added in sbd_init() that do a '/ opts->journals' take 
the number of journals into account.
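
For the user-specified case, the idea boils down to something like this
(just a sketch, not the sbd_init() code; names and the error message are
illustrative): cap the requested per-journal size so that all journals
together stay within half of the fs, and keep the 8MB floor:

#include <stdint.h>
#include <stdio.h>

#define MIN_JSIZE_BLOCKS (8 << 8)   /* 8 MB expressed in 4K blocks */

/* Illustrative only: reject a requested per-journal size (in 4K
 * blocks) if all journals together would exceed half of the fs,
 * or if it is below the 8MB minimum. */
static int check_journal_size(uint64_t fs_blocks, unsigned int journals,
                              uint64_t req_jblocks)
{
        uint64_t max_jblocks = (fs_blocks / 2) / journals;

        if (req_jblocks < MIN_JSIZE_BLOCKS || req_jblocks > max_jblocks) {
                fprintf(stderr, "journal size must be between 8MB and "
                                "half of the fs per journal\n");
                return -1;
        }
        return 0;
}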



Don't forget also that at a certain size, GFS2 journals can cross
resource group boundaries,


For a while that's only been true for journals added with gfs2_jadd. 
mkfs.gfs2 always creates single-extent journals.



and therefore have multiple segments to
manage. It may not be a big deal to carve out a 1GB journal when the
file system is shiny and new, but after two years of use, the file system
may be severely fragmented, so gfs2_jadd may add journals that are
severely fragmented, especially if they're big. Adding a 128MB journal
is less likely to get into fragmentation concerns than a 1GB journal.
Writing to a fragmented journal then becomes a slow-down because the
journal extent map needed to reference it becomes complex, and it's
used for every journal block written.

All good points to consider. I haven't touched gfs2_jadd yet but perhaps 
it would be better to leave it as-is in that case. That said, we should 
be fallocate()ing new journals and fallocate() should be doing its best 
to avoid fragmentation, although I accept it won't always be able to.
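
For reference, the preallocation side of that amounts to something like
the following sketch (the path and size are made up, and this is not the
gfs2_jadd code): one fallocate() call for the whole journal, which gives
the allocator the best chance of handing back a few large extents rather
than many small ones:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        const off_t jsize = 128 * 1024 * 1024;  /* e.g. a 128MB journal */
        int fd = open("/mnt/gfs2/new_journal", O_CREAT | O_WRONLY, 0600);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* Reserve all of the journal's blocks up front in one call. */
        if (fallocate(fd, 0, 0, jsize) < 0) {
                perror("fallocate");
                close(fd);
                return 1;
        }
        close(fd);
        return 0;
}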


Thanks

[Cluster-devel] GFS2: Only set PageChecked for jdata pages

2018-02-14 Thread Bob Peterson
Hi,

Before this patch, GFS2 was setting the PageChecked flag for ordered
write pages. This is unnecessary. The ext3 file system only does it
for jdata, and it's only used in jdata circumstances. It only muddies
the already murky waters of writing pages in the aops.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/aops.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 2f725b4a386b..f58716567972 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -940,13 +940,13 @@ static int gfs2_write_end(struct file *file, struct address_space *mapping,
 }
 
 /**
- * gfs2_set_page_dirty - Page dirtying function
+ * jdata_set_page_dirty - Page dirtying function
  * @page: The page to dirty
  *
  * Returns: 1 if it dirtyed the page, or 0 otherwise
  */
  
-static int gfs2_set_page_dirty(struct page *page)
+static int jdata_set_page_dirty(struct page *page)
 {
SetPageChecked(page);
return __set_page_dirty_buffers(page);
@@ -1214,7 +1214,7 @@ static const struct address_space_operations gfs2_ordered_aops = {
.readpages = gfs2_readpages,
.write_begin = gfs2_write_begin,
.write_end = gfs2_write_end,
-   .set_page_dirty = gfs2_set_page_dirty,
+   .set_page_dirty = __set_page_dirty_buffers,
.bmap = gfs2_bmap,
.invalidatepage = gfs2_invalidatepage,
.releasepage = gfs2_releasepage,
@@ -1231,7 +1231,7 @@ static const struct address_space_operations gfs2_jdata_aops = {
.readpages = gfs2_readpages,
.write_begin = gfs2_write_begin,
.write_end = gfs2_write_end,
-   .set_page_dirty = gfs2_set_page_dirty,
+   .set_page_dirty = jdata_set_page_dirty,
.bmap = gfs2_bmap,
.invalidatepage = gfs2_invalidatepage,
.releasepage = gfs2_releasepage,



[Cluster-devel] GFS2: Don't assign b_private until we know it's safe

2018-02-14 Thread Bob Peterson
Hi,

Before this patch, functions gfs2_trans_add_meta and _data would
assign bh->b_private by calling gfs2_alloc_bufdata, despite the
fact that locks to protect the value are dropped, and instead, a
page lock is taken. This patch removes the page lock and adds
checks to see if someone else assigned the value after the proper
locks have been restored.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/trans.c | 25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/fs/gfs2/trans.c b/fs/gfs2/trans.c
index c75cacaa349b..32debf17c8e2 100644
--- a/fs/gfs2/trans.c
+++ b/fs/gfs2/trans.c
@@ -134,7 +134,6 @@ static struct gfs2_bufdata *gfs2_alloc_bufdata(struct gfs2_glock *gl,
bd->bd_gl = gl;
bd->bd_ops = lops;
INIT_LIST_HEAD(&bd->bd_list);
-   bh->b_private = bd;
return bd;
 }
 
@@ -179,12 +178,15 @@ void gfs2_trans_add_data(struct gfs2_glock *gl, struct buffer_head *bh)
if (bd == NULL) {
gfs2_log_unlock(sdp);
unlock_buffer(bh);
-   if (bh->b_private == NULL)
-   bd = gfs2_alloc_bufdata(gl, bh, &gfs2_databuf_lops);
-   else
-   bd = bh->b_private;
+   bd = gfs2_alloc_bufdata(gl, bh, &gfs2_databuf_lops);
lock_buffer(bh);
gfs2_log_lock(sdp);
+   if (bh->b_private) { /* preempted by someone else */
+   kmem_cache_free(gfs2_bufdata_cachep, bd);
+   bd = bh->b_private;
+   } else {
+   bh->b_private = bd;
+   }
}
gfs2_assert(sdp, bd->bd_gl == gl);
set_bit(TR_TOUCHED, &tr->tr_flags);
@@ -219,14 +221,15 @@ void gfs2_trans_add_meta(struct gfs2_glock *gl, struct buffer_head *bh)
if (bd == NULL) {
gfs2_log_unlock(sdp);
unlock_buffer(bh);
-   lock_page(bh->b_page);
-   if (bh->b_private == NULL)
-   bd = gfs2_alloc_bufdata(gl, bh, &gfs2_buf_lops);
-   else
-   bd = bh->b_private;
-   unlock_page(bh->b_page);
+   bd = gfs2_alloc_bufdata(gl, bh, &gfs2_buf_lops);
lock_buffer(bh);
gfs2_log_lock(sdp);
+   if (bh->b_private) { /* preempted by someone else */
+   kmem_cache_free(gfs2_bufdata_cachep, bd);
+   bd = bh->b_private;
+   } else {
+   bh->b_private = bd;
+   }
}
gfs2_assert(sdp, bd->bd_gl == gl);
set_bit(TR_TOUCHED, &tr->tr_flags);
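
For what it's worth, the pattern the patch ends up with can be shown in
a small self-contained userspace sketch (names here are made up; this is
not the GFS2 code): allocate with the lock dropped, retake the lock, and
only install the new object if nobody else got there first, otherwise
free ours and use theirs:

#include <pthread.h>
#include <stdlib.h>

struct bufdata { int placeholder; };

struct buffer {
        pthread_mutex_t lock;
        struct bufdata *priv;
};

/* Illustrative only: the allocate-unlocked, recheck-under-lock idiom. */
static struct bufdata *attach_bufdata(struct buffer *b)
{
        struct bufdata *bd;

        pthread_mutex_lock(&b->lock);
        bd = b->priv;
        if (bd == NULL) {
                pthread_mutex_unlock(&b->lock);
                bd = calloc(1, sizeof(*bd));   /* allocate without the lock */
                pthread_mutex_lock(&b->lock);
                if (b->priv) {                 /* preempted by someone else */
                        free(bd);
                        bd = b->priv;
                } else {
                        b->priv = bd;
                }
        }
        pthread_mutex_unlock(&b->lock);
        return bd;
}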



[Cluster-devel] GFS2: Please pull patch tagged gfs2-4.16.rc1.fixes

2018-02-14 Thread Bob Peterson
Hi Linus,

Would you please pull this one-off patch from Andreas Gruenbacher
that fixes a GFS2 regression? Thanks.

Bob Peterson
---
The following changes since commit 35277995e17919ab838beae765f440674e8576eb:

  Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (2018-02-04 11:45:55 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git tags/gfs2-4.16.rc1.fixes

for you to fetch changes up to 49edd5bf429c405b3a7f75503845d9f66a47dd4b:

  gfs2: Fixes to "Implement iomap for block_map" (2018-02-13 13:38:10 -0700)


Fix regressions in patch Implement iomap for block_map

This tag is meant for pulling a patch called gfs2: Fixes to
"Implement iomap for block_map". The patch fixes some
regressions we recently discovered in commit 3974320ca6.


Andreas Gruenbacher (1):
  gfs2: Fixes to "Implement iomap for block_map"

 fs/gfs2/bmap.c | 43 +++
 1 file changed, 23 insertions(+), 20 deletions(-)



Re: [Cluster-devel] [PATCH v2] mkfs.gfs2: Scale down journal size for smaller devices

2018-02-14 Thread Bob Peterson
Hi,

Comments below.

- Original Message -
| Currently the default behaviour when the journal size is not specified
| is to use a default size of 128M, which means that mkfs.gfs2 can run out
| of space while writing to a small device. The hard default also means
| that some xfstests fail with gfs2 as they try to create small file
| systems.
| 
| This patch addresses these problems by setting sensible default journal
| sizes depending on the size of the file system. Journal sizes specified
| by the user are limited to half of the fs. As the minimum journal size
| is 8MB, that means we effectively get a hard minimum file system size of
| 16MB (per journal).
| 
| Signed-off-by: Andrew Price 
| ---
| 
| v2: Andreas found that using 25% of the fs for journals was too large so
| this version separates the default journal size calculation from the
| check used for user-provided journal sizes, which allows for more
| sensible defaults. The default journal sizes for fs size ranges were
| taken from e2fsprogs.
| 
|  gfs2/libgfs2/libgfs2.h |  2 ++
|  gfs2/man/mkfs.gfs2.8   |  5 +++--
|  gfs2/mkfs/main_mkfs.c  | 56
|  --
|  tests/edit.at  |  2 +-
|  tests/mkfs.at  | 10 +
|  tests/testsuite.at |  6 ++
|  6 files changed, 76 insertions(+), 5 deletions(-)
(snip)
| + if (num_blocks < 8192*1024) /* 32 GB */
| + return (32768); /* 128 MB */
| + if (num_blocks < 16384*1024)/* 64 GB */
| + return (65536); /* 256 MB */
| + if (num_blocks < 32768*1024)/* 128 GB */
| + return (131072);/* 512 MB */
| + return 262144;  /* 1 GB */

Perhaps you can adjust the indentation on the comment so it's clear
that the journal size is 1GB in this case, not the file system size?

Here are some random thoughts on the matter:

I'm not sure I like the default journal size going up so quickly at
32GB. In most cases, 128MB journals should be adequate. I'd like to
see a much higher threshold that still uses 128MB journals.
Unless there's a high level of metadata pressure, after a certain point,
it's just wasted space.

I'd rather see 128MB journals go up to file systems of 1TB, for example.
I'm not sure it's ever worthwhile to use a 1GB journal, but I suppose
with today's faster storage and faster machines, maybe it would be.
Barry recently got some new super-fast storage; perhaps we should ask
him to test some metadata-intense benchmark to see if we can ever
push it to the point of waiting for journal writes. I'd use
instrumentation to tell us whenever journal writes need to wait for
journal space. Of course, a lot of that hinges on the bug I'm currently
working on where we often artificially wait too long for journal space.
(IOW, this is less of a concern when I get the bug fixed).

Also, don't forget that GFS2, unlike other file systems, requires a
journal for each node, and that should also be factored into the
calculations. So if you have a 1TB file system and it chooses a journal
size of 1GB, but it's a 16-node cluster, you're using 16GB of space for
the journals. That's maybe not a tragedy, but it's not likely to give
them any performance benefit either, unless they need jdata, for
example, which is heavy on journal writes.

Don't forget also that at a certain size, GFS2 journals can cross
resource group boundaries, and therefore have multiple segments to
manage. It may not be a big deal to carve out a 1GB journal when the
file system is shiny and new, but after two years of use, the file system
may be severely fragmented, so gfs2_jadd may add journals that are
severely fragmented, especially if they're big. Adding a 128MB journal
is less likely to get into fragmentation concerns than a 1GB journal.
Writing to a fragmented journal then becomes a slow-down because the
journal extent map needed to reference it becomes complex, and it's
used for every journal block written.

Regards

Bob Peterson
Red Hat File Systems