Re: linux-next: new contact(s) for the ceph tree?

2020-05-08 Thread Sage Weil
Jeff Layton 

Thanks, Stephen!
sage


On Sat, 9 May 2020, Stephen Rothwell wrote:

> Hi all,
> 
> I noticed commit
> 
>   3a5ccecd9af7 ("MAINTAINERS: remove myself as ceph co-maintainer")
> 
> appear recently.  So who should I now list as the contact(s) for the
> ceph tree?
> 
> -- 
> Cheers,
> Stephen Rothwell
> 


Re: [RFC PATCH] ceph: initialize superblock s_time_gran to 1

2019-06-27 Thread Sage Weil
On Thu, 27 Jun 2019, Jeff Layton wrote:
> On Thu, 2019-06-27 at 14:51 +0100, Luis Henriques wrote:
> > Having granularity set to 1us results in having inode timestamps with an
> > accuracy different from the fuse client (i.e. atime, ctime and mtime will
> > always end with '000').  This patch normalizes this behaviour and sets the
> > granularity to 1.
> > 
> > Signed-off-by: Luis Henriques 
> > ---
> >  fs/ceph/super.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > Hi!
> > 
> > As far as I could see there are no other side-effects of changing
> > s_time_gran but I'm really not sure why it was initially set to 1000 in
> > the first place so I may be missing something.
> > 
> > diff --git a/fs/ceph/super.c b/fs/ceph/super.c
> > index d57fa60dcd43..35dd75bc9cd0 100644
> > --- a/fs/ceph/super.c
> > +++ b/fs/ceph/super.c
> > @@ -980,7 +980,7 @@ static int ceph_set_super(struct super_block *s, void *data)
> >  	s->s_d_op = &ceph_dentry_ops;
> >  	s->s_export_op = &ceph_export_ops;
> >  
> > -	s->s_time_gran = 1000;  /* 1000 ns == 1 us */
> > +	s->s_time_gran = 1;
> >  
> >  	ret = set_anon_super(s, NULL);  /* what is that second arg for? */
> >  	if (ret != 0)
> 
> 
> Looks like it was set that way since the client code was originally
> merged. Was this an earlier limitation of ceph that is no longer
> applicable?
> 
> In any case, I see no need at all to keep this at 1000, so:

As long as the encoded on-wire time value is at ns resolution, I 
agree!  No recollection of why I did this :(

Reviewed-by: Sage Weil 
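
For readers following along, here is a minimal user-space sketch of the
effect the patch describes: with a granularity of 1000 ns the nanosecond
field always ends in '000', and with a granularity of 1 it is left
untouched. The struct ts type and ts_truncate() helper are hypothetical
stand-ins written for this illustration; they are not the kernel's own
types or functions, and only approximate what s_time_gran controls.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for a timestamp (not a kernel type). */
struct ts {
	int64_t sec;
	long nsec;	/* expected range: 0 .. 999999999 */
};

/*
 * Round the nanosecond field down to a multiple of gran_ns.
 * Assumption: this loosely mirrors the truncation that a filesystem's
 * s_time_gran value causes; it is a sketch, not the kernel code.
 */
struct ts ts_truncate(struct ts t, long gran_ns)
{
	assert(gran_ns > 0 && gran_ns <= 1000000000L);
	assert(t.nsec >= 0 && t.nsec < 1000000000L);

	t.nsec -= t.nsec % gran_ns;
	return t;
}
```

Calling ts_truncate() with gran_ns set to 1000 drops the last three
digits of the nanosecond value, while gran_ns set to 1 returns the
timestamp unchanged.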


[GIT PULL] Ceph fixes for -rc2

2016-06-03 Thread Sage Weil
Hi Linus,

Please pull the following Ceph fixes from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

We have a few follow-up fixes for the libceph refactor from Ilya, and then 
some cephfs + fscache fixes from Zheng.  The first two FS-Cache patches 
are acked by David Howells and deemed trivial enough to go through our 
tree.  The rest fix some issues with the ceph fscache handling (disable 
cache for inodes opened for write, and simplify the revalidation logic 
accordingly, dropping the now-unnecessary work queue).

Thanks!
sage



Ilya Dryomov (3):
  libceph: change ceph_osdmap_flag() to take osdc
  libceph: put request only if it's done in handle_reply()
  libceph: use %s instead of %pE in dout()s

Yan, Zheng (7):
  FS-Cache: wake write waiter after invalidating writes
  FS-Cache: make check_consistency callback return int
  ceph: call __fscache_uncache_page() if readpages fails
  ceph: avoid unnecessary fscache invalidation/revlidation
  ceph: disable fscache when inode is opened for write
  ceph: improve fscache revalidation
  ceph: use i_version to check validity of fscache

 fs/cachefiles/interface.c   |   2 +-
 fs/ceph/addr.c  |   6 +-
 fs/ceph/cache.c | 141 +---
 fs/ceph/cache.h |  44 -
 fs/ceph/caps.c  |  23 +++
 fs/ceph/file.c  |  27 ++--
 fs/ceph/super.h |   4 +-
 fs/fscache/page.c   |   2 +
 include/linux/ceph/osd_client.h |   5 ++
 include/linux/ceph/osdmap.h |   5 --
 include/linux/fscache-cache.h   |   2 +-
 net/ceph/osd_client.c   |  51 +++
 net/ceph/osdmap.c   |   4 +-
 13 files changed, 138 insertions(+), 178 deletions(-)


Re: [GIT PULL] Ceph updates for 4.7-rc1

2016-05-26 Thread Sage Weil
On Thu, 26 May 2016, Linus Torvalds wrote:
> On Thu, May 26, 2016 at 11:31 AM, Linus Torvalds
>  wrote:
> >
> > Pulled and then immediately unpulled again.
> 
> .. and having thought it over, I ended up re-pulling again, so now
> it's going through my build test.
> 
> Consider this discussion a strong encouragement to *not* do this in
> the future - sending me pull requests at the end of the merge window
> without them having been in linux-next is a no-no, unless those pull
> requests are small and trivial (or have fixes that I'd pull even
> outside the merge window, of course).

Thank you!  We'll be sure we include things in -next well beforehand next 
time around, especially if it's a big diff like this one.

One point of clarification, though: in the past I've squashed down fixes 
discovered during testing if the branch hasn't hit a stable tree yet 
(e.g., your tree).  AIUI this is(was?) standard procedure for things in 
-next.  Do you want us to avoid squashing if we are creeping up on pull 
request time, or are you primarily interested in, say, seeing that what 
has been in -next for a while is substantially the same as what you pull, 
and has perhaps been there unmodified for at least a few days?  Or would 
you rather see fixup patches if we identify issues in the last few days of 
testing?

Thanks-
sage


Re: [GIT PULL] Ceph updates for 4.7-rc1

2016-05-26 Thread Sage Weil
On Thu, 26 May 2016, Linus Torvalds wrote:
> On Thu, May 26, 2016 at 11:18 AM, Sage Weil <sw...@redhat.com> wrote:
> >
> > Please pull the following Ceph updates from
> 
> Why was that branch rebased yesterday?
> 
> What has been in next, if anything?
> 
> And if something has been in next, why was _that_ not sent to me?

The branch was assembled in its current form yesterday and is included in 
today's -next:


https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=e536030934aebf049fe6aaebc58dd37aeee21840

The same commit went through our internal testing last night, and we've 
been testing the code for the better part of a week internally.

If you want it to bake longer in -next first, let us know.  We're not 
causing merge conflicts, and there isn't -next-based ceph testing that I'm 
aware of going on outside of our own QA environment, so I'm not sure how 
valuable it is, but I'm happy to delay before sending a pull request if 
that's what you want to see.

Thanks-
sage


[GIT PULL] Ceph updates for 4.7-rc1

2016-05-26 Thread Sage Weil
Hi Linus,

Please pull the following Ceph updates from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

This changeset has a few main parts:

 * Ilya has finished a huge refactoring effort to sync up the client-side 
logic in libceph with the user-space client code, which has evolved 
significantly over the last couple years, with lots of additional 
behaviors (e.g., how requests are handled when cluster is full and 
transitions from full to non-full).  This structure of the code is more 
closely aligned with userspace now such that it will be much easier to 
maintain going forward when behavior changes take place.  There are some 
locking improvements bundled in as well.

 * Zheng adds multi-filesystem support (multiple namespaces within the 
same Ceph cluster)

 * Zheng has changed the readdir offsets and directory enumeration so that 
dentry offsets are hash-based and therefore stable across directory 
fragmentation events on the MDS.

 * Zheng has a smorgasbord of bug fixes across fs/ceph.

Thanks!
sage
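
The idea behind hash-based readdir offsets can be sketched in user space
as follows. The bit layout (OFF_BITS), and the helper names
make_hash_pos(), pos_hash() and pos_idx(), are assumptions made up for
this illustration and are not the encoding used in fs/ceph. The point is
only that a position derived from the name hash does not depend on which
directory fragment currently holds the entry, so it survives a fragment
split or merge on the MDS.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical split: low bits hold an index, high bits the name hash. */
#define OFF_BITS 28
#define OFF_MASK ((UINT64_C(1) << OFF_BITS) - 1)

/* Compose a directory position from a dentry name hash and an index. */
uint64_t make_hash_pos(uint32_t name_hash, uint32_t idx)
{
	assert(idx <= OFF_MASK);
	return ((uint64_t)name_hash << OFF_BITS) | idx;
}

/* Recover the hash part of a position. */
uint32_t pos_hash(uint64_t pos)
{
	return (uint32_t)(pos >> OFF_BITS);
}

/* Recover the index part of a position. */
uint32_t pos_idx(uint64_t pos)
{
	return (uint32_t)(pos & OFF_MASK);
}
```

Because the hash occupies the high bits, positions sort by hash first,
which gives a directory enumeration order that does not change when
entries move between fragments.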




Ilya Dryomov (40):
  rbd: get/put img_request in rbd_img_request_submit()
  libceph: make ceph_osdc_put_request() accept NULL
  libceph: grab snapc in ceph_osdc_alloc_request()
  libceph: move message allocation out of ceph_osdc_alloc_request()
  libceph: change how osd_op_reply message size is calculated
  libceph: variable-sized ceph_object_id
  rbd: use header_oid instead of header_name
  libceph: nuke unused fields and functions
  libceph: open-code remove_{all,old}_osds()
  libceph: DEFINE_RB_FUNCS macro
  libceph: fix ceph_eversion encoding
  libceph: rename ceph_oloc_oid_to_pg()
  libceph: ceph_osds, ceph_pg_to_up_acting_osds()
  libceph: rename ceph_calc_pg_primary()
  libceph: make pgid_cmp() global
  libceph: pi->min_size, pi->last_force_request_resend
  libceph: introduce ceph_osd_request_target, calc_target()
  libceph: switch to calc_target(), part 1
  libceph: switch to calc_target(), part 2
  libceph: drop msg argument from ceph_osdc_callback_t
  libceph: redo callbacks and factor out MOSDOpReply decoding
  libceph: move schedule_delayed_work() in ceph_osdc_init()
  libceph: schedule tick from ceph_osdc_init()
  libceph: allocate dummy osdmap in ceph_osdc_init()
  libceph: handle_one_map()
  libceph: osd_init() and osd_cleanup()
  libceph: allocate ceph_osd with GFP_NOFAIL
  libceph: protect osdc->osd_lru list with a spinlock
  libceph: a major OSD client update
  libceph: request_init() and request_release_checks()
  libceph: wait_request_timeout()
  rbd: rbd_dev_header_unwatch_sync() variant
  libceph, rbd: ceph_osd_linger_request, watch/notify v2
  libceph: support for sending notifies
  libceph: support for checking on status of watch
  libceph: async MON client generic requests
  libceph: pool deletion detection
  libceph: take osdc->lock in osdmap_show() and dump flags in hex
  libceph: replace ceph_monc_request_next_osdmap()
  libceph: support for subscribing to "mdsmap." maps

Yan, Zheng (30):
  ceph: multiple filesystem support
  ceph: CEPH_FEATURE_MDSENC support
  ceph: renew caps for read/write if mds session got killed.
  ceph: don't call truncate_pagecache in ceph_writepages_start
  ceph: don't show symlink target in debugfs/mdsc
  ceph: report mount root in session metadata
  ceph: use CEPH_MDS_OP_RMXATTR request to remove xattr
  ceph: search cache postion for dcache readdir
  ceph: remove unnecessary checks in __dcache_readdir
  ceph: simplify 'offset in frag'
  ceph: define struct for dir entry in readdir reply
  ceph: define 'end/complete' in readdir reply as bit flags
  ceph: record 'offset' for each entry of readdir result
  ceph: don't forbid marking directory complete after forward seek
  ceph: using hash value to compose dentry offset
  ceph: fix inode reference leak
  ceph: don't assume frag tree splits in mds reply are sorted
  ceph: fix dir_auth check in ceph_fill_dirfrag()
  ceph: keep leaf frag when updating fragtree
  ceph: improve fragtree change detection
  ceph: tolerate bad i_size for symlink inode
  ceph: block non-fatal signals for fault/page_mkwrite
  ceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEM
  ceph: handle -EAGAIN returned by ceph_update_writeable_page()
  libceph: make ceph_osdc_wait_request() uninterruptible
  ceph: make ceph_update_writeable_page() uninterruptible
  ceph: handle interrupted ceph_writepage()
  ceph: SetPageError() for writeback pages if writepages fails
  ceph: don't use truncate_pagecache() to invalidate read cache
  ceph: fix wake_up_session_cb()

Zhang Zhuoyu (1):
  ceph: make logical calculation functions return bool

 

[GIT PULL] Ceph fixes for -rc6

2016-04-28 Thread Sage Weil
Hi Linus,

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

There is a lifecycle fix in the auth code, a fix for a narrow race 
condition on map, and a helpful message in the log when there is a feature 
mismatch (which happens frequently now that the default server-side 
options have changed).

Thanks!
sage




Ilya Dryomov (3):
  libceph: make authorizer destruction independent of ceph_auth_client
  rbd: fix rbd map vs notify races
  rbd: report unsupported features to syslog

 drivers/block/rbd.c | 52 +++---
 fs/ceph/mds_client.c|  6 ++--
 include/linux/ceph/auth.h   | 10 +++---
 include/linux/ceph/osd_client.h |  1 -
 net/ceph/auth.c |  8 ++---
 net/ceph/auth_none.c| 71 ++---
 net/ceph/auth_none.h|  3 +-
 net/ceph/auth_x.c   | 21 ++--
 net/ceph/auth_x.h   |  1 +
 net/ceph/osd_client.c   |  6 ++--
 10 files changed, 87 insertions(+), 92 deletions(-)


[GIT PULL] Ceph fix for -rc3

2016-04-07 Thread Sage Weil
[This time with correct To: line :)]

Hi Linus,

Please pull the following Ceph RBD patch from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

This just fixes a few remaining memory allocations in RBD to use GFP_NOIO 
instead of GFP_ATOMIC.

Thanks!
sage


David Disseldorp (1):
  rbd: use GFP_NOIO consistently for request allocations

 drivers/block/rbd.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)



[GIT PULL] Ceph fix for -rc3

2016-04-07 Thread Sage Weil
Hi Linus,

Please pull the following Ceph RBD patch from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

This just fixes a few remaining memory allocations in RBD to use GFP_NOIO 
instead of GFP_ATOMIC.

Thanks!
sage


David Disseldorp (1):
  rbd: use GFP_NOIO consistently for request allocations

 drivers/block/rbd.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)


[GIT PULL] Ceph updates for 4.6-rc1

2016-03-25 Thread Sage Weil
Hi Linus,

Please pull the following Ceph updates from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

There is quite a bit here, including some overdue refactoring and cleanup 
on the mon_client and osd_client code from Ilya, scattered writeback 
support for CephFS and a pile of bug fixes from Zheng, and a few random 
cleanups and fixes from others.

This series is based on a recent merge of Al's tree to avoid conflicts 
with his splice_dentry changes.

Thanks!
sage


Anton Protopopov (1):
  ceph: fix a wrong comparison

Deepa Dinamani (1):
  ceph: replace CURRENT_TIME by current_fs_time()

Geliang Tang (3):
  rbd: use KMEM_CACHE macro
  ceph: use kmem_cache_zalloc
  libceph: use KMEM_CACHE macro

Ilya Dryomov (15):
  libceph: move debugfs initialization into __ceph_open_session()
  libceph: decouple hunting and subs management
  libceph: revamp subs code, switch to SUBSCRIBE2 protocol
  libceph: pick a different monitor when reconnecting
  libceph: monc ping rate is 10s
  libceph: monc hunt rate is 3s with backoff up to 30s
  libceph: introduce and switch to reopen_session()
  libceph: reschedule tick in mon_fault()
  libceph: behave in mon_fault() if cur_mon < 0
  libceph: rename ceph_osd_req_op::payload_len to indata_len
  libceph: make r_request msg_size calculation clearer
  libceph: osdc->req_mempool should be backed by a slab pool
  libceph: enable large, variable-sized OSD requests
  ceph: kill ceph_empty_snapc
  libceph: use sizeof_footer() more

Yan, Zheng (14):
  ceph: encode ctime in cap message
  ceph: don't enable rbytes mount option by default
  ceph: remove useless BUG_ON
  libceph: move r_reply_op_{len,result} into struct ceph_osd_req_op
  libceph: add helper that duplicates last extent operation
  ceph: scattered page writeback
  ceph: fix race during filling readdir cache
  ceph: avoid updating directory inode's i_size accidentally
  ceph: remove unnecessary NULL check
  ceph: fix mounting same fs multiple times
  ceph: don't request vxattrs from MDS
  ceph: fix security xattr deadlock
  ceph: kill ceph_get_dentry_parent_inode()
  ceph: use lookup request to revalidate dentry

 drivers/block/rbd.c|  14 +-
 fs/ceph/addr.c | 324 --
 fs/ceph/caps.c |  11 +-
 fs/ceph/dir.c  |  69 --
 fs/ceph/export.c   |  13 ++
 fs/ceph/file.c |  15 +-
 fs/ceph/inode.c|  34 ++-
 fs/ceph/mds_client.c   |   7 +-
 fs/ceph/snap.c |  16 --
 fs/ceph/super.c|  47 ++--
 fs/ceph/super.h|  23 +-
 fs/ceph/xattr.c|  78 ++-
 include/linux/ceph/ceph_features.h |   2 +
 include/linux/ceph/ceph_fs.h   |   7 +-
 include/linux/ceph/libceph.h   |   8 +-
 include/linux/ceph/mon_client.h|  31 ++-
 include/linux/ceph/osd_client.h|  15 +-
 net/ceph/ceph_common.c |   4 +-
 net/ceph/debugfs.c |  17 +-
 net/ceph/messenger.c   |  29 +--
 net/ceph/mon_client.c  | 457 -
 net/ceph/osd_client.c  | 109 ++---
 22 files changed, 811 insertions(+), 519 deletions(-)


Re: [ceph] what's going on with d_rehash() in splice_dentry()?

2016-03-07 Thread Sage Weil
On Mon, 7 Mar 2016, Al Viro wrote:
> On Wed, Mar 02, 2016 at 11:00:01AM +0800, Yan, Zheng wrote:
> 
> > > This code dates back to when Ceph was originally upstreamed, so the 
> > > history is murky, but I expect at that point I wanted to avoid hashing in 
> > > the no-lease case.  But I don't think it matters.  We should just remove 
> > > the prehash argument from splice_dentry entirely.
> > > 
> > > Zheng, does that sound right?
> > 
> > Yes. I think we can remove the d_rehash(dn) call and rehash parameter.
> 
> Another question in the same general area:
> 	/* null dentry? */
> 	if (!rinfo->head->is_target) {
> 		dout("fill_trace null dentry\n");
> 		if (d_really_is_positive(dn)) {
> 			ceph_dir_clear_ordered(dir);
> 			dout("d_delete %p\n", dn);
> 			d_delete(dn);
> 		} else {
> 			dout("d_instantiate %p NULL\n", dn);
> 			d_instantiate(dn, NULL);
> 			if (have_lease && d_unhashed(dn))
> 				d_rehash(dn);
> 			update_dentry_lease(dn, rinfo->dlease,
> 					    session,
> 					    req->r_request_started);
> 		}
> 		goto done;
> 	}
> What's that d_instantiate() about?  We have just checked that it's
> negative; what's the point of setting ->d_inode to NULL again?  Would it
> be OK if we just do
> 		} else {
> 			if (have_lease && d_unhashed(dn))
> 				d_add(dn, NULL);
> 			update_dentry_lease(dn, rinfo->dlease,
> 					    session,
> 					    req->r_request_started);
> 		}
> in there?

That looks okay, but changing d_rehash to d_add still means you're doing 
the d_instantiate(dn, NULL) in the d_unhashed case; is there a reason you 
changed that line?  Is the dentry_rcuwalk_invalidate in __d_instantiate 
important before rehashing?

> As an aside, tracking back to the originating fs method is
> painful as hell ;-/  I _think_ that rehash can be hit during ->lookup()
> returning a negative, but I wouldn't bet a dime on it not happening from
> other methods...  AFAICS, the change should be OK regardless of what
> it's been called from, but... _ouch_.  Is it documented anywhere public?

It is a pain to follow, yes. FWIW this whole block is predicated on 
req->r_locked_dir being non-NULL (i.e., VFS holds dir->i_mutex), which is 
only true for lookup, create operations (mkdir/mknod/symlink/etc.), 
atomic_open, and the .get_name export op.  There's not much documentation 
beyond a description of the meaning of fields (e.g. r_locked_dir) in 
fs/ceph/mds_client.h ...

sage
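
To make the comparison in this exchange concrete, here is a toy
user-space model of the two variants of the negative-dentry branch. It
follows the reply's reading that d_add() amounts to d_instantiate() plus
a rehash. Everything here (struct toy_dentry, toy_instantiate(),
toy_rehash(), toy_add(), old_path(), new_path()) is hypothetical and
written for illustration only; it ignores locking, RCU and the
dentry_rcuwalk_invalidate question raised above, so it says nothing
about those side effects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; these are not the kernel's struct inode / struct dentry. */
struct toy_inode { int ino; };
struct toy_dentry {
	struct toy_inode *inode;	/* NULL means a negative dentry */
	bool hashed;
};

/* Attach an inode (possibly NULL) to a dentry that is still negative. */
void toy_instantiate(struct toy_dentry *d, struct toy_inode *i)
{
	assert(d->inode == NULL);
	d->inode = i;
}

/* Put the dentry (back) into the toy "hash". */
void toy_rehash(struct toy_dentry *d)
{
	d->hashed = true;
}

/* Assumed equivalence from the thread: add == instantiate + rehash. */
void toy_add(struct toy_dentry *d, struct toy_inode *i)
{
	toy_instantiate(d, i);
	toy_rehash(d);
}

/* Existing code: always instantiate with NULL, rehash only with a lease. */
void old_path(struct toy_dentry *d, bool have_lease)
{
	toy_instantiate(d, NULL);
	if (have_lease && !d->hashed)
		toy_rehash(d);
}

/* Proposed code: touch the dentry only if it is unhashed and leased. */
void new_path(struct toy_dentry *d, bool have_lease)
{
	if (have_lease && !d->hashed)
		toy_add(d, NULL);
}
```

In this model both paths leave a negative dentry in the same final
state for every combination of have_lease and hashed; the only
difference is that the proposed variant skips the NULL instantiate when
no rehash is needed.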



[GIT PULL] Ceph fixes for -rc7

2016-03-05 Thread Sage Weil
Hi Linus,

Please pull the following Ceph patch from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

This is a final commit we missed to align the protocol compatibility with 
the feature bits.  It decodes a few extra fields in two different messages 
and reports EIO when they are used (not yet supported).

Thanks!
sage



Yan, Zheng (1):
  ceph: initial CEPH_FEATURE_FS_FILE_LAYOUT_V2 support

 fs/ceph/addr.c |  4 
 fs/ceph/caps.c | 27 ---
 fs/ceph/inode.c|  2 ++
 fs/ceph/mds_client.c   | 16 
 fs/ceph/mds_client.h   |  1 +
 fs/ceph/super.h|  1 +
 include/linux/ceph/ceph_features.h |  1 +
 7 files changed, 49 insertions(+), 3 deletions(-)


Re: [ceph] what's going on with d_rehash() in splice_dentry()?

2016-03-01 Thread Sage Weil
Hi Al,

On Fri, 26 Feb 2016, Al Viro wrote:
> You have, modulo printks and BUG_ON(),
> {
> 	struct dentry *realdn;
> 	/* dn must be unhashed */
> 	if (!d_unhashed(dn))
> 		d_drop(dn);
> 	realdn = d_splice_alias(in, dn);
> 	if (IS_ERR(realdn)) {
> 		if (prehash)
> 			*prehash = false; /* don't rehash on error */
> 		dn = realdn; /* note realdn contains the error */
> 		goto out;
> 	} else if (realdn) {
> 		dput(dn);
> 		dn = realdn;
> 	}
> 	if ((!prehash || *prehash) && d_unhashed(dn))
> 		d_rehash(dn);
> 
> When d_splice_alias() returns NULL it has hashed the dentry you'd given it;
> when it returns a different dentry, that dentry is also returned hashed.
> IOW, d_rehash(dn) in there should never be called.
> 
> If you have a case when it _is_ called, you've found a bug somewhere and
> I'd like to see details.  AFAICS, the whole prehash thing appears to be
> pointless - even the place where we modify *prehash, since in that case
> we return ERR_PTR() and the only caller passing non-NULL prehash (&have_lease)
> buggers off on such return value past all code that would look at have_lease
> value.

Right.
 
> One possible reading is that you want to prevent hashing in !have_lease
> case of
> 	dn = splice_dentry(dn, in, &have_lease);
> If that's the case, you might have a problem, since it will be hashed no
> matter what...

In this case it doesn't actually matter if it is hashed or not, since 
we will look at the lease state on the dentry before trusting it...

This code dates back to when Ceph was originally upstreamed, so the 
history is murky, but I expect at that point I wanted to avoid hashing in 
the no-lease case.  But I don't think it matters.  We should just remove 
the prehash argument from splice_dentry entirely.

Zheng, does that sound right?

Thanks!
sage
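
The point Al makes above can be checked with a toy user-space model. This is only a sketch, not kernel code: every toy_* name is invented for illustration, and toy_splice_alias() simply assumes the behaviour Al describes (on success, the dentry that comes back is always hashed). Under that assumption the fallback rehash never runs, with or without a lease:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct dentry: only the state the thread talks about. */
struct toy_dentry {
	bool hashed;
	int rehash_calls;	/* how many times the fallback rehash ran */
};

/*
 * Toy stand-in for d_splice_alias(), assuming the behaviour described in
 * the thread: on success it either hashes the dentry it was given and
 * returns NULL, or returns a different dentry that is already hashed.
 */
static struct toy_dentry *toy_splice_alias(struct toy_dentry *dn,
					   struct toy_dentry *alias)
{
	if (alias) {
		alias->hashed = true;
		return alias;
	}
	dn->hashed = true;
	return NULL;
}

/* Shape of the quoted splice_dentry(), minus the error path. */
static struct toy_dentry *toy_splice_dentry(struct toy_dentry *dn,
					    struct toy_dentry *alias,
					    bool *prehash)
{
	struct toy_dentry *realdn;

	dn->hashed = false;			/* models d_drop() */
	realdn = toy_splice_alias(dn, alias);
	if (realdn)
		dn = realdn;
	if ((!prehash || *prehash) && !dn->hashed) {
		dn->hashed = true;		/* models d_rehash() */
		dn->rehash_calls++;
	}
	return dn;
}
```

If rehash_calls stays at zero for every combination of arguments, the prehash parameter has nothing left to control, which is the argument for dropping it. Note that the dentry also ends up hashed when *prehash is false, matching the "hashed no matter what" remark.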


[GIT PULL] Ceph fixes for -rc6

2016-02-26 Thread Sage Weil
Hi Linus,

Please pull the following Ceph fixes for -rc6 from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

There are two small messenger bug fixes and a log spam regression fix.

Thanks!
sage


Ilya Dryomov (3):
  libceph: don't bail early from try_read() when skipping a message
  libceph: use the right footer size when skipping a message
  libceph: don't spam dmesg with stray reply warnings

 net/ceph/messenger.c  | 15 +++
 net/ceph/osd_client.c |  4 ++--
 2 files changed, 13 insertions(+), 6 deletions(-)


[GIT PULL] Ceph fixes for -rc3

2016-02-05 Thread Sage Weil
Hi Linus,

Please pull the following Ceph fixes from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

We have a few wire protocol compatibility fixes, ports of a few recent 
CRUSH mapping changes, and a couple error path fixes.

Thanks!
sage



Dan Carpenter (1):
  ceph: checking for IS_ERR instead of NULL

Ilya Dryomov (6):
  crush: ensure bucket id is valid before indexing buckets array
  crush: ensure take bucket value is valid
  crush: add chooseleaf_stable tunable
  crush: decode and initialize chooseleaf_stable
  libceph: advertise support for TUNABLES5
  libceph: MOSDOpReply v7 encoding

Yan, Zheng (1):
  ceph: fix snap context leak in error path

 fs/ceph/file.c |  6 +++---
 include/linux/ceph/ceph_features.h | 16 +++-
 include/linux/crush/crush.h|  8 +++-
 net/ceph/crush/mapper.c| 33 ++---
 net/ceph/osd_client.c  | 10 ++
 net/ceph/osdmap.c  | 19 ++-
 6 files changed, 75 insertions(+), 17 deletions(-)


[GIT PULL] Ceph updates for -rc1

2016-01-24 Thread Sage Weil
Hi Linus,

Please pull the following Ceph updates for 4.5-rc1 from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

The two main changes are aio support in CephFS, and a series that fixes 
several issues in the authentication key timeout/renewal code.  On top of 
that are a variety of cleanups and minor bug fixes.

Thanks!
sage


Geliang Tang (2):
  libceph: use list_next_entry instead of list_entry_next
  libceph: use list_for_each_entry_safe

Ilya Dryomov (6):
  libceph: fix ceph_msg_revoke()
  libceph: clear messenger auth_retry flag if we fault
  libceph: fix authorizer invalidation, take 2
  libceph: invalidate AUTH in addition to a service ticket
  libceph: kill off ceph_x_ticket_handler::validity
  libceph: remove outdated comment

Markus Elfring (1):
  rbd: delete an unnecessary check before rbd_dev_destroy()

Minfei Huang (1):
  ceph: Avoid to propagate the invalid page point

Yan, Zheng (4):
  ceph: fix double page_unlock() in page_mkwrite()
  ceph: Asynchronous IO support
  ceph: re-send AIO write request when getting -EOLDSNAP error
  ceph: use i_size_{read,write} to get/set i_size

Yaowei Bai (2):
  ceph: remove unused functions in ceph_frag.h
  ceph: ceph_frag_contains_value can be boolean

 drivers/block/rbd.c|   3 +-
 fs/ceph/addr.c |  14 +-
 fs/ceph/cache.c|   8 +-
 fs/ceph/file.c | 509 ++---
 fs/ceph/inode.c|   8 +-
 include/linux/ceph/ceph_frag.h |  37 +--
 include/linux/ceph/messenger.h |   2 +-
 net/ceph/auth_x.c  |  49 +++-
 net/ceph/auth_x.h  |   2 +-
 net/ceph/messenger.c   | 105 ++---
 net/ceph/mon_client.c  |   4 -
 11 files changed, 501 insertions(+), 240 deletions(-)


[GIT PULL] Ceph update for -rc4

2015-12-04 Thread Sage Weil
Hi Linus,

Please pull the following fix from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

This addresses a refcounting bug that leads to a use-after-free.

Thanks!
sage



Ilya Dryomov (1):
  rbd: don't put snap_context twice in rbd_queue_workfn()

 drivers/block/rbd.c | 1 +
 1 file changed, 1 insertion(+)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] Ceph changes for -rc1

2015-11-13 Thread Sage Weil
Hi Linus,

Please pull the following Ceph updates from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

There are several patches from Ilya fixing RBD allocation lifecycle 
issues, a series adding a nocephx_sign_messages option (and associated bug 
fixes/cleanups), several patches from Zheng improving the (directory) 
fsync behavior, a big improvement in IO for direct-io requests when 
striping is enabled from Caifeng, and several other small fixes and 
cleanups.

Thanks!
sage


Arnd Bergmann (1):
  ceph: fix message length computation

Geliang Tang (1):
  ceph: fix a comment typo

Ilya Dryomov (10):
  rbd: return -ENOMEM instead of pool id if rbd_dev_create() fails
  rbd: don't free rbd_dev outside of the release callback
  rbd: set device_type::release instead of device::release
  rbd: remove duplicate calls to rbd_dev_mapping_clear()
  libceph: introduce ceph_x_authorizer_cleanup()
  libceph: msg signing callouts don't need con argument
  libceph: drop authorizer check from cephx msg signing routines
  libceph: stop duplicating client fields in messenger
  libceph: add nocephx_sign_messages option
  libceph: clear msg->con in ceph_msg_release() only

Ioana Ciornei (1):
  libceph: evaluate osd_req_op_data() arguments only once

Julia Lawall (1):
  rbd: drop null test before destroy functions

Shraddha Barke (2):
  libceph: remove con argument in handle_reply()
  libceph: use local variable cursor instead of &msg->cursor

Yan, Zheng (3):
  ceph: don't invalidate page cache when inode is no longer used
  ceph: add request to i_unsafe_dirops when getting unsafe reply
  ceph: make fsync() wait unsafe requests that created/modified inode

Zhu, Caifeng (1):
  ceph: combine as many iovec as possile into one OSD request

 drivers/block/rbd.c| 109 -
 fs/ceph/cache.c|   2 +-
 fs/ceph/caps.c |  76 ++--
 fs/ceph/file.c |  87 
 fs/ceph/inode.c|   1 +
 fs/ceph/mds_client.c   |  57 +++--
 fs/ceph/mds_client.h   |   3 ++
 fs/ceph/super.h|   1 +
 include/linux/ceph/libceph.h   |   4 +-
 include/linux/ceph/messenger.h |  16 ++
 net/ceph/auth_x.c  |  36 +-
 net/ceph/ceph_common.c |  18 +--
 net/ceph/crypto.h  |   4 +-
 net/ceph/messenger.c   |  88 ++---
 net/ceph/osd_client.c  |  34 ++---
 15 files changed, 314 insertions(+), 222 deletions(-)


[GIT PULL] Ceph fix for 4.3

2015-10-30 Thread Sage Weil
Hi Linus,

Please pull the following RBD fix from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

This sets the stable pages flag on the RBD block device when we have CRCs 
enabled.  (This is necessary since the default assumption for block 
devices changed in 3.9.)

Thanks!
sage



Ronny Hegewald (1):
  rbd: require stable pages if message data CRCs are enabled

 drivers/block/rbd.c | 3 +++
 1 file changed, 3 insertions(+)
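
Why CRCs need stable pages, as a rough user-space illustration. This is an assumption about the rationale rather than the rbd code itself, and the toy_* names are hypothetical: if the data CRC is computed before the page goes out and the page is then modified while still in flight, the receiver's checksum of what arrived no longer matches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t toy_crc32c(const unsigned char *buf, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/*
 * Model of a send without stable pages: the checksum is taken first, then
 * the page may be modified (as a writer could do during writeback), then
 * the receiver checksums what actually arrived.
 * Returns 1 if the receiver's CRC matches the sender's, 0 otherwise.
 */
static int toy_send_page(unsigned char *page, size_t len, int modify_in_flight)
{
	uint32_t sent_crc = toy_crc32c(page, len);

	if (modify_in_flight && len > 0)
		page[0] ^= 0xFF;	/* page changes after the CRC was computed */

	return toy_crc32c(page, len) == sent_crc;
}
```

Requiring stable pages corresponds to the modify_in_flight == 0 case: the page contents are not allowed to change between the checksum and the send.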


[GIT PULL] Ceph updates for -rc7

2015-10-23 Thread Sage Weil
Hi Linus,

Please pull the following two patches from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

One is a stopgap to prevent a stack blowout when users have a deep chain 
of image clones.  (We'll rewrite this code to be non-recursive for the 
next window, but in the meantime this is a simple fix that avoids a 
crash.)  The second fixes a refcount underflow.

Thanks!
sage


Ilya Dryomov (2):
  rbd: don't leak parent_spec in rbd_dev_probe_parent()
  rbd: prevent kernel stack blow up on rbd map

 drivers/block/rbd.c | 69 ++---
 1 file changed, 39 insertions(+), 30 deletions(-)
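
A minimal sketch of the two ideas mentioned above: a depth limit as the stopgap and a non-recursive walk as the eventual rewrite. The message does not say how the patch is implemented, so the cap value and all toy_* names here are assumptions for illustration only:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PARENT_CHAIN_LEN 16	/* illustrative cap, assumed for this sketch */

struct toy_image {
	struct toy_image *parent;	/* NULL for an image with no parent */
};

/*
 * Walk the clone chain with a loop instead of recursion, so stack use
 * stays constant no matter how deep the chain is.  Returns the number of
 * ancestors, or -1 if the chain is longer than the cap.
 */
static int toy_probe_parents(const struct toy_image *img)
{
	int depth = 0;

	for (img = img->parent; img; img = img->parent) {
		if (++depth > TOY_MAX_PARENT_CHAIN_LEN)
			return -1;
	}
	return depth;
}
```

With recursion, each parent adds a stack frame, so a long enough chain of clones overflows the fixed-size kernel stack; either the cap or the loop on its own is enough to bound stack use.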


[GIT PULL] Ceph fixes for -rc6

2015-10-16 Thread Sage Weil
Hi Linus,

The following changes since commit 25cb62b76430a91cc6195f902e61c2cb84ade622:

  Linux 4.3-rc5 (2015-10-11 11:09:45 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

for you to fetch changes up to e30b7577bf1d338ca8a273bd2f881de5a41572b7:

  rbd: use writefull op for object size writes (2015-10-16 16:49:01 +0200)

Just two small items from Ilya:

The first patch fixes the RBD readahead to grab full objects.  The second 
fixes the write ops to prevent undue promotion when a cache tier is 
configured on the server side.

Thanks!
sage


Ilya Dryomov (2):
  rbd: set max_sectors explicitly
  rbd: use writefull op for object size writes

 drivers/block/rbd.c   | 10 --
 net/ceph/osd_client.c | 13 +
 2 files changed, 17 insertions(+), 6 deletions(-)


Re: Linux Foundation Technical Advisory Board Elections and Nomination process

2015-10-13 Thread Sage Weil
On Tue, 13 Oct 2015, Grant Likely wrote:
> On 11 Oct 2015 05:20, "Ric Wheeler"  wrote:
> >
> > I would like to nominate Sage Weil with his consent.
> >
> > Sage has led the ceph project since its inception, contributed to the
> > kernel as well as had an influence on projects like openstack.
> 
> Sage, what say you? Do you accept your nomination?

I do!

Thanks-
sage

> 
> g.
> 
> >
> > thanks!
> >
> > Ric
> >
> >
> >
> >
> > On 10/06/2015 01:06 PM, Grant Likely wrote:
> >>
> >> [Resending because I messed up the first one]
> >>
> >> The elections for five of the ten members of the Linux Foundation
> >> Technical Advisory Board (TAB) are held every year[1]. This year the
> >> election will be at the 2015 Kernel Summit in Seoul, South Korea
> >> (probably on the Monday, 26 October) and will be open to all attendees
> >> of both Kernel Summit and Korea Linux Forum.
> >>
> >> Anyone is eligible to stand for election, simply send your nomination to:
> >>
> >> tech-board-disc...@lists.linux-foundation.org
> >>
> >> We currently have 3 nominees for five places:
> >> Thomas Gleixner
> >> Greg Kroah-Hartman
> >> Stephen Hemminger
> >>
> >> The deadline for receiving nominations is up until the beginning of
> >> the event where the election is held. Although, please remember if
> >> you're not going to be present that things go wrong with both networks
> >> and mailing lists, so get your nomination in early).
> >>
> >> Grant Likely, TAB Chair
> >>
> >> [1] TAB members sit for a term of 2 years, and half of the board is up
> >> for election every year. Five of the seats are up for election now.
> >> The other five are half way through their term and will be up for
> >> election next year. The history of the TAB elections can be found
> >> here:
> >>
> >>https://docs.google.com/spreadsheets/d/1jGLQtul0taSRq_opYzJFALI7_34cS4RMS1_YQoTNCKA/edit#gid=0
> >
> >
> 
> 
> 

[GIT PULL] Ceph fixes for -rc2

2015-09-17 Thread Sage Weil
Hi Linus,

The following changes since commit 6ff33f3902c3b1c5d0db6b1e2c70b6d76fba357f:

  Linux 4.3-rc1 (2015-09-12 16:35:56 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

for you to fetch changes up to 335c25858218e76ef47f92ecb9d22e919d36140d:

  libceph: advertise support for keepalive2 (2015-09-17 20:14:27 +0300)

These are both fixes to the new and improved keepalive2 behavior.

Thanks!
sage


Ilya Dryomov (2):
  libceph: don't access invalid memory in keepalive2 path
  libceph: advertise support for keepalive2

 include/linux/ceph/ceph_features.h | 1 +
 include/linux/ceph/messenger.h | 4 +++-
 net/ceph/messenger.c   | 9 +
 3 files changed, 9 insertions(+), 5 deletions(-)


[GIT PULL] Ceph changes for 4.3-rc1

2015-09-11 Thread Sage Weil
Hi Linus,

Please pull the following Ceph updates from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

There are a few fixes for snapshot behavior with CephFS and support for 
the new keepalive protocol from Zheng, a libceph fix that affects both RBD 
and CephFS, a few bug fixes and cleanups for RBD from Ilya, and several 
small fixes and cleanups from Jianpeng and others.

Thanks!
sage



Benoît Canet (1):
  libceph: Avoid holding the zero page on ceph_msgr_slab_init errors

Brad Hubbard (1):
  ceph: remove redundant test of head->safe and silence static analysis warnings

Ilya Dryomov (4):
  libceph: rename con_work() to ceph_con_workfn()
  rbd: fix double free on rbd_dev->header_name
  rbd: plug rbd_dev->header.object_prefix memory leak
  libceph: check data_len in ->alloc_msg()

Jianpeng Ma (3):
  ceph: remove the useless judgement
  ceph: no need to get parent inode in ceph_open
  ceph: cleanup use of ceph_msg_get

Nicholas Krause (1):
  libceph: remove the unused macro AES_KEY_SIZE

Yan, Zheng (7):
  ceph: EIO all operations after forced umount
  ceph: invalidate dirty pages after forced umount
  ceph: fix queuing inode to mdsdir's snaprealm
  libceph: set 'exists' flag for newly up osd
  libceph: use keepalive2 to verify the mon session is alive
  ceph: get inode size for each append write
  ceph: improve readahead for file holes

 drivers/block/rbd.c|  6 ++--
 fs/ceph/addr.c |  6 ++--
 fs/ceph/caps.c |  8 +
 fs/ceph/file.c | 14 
 fs/ceph/mds_client.c   | 59 ++
 fs/ceph/mds_client.h   |  1 +
 fs/ceph/snap.c |  7 
 fs/ceph/super.c|  1 +
 include/linux/ceph/libceph.h   |  2 ++
 include/linux/ceph/messenger.h |  4 +++
 include/linux/ceph/msgr.h  |  4 ++-
 net/ceph/ceph_common.c |  1 +
 net/ceph/crypto.c  |  4 ---
 net/ceph/messenger.c   | 82 +++---
 net/ceph/mon_client.c  | 37 ++-
 net/ceph/osd_client.c  | 51 ++
 net/ceph/osdmap.c  |  2 +-
 17 files changed, 191 insertions(+), 98 deletions(-)

[GIT PULL] Ceph fixes for -rc6

2015-08-03 Thread Sage Weil
Hi Linus,

Please pull the following Ceph fixes from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

There are two critical regression fixes for CephFS from Zheng, and an RBD 
completion fix for layered images from Ilya.

(Note: git request-pull is complaining that the for-linus branch isn't 
referencing the right commit even though it is... hopefully I'm not doing 
something stupid.  The right commit is 
2761713d35e370fd640b5781109f753066b746c4.)

Thanks!
sage


Ilya Dryomov (1):
  rbd: fix copyup completion race

Yan, Zheng (2):
  ceph: fix ceph_encode_locks_to_buffer()
  ceph: always re-send cap flushes when MDS recovers

 drivers/block/rbd.c | 22 +-
 fs/ceph/caps.c  | 22 +-
 fs/ceph/locks.c |  2 +-
 fs/ceph/super.h |  1 -
 4 files changed, 23 insertions(+), 24 deletions(-)


[GIT PULL] Ceph fixes for -rc2

2015-07-09 Thread Sage Weil
Hi Linus,

Please pull the following Ceph fixes from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

There is a fix for CephFS and RBD when used within containers/namespaces, 
and a fix for the address learning the client is supposed to do when 
initially talking to the Ceph cluster.

There are also two patches updating MAINTAINERS.  One breaks out the 
common Ceph code shared by fs/ceph and drivers/block/rbd.c into a separate 
entry with the appropriate maintainers listed.  The second adds a second 
reference to the github tree where the Ceph client development takes place 
(before it is pushed to korg and then to you).  The goal here is to move 
closer to a situation where Ilya Dryomov or one of the other maintainers 
can push things to you if I am unavailable.  Ilya has done most of the 
work preparing branches for upstream recently; you should not be surprised 
to hear from him if I am trapped in some internet-less wasteland or hit by 
a bus or something.  In the meantime, we'll work on getting him added to 
the kernel web of trust.

Thanks-
sage



Ilya Dryomov (2):
  libceph: enable ceph in a non-default network namespace
  libceph: treat sockaddr_storage with uninitialized family as blank

Sage Weil (2):
  MAINTAINERS: update ceph entries
  MAINTAINERS: add secondary tree for ceph modules

 MAINTAINERS| 22 ++
 include/linux/ceph/messenger.h |  3 +++
 net/ceph/ceph_common.c | 16 ++--
 net/ceph/messenger.c   | 24 
 4 files changed, 47 insertions(+), 18 deletions(-)


[GIT PULL] Ceph updates for -rc1

2015-07-02 Thread Sage Weil
Hi Linus,

Please pull the following Ceph updates from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

We have a pile of bug fixes from Ilya, including a few patches that sync 
up the CRUSH code with the latest from userspace.  There is also a long 
series from Zheng that fixes various issues with snapshots, inline data, 
and directory fsync, some simplification and improvement in the cap 
release code, and a rework of the caching of directory contents.  To top 
it off there are a few small fixes and cleanups from Benoit and Hong.

Thanks!
sage



Benoît Canet (2):
  libceph: Remove spurious kunmap() of the zero page
  libceph: Fix ceph_tcp_sendpage()'s more boolean usage

Hong Zhiguo (1):
  libceph: fix wrong name "Ceph filesystem for Linux"

Ilya Dryomov (14):
  libceph: use kvfree() instead of open-coding it
  libceph: nuke time_sub()
  libceph: store timeouts in jiffies, verify user input
  libceph: a couple tweaks for wait loops
  ceph: simplify two mount_timeout sites
  rbd: timeout watch teardown on unmap with mount_timeout
  crush: fix crash from invalid 'take' argument
  crush: sync up with userspace
  rbd: bump queue_max_segments
  rbd: terminate rbd_opts_tokens with Opt_err
  rbd: store rbd_options in rbd_device
  rbd: queue_depth map option
  crush: fix a bug in tree bucket decode
  rbd: use GFP_NOIO in rbd_obj_request_create()

Yan, Zheng (23):
  libceph: properly release STAT request's raw_data_in
  libceph: allow setting osd_req_op's flags
  ceph: check OSD caps before read/write
  ceph: use empty snap context for uninline_data and get_pool_perm
  ceph: set i_head_snapc when getting CEPH_CAP_FILE_WR reference
  ceph: avoid sending unnessesary FLUSHSNAP message
  ceph: take snap_rwsem when accessing snap realm's cached_context
  ceph: don't trim auth cap when there are cap snaps
  ceph: make sure syncfs flushes all cap snaps
  ceph: don't pre-allocate space for cap release messages
  ceph: exclude setfilelock requests when calculating oldest tid
  ceph: ratelimit warn messages for MDS closes session
  ceph: don't include used caps in cap_wanted
  ceph: fix flushing caps
  ceph: fix directory fsync
  ceph: track pending caps flushing accurately
  ceph: track pending caps flushing globally
  ceph: send TID of the oldest pending caps flush to MDS
  ceph: re-send flushing caps (which are revoked) in reconnect stage
  ceph: pre-allocate data structure that tracks caps flushing
  ceph: switch some GFP_NOFS memory allocation to GFP_KERNEL
  ceph: rework dcache readdir
  ceph: fix ceph_writepages_start()

 drivers/block/rbd.c | 111 --
 fs/ceph/acl.c   |   4 +-
 fs/ceph/addr.c  | 308 ---
 fs/ceph/caps.c  | 836 +++-
 fs/ceph/dir.c   | 383 --
 fs/ceph/file.c  |  61 ++-
 fs/ceph/inode.c | 155 ++--
 fs/ceph/mds_client.c| 425 +++-
 fs/ceph/mds_client.h|  23 +-
 fs/ceph/snap.c  | 173 +
 fs/ceph/super.c |  25 +-
 fs/ceph/super.h | 125 +++---
 fs/ceph/xattr.c |  65 +++-
 include/linux/ceph/libceph.h|  21 +-
 include/linux/ceph/osd_client.h |   2 +-
 include/linux/crush/crush.h |  40 +-
 include/linux/crush/hash.h  |   6 +
 include/linux/crush/mapper.h|   2 +-
 net/ceph/ceph_common.c  |  50 ++-
 net/ceph/crush/crush.c  |  13 +-
 net/ceph/crush/crush_ln_table.h |  32 +-
 net/ceph/crush/hash.c   |   8 +-
 net/ceph/crush/mapper.c | 148 ---
 net/ceph/messenger.c|   3 +-
 net/ceph/mon_client.c   |  13 +-
 net/ceph/osd_client.c   |  42 +-
 net/ceph/osdmap.c   |   2 +-
 net/ceph/pagevec.c  |   5 +-
 28 files changed, 2010 insertions(+), 1071 deletions(-)

Re: [GIT PULL] Ceph fixes for -rc5

2015-05-22 Thread Sage Weil
On Fri, 22 May 2015, Linus Torvalds wrote:
> On Fri, May 22, 2015 at 5:13 PM, Sage Weil wrote:
> > Hi Linus,
> >
> > Please pull the following fixes from
> >
> >   git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git
> 
> Nothing there.
> 
> Did you perhaps mean the "for-linus" branch?
> 
> Please fix whatever script it is you use that generates bad pull requests.

Bah, I forgot to push the for-linus branch--it's there now.  Sorry!

(BTW, git://git.kernel.org is going really slowly today... :/)

Thanks-
sage


[GIT PULL] Ceph fixes for -rc5

2015-05-22 Thread Sage Weil
Hi Linus,

Please pull the following fixes from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git 

These fix an issue with the RBD notifications when there are topology 
changes in the cluster.

Thanks!
sage



Ilya Dryomov (2):
  libceph: request a new osdmap if lingering request maps to no osd
  Revert "libceph: clear r_req_lru_item in __unregister_linger_request()"

 net/ceph/osd_client.c | 33 -
 1 file changed, 20 insertions(+), 13 deletions(-)


Re: [PATCH RFC] vfs: add a O_NOMTIME flag

2015-05-12 Thread Sage Weil
On Tue, 12 May 2015, Dave Chinner wrote:
> > > Neither of these example cases is under the control of the
> > > application that calls open(O_NOMTIME).
> > 
> > Wouldn't a mount option (e.g., allow_nomtime) address this concern?  Only 
> > nodes provisioned explicitly to run these systems would enable this
> > option.
> 
> Back to my Joe Speedracer comments.
> 
> I'm not sure what the right answer is - mount options are simply too
> easy to add without understanding the full implications of them.
> e.g. we didn't merge FALLOC_FL_NO_HIDE_STALE simply because it was
> too dangerous for unsuspecting users. This isn't at that same level
> of concern, but it's still a landmine we want to keep users from
> arming without realising it...
> 
> > > >> I'm happy for it to be an ioctl interface - even an XFS specific
> > > >> interface if you want to go that route, Sage - and it probably
> > > >> should emit a warning to syslog first time it is used so there is
> > > >> trace for bug triage purposes. i.e. we know the app is not using
> > > >> mtime updates, so bug reports that are the result of mtime
> > > >> mishandling don't result in large amounts of wasted developer time
> > > >> trying to understand them...
> > > >
> > > > A warning on using the interface (or when mounting with user_nomtime)
> > > > sounds reasonable.
> > > >
> > > > > I'd rather not make this XFS specific as other local filesystems (ext4,
> > > > > f2fs, possibly btrfs) would similarly benefit.  (And if we want to
> > > > > target XFS specifically the existing XFS open-by-handle ioctl is
> > > > > sufficient as it already does O_NOMTIME unconditionally.)
> > > 
> > > Lack of a namespace, doesn't imply that you don't want to manage the
> > > data. The whole point of using object storage instead of plain old
> > > block storage is to be able to provide whatever metadata you still
> > > need in order to manage the object.
> > 
> > Yeah, agreed--this is presumably why open_by_handle(2) (which is what we'd 
> > like to use) doesn't assume O_NOMTIME.
> 
> Right - the XFS ioctls were designed specifically for applications
> that interacted directly with the structure of XFS filesystems and
> so needed invisible IO (e.g. online defragmenter). IOWs, they are
> not interfaces intended for general usage. They are also only
> available to root, so a typical user application won't be making use
> of them, either.

I understand that's what they're intended for, but I'm having a hard time 
parsing out the difference between what they *do* and what O_NOMTIME + -o 
allow_nomtime does.  The open-by-handle ioctls have nothing to do with the 
online XFS format--they simply allow you to open a file via an opaque 
handle (albeit a differently formatted one than the generic 
open_by_handle_at(2)).  They also force you into an O_NOMTIME-equivalent 
mode.

AFAICS the only differences are that

1) the ioctl is XFS specific.  (As open_by_handle_at(2) demonstrates, this 
needn't be the case.)

2) the NOMTIME mode is only available via the open-by-handle interface, 
not open(2).

3) it is an ioctl interface, and thus more obscure.  (Well, there is a 
libhandle library, but it doesn't seem to be widely used.)

Would you object less if 

1) the O_NOMTIME flag were only available via open_by_handle_at(2)?

2) an equivalent ioctl were implemented for each file system of interest 
that (say) called into open_by_handle_at(2) code, adding in the O_NOMTIME 
flag?

3) O_NOMTIME required root (vs a mount option that requires root and 
unprivileged O_NOMTIME)?

Just trying to tease apart which part is problematic...

Thanks!
sage
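
Note on the generic interface compared above: what follows is a minimal
userspace sketch, not code from this thread, of the two-step open that
name_to_handle_at(2) and open_by_handle_at(2) provide.  The helper name
open_via_handle is made up for illustration.  O_NOMTIME appears only in a
comment, because it was the proposal under discussion and not an existing
flag.  open_by_handle_at(2) needs CAP_DAC_READ_SEARCH, so an unprivileged
caller should expect a negative EPERM from the second step.

```c
#define _GNU_SOURCE             /* must precede all includes */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>

/*
 * Hypothetical helper: open `path` through a file handle.
 * `mount_fd` is an fd for any object on the same filesystem (or
 * AT_FDCWD for the filesystem of the current directory).
 * Returns a file descriptor, or a negative errno value on failure.
 */
int open_via_handle(int mount_fd, const char *path, int flags)
{
    struct file_handle *fh;
    int mount_id, fd, err;

    assert(path != NULL);

    fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
    if (fh == NULL)
        return -ENOMEM;
    fh->handle_bytes = MAX_HANDLE_SZ;

    /* Step 1: translate the path into an opaque, filesystem-defined handle. */
    if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) < 0) {
        err = errno;
        free(fh);
        return -err;
    }

    /*
     * Step 2: open from the handle.  A flag like the proposed O_NOMTIME
     * (not an existing flag) would be OR-ed into `flags` here.
     */
    fd = open_by_handle_at(mount_fd, fh, flags);
    err = errno;
    free(fh);
    return fd < 0 ? -err : fd;
}
```

A caller would use it as open_via_handle(AT_FDCWD, "some/file", O_RDONLY)
and treat any negative return as an errno value.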



Re: [PATCH RFC] vfs: add a O_NOMTIME flag

2015-05-12 Thread Sage Weil
On Tue, 12 May 2015, Kevin Easton wrote:
> On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote:
> > On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote:
> > > > Let me re-ask the question that I asked last week (and was apparently
> > > > ignored).  Why not try to use the lazytime feature instead of
> > > > pointing a gun straight at the application's --- and system
> > > > administrators' --- heads?
> > > 
> > > Sorry Ted, I thought I responded already.
> > > 
> > > The goal is to avoid inode writeout entirely when we can, and 
> > > as I understand it lazytime will still force writeout before the inode 
> > > is dropped from the cache.  In systems like Ceph in particular, the 
> > > IOs can be spread across lots of files, so simply deferring writeout 
> > > doesn't always help.
> > 
> > Sure, but it would reduce the writeout by orders of magnitude.  I can
> > understand if you want to reduce it further, but it might be good
> > enough for your purposes.
> > 
> > I considered doing the equivalent of O_NOMTIME for our purposes at
> > $WORK, and our use case is actually not that different from Ceph's
> > (i.e., using a local disk file system to support a cluster file
> > system), and lazytime was (a) something I figured was something I
> > could upstream in good conscience, and (b) was more than good enough
> > for us.
> 
> A safer alternative might be a chattr file attribute that if set, the
> mtime is not updated on writes, and stat() on the file always shows the
> mtime as "right now".  At least that way, the file won't accidentally
> get left out of backups that rely on the mtime.
> 
> (If the file attribute is unset, you immediately update the mtime then
> too, and from then on the file is back to normal).

Interesting!  I didn't realize there was already a chattr +A that disabled 
atime (although I suspect it doesn't do the "right now" for stat thing). 
This makes the nomtime-ness a bit more obscure (I don't think most users 
would think to check these file attributes), but it's a safer failure 
condition for backups at least.

The fact that chattr +A (and hopefully +M) will work for non-root is a 
bonus, as we're also trying to get ceph daemons to drop most privileges.

sage
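
Note on the attribute mechanism mentioned above: as a rough sketch (not
code from this thread), this is how userspace can read and toggle the
existing no-atime attribute that chattr +A sets, using the FS_IOC_GETFLAGS
and FS_IOC_SETFLAGS ioctls.  The function names are made up for
illustration.  The "+M" no-mtime attribute was only a hope at this point
in the discussion, so no flag for it appears in the code.

```c
#include <errno.h>
#include <linux/fs.h>
#include <sys/ioctl.h>

/*
 * Hypothetical helper: report whether the no-atime attribute (chattr +A)
 * is set on the open file `fd`.
 * Returns 1 if set, 0 if clear, or a negative errno value on failure.
 */
int has_noatime_attr(int fd)
{
    int flags = 0;

    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
        return -errno;
    return (flags & FS_NOATIME_FL) != 0;
}

/*
 * Hypothetical helper: set (enable != 0) or clear (enable == 0) the
 * no-atime attribute while leaving every other attribute bit untouched.
 * Returns 0 on success or a negative errno value on failure.
 */
int set_noatime_attr(int fd, int enable)
{
    int flags = 0;

    /* Read the current attribute bits first so they can be preserved. */
    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
        return -errno;

    if (enable)
        flags |= FS_NOATIME_FL;
    else
        flags &= ~FS_NOATIME_FL;

    if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0)
        return -errno;
    return 0;
}
```

Filesystems that do not implement these ioctls typically fail with ENOTTY,
so callers should treat a negative return as "attribute not available"
and not as a fatal error.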


Re: [PATCH RFC] vfs: add a O_NOMTIME flag

2015-05-12 Thread Sage Weil
On Tue, 12 May 2015, Kevin Easton wrote:
 On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote:
  On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote:
Let me re-ask the question that I asked last week (and was apparently
ignored).  Why not trying to use the lazytime feature instead of
pointing a head straight at the application's --- and system
administrators' --- heads?
   
   Sorry Ted, I thought I responded already.
   
   The goal is to avoid inode writeout entirely when we can, and 
   as I understand it lazytime will still force writeout before the inode 
   is dropped from the cache.  In systems like Ceph in particular, the 
   IOs can be spread across lots of files, so simply deferring writeout 
   doesn't always help.
  
  Sure, but it would reduce the writeout by orders of magnitude.  I can
  understand if you want to reduce it further, but it might be good
  enough for your purposes.
  
  I considered doing the equivalent of O_NOMTIME for our purposes at
  $WORK, and our use case is actually not that different from Ceph's
  (i.e., using a local disk file system to support a cluster file
  system), and lazytime was (a) something I figured was something I
  could upstream in good conscience, and (b) was more than good enough
  for us.
 
 A safer alternative might be a chattr file attribute that if set, the
 mtime is not updated on writes, and stat() on the file always shows the
 mtime as right now.  At least that way, the file won't accidentally
 get left out of backups that rely on the mtime.
 
 (If the file attribute is unset, you immediately update the mtime then
 too, and from then on the file is back to normal).

Interesting!  I didn't realize there was already a chattr +A that disabled 
atime (although I suspect it doesn't do the right now for stat thing). 
This makes the nomtime-ness a bit more obscure (I don't think most users 
would think to check these file attributes), but it's a safer failure 
condition for backups at least.

The fact that chattr +A (and hopefully +M) will work for non-root is a 
bonus, as we're also trying to get ceph daemons to drop most privileges.

sage
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] vfs: add a O_NOMTIME flag

2015-05-12 Thread Sage Weil
On Tue, 12 May 2015, Dave Chinner wrote:
> > > Neither of these examples cases are under the control of the
> > > application that calls open(O_NOMTIME).
> >
> > Wouldn't a mount option (e.g., allow_nomtime) address this concern?  Only
> > nodes provisioned explicitly to run these systems would be enable this
> > option.
>
> Back to my Joe Speedracer comments.
>
> I'm not sure what the right answer is - mount options are simply too
> easy to add without understanding the full implications of them.
> e.g. we didn't merge FALLOC_FL_NO_HIDE_STALE simply because it was
> too dangerous for unsuspecting users. This isn't at that same level
> or concern, but it's still a landmine we want to avoid users from
> arming without realising it...
>
> > > > > I'm happy for it to be an ioctl interface - even an XFS specific
> > > > > interface if you want to go that route, Sage - and it probably
> > > > > should emit a warning to syslog first time it is used so there is
> > > > > trace for bug triage purposes. i.e. we know the app is not using
> > > > > mtime updates, so bug reports that are the result of mtime
> > > > > mishandling don't result in large amounts of wasted developer time
> > > > > trying to understand them...
> > > >
> > > > A warning on using the interface (or when mounting with user_nomtime)
> > > > sounds reasonable.
> > > >
> > > > I'd rather not make this XFS specific as other local filesystmes (ext4,
> > > > f2fs, possibly btrfs) would similarly benefit.  (And if we want to target
> > > > XFS specifically the existing XFS open-by-handle ioctl is sufficient as it
> > > > already does O_NOMTIME unconditionally.)
> > >
> > > Lack of a namespace, doesn't imply that you don't want to manage the
> > > data. The whole point of using object storage instead of plain old
> > > block storage is to be able to provide whatever metadata you still
> > > need in order to manage the object.
> >
> > Yeah, agreed--this is presumably why open_by_handle(2) (which is what we'd
> > like to use) doesn't assume O_NOMTIME.
>
> Right - the XFS ioctls were designed specifically for applications
> that interacted directly with the structure of XFS filesystems and
> so needed invisible IO (e.g. online defragmenter). IOWs, they are
> not interfaces intended for general usage. They are also only
> available to root, so a typical user application won't be making use
> of them, either.

I understand that's what they're intended for, but I'm having a hard time 
parsing out the difference between what they *do* and what O_NOMTIME + -o 
allow_nomtime does.  The open-by-handle ioctls have nothing to do with the 
online XFS format--they simply allow you to open a file via an opaque 
handle (albeit a differently formatted one than the generic 
open_by_handle_at(2)).  They also force you into an O_NOMTIME-equivalent 
mode.

AFAICS the only differences are that

1) the ioctl is XFS specific.  (As open_by_handle_at(2) demonstrates, this 
needn't be the case.)

2) the NOMTIME mode is only available via the open-by-handle interface, 
not open(2).

3) it is an ioctl interface, and thus more obscure.  (Well, there is a 
libhandle library, but it doesn't seem to be widely used.)

Would you object less if 

1) the O_NOMTIME flag were only available via open_by_handle_at(2)?

2) an equivalent ioctl were implemented for each file system of interest 
that (say) called into open_by_handle_at(2) code, adding in the O_NOMTIME 
flag?

3) O_NOMTIME required root (vs a mount option that requires root and 
unprivileged O_NOMTIME)?

Just trying to tease apart which part is problematic...

Thanks!
sage



Re: [PATCH RFC] vfs: add a O_NOMTIME flag

2015-05-11 Thread Sage Weil
On Mon, 11 May 2015, Trond Myklebust wrote:
> On Mon, May 11, 2015 at 12:39 PM, Sage Weil  wrote:
> > On Mon, 11 May 2015, Dave Chinner wrote:
> >> On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote:
> >> > On Fri, May 8, 2015 at 6:24 PM, Sage Weil  wrote:
> >> > > I'm sure you realize what we're try to achieve is the same "invisible 
> >> > > IO"
> >> > > that the XFS open by handle ioctls do by default.  Would you be more
> >> > > comfortable if this option where only available to the generic
> >> > > open_by_handle syscall, and not to open(2)?
> >> >
> >> > It should be an ioctl(). It has no business being part of
> >> > open_by_handle either, since that is another generic interface.
> >
> > Our use-case doesn't make sense on network file systems, but it does on
> > any reasonably featureful local filesystem, and the goal is to be generic
> > there.  If mtime is critical to a network file system's consistency it
> > seems pretty reasonable to disallow/ignore it for just that file system
> > (e.g., by masking off the flag at open time), as others won't have that
> > same problem (cephfs doesn't, for example).
> >
> > Perhaps making each fs opt-in instead of handling it in a generic path
> > would alleviate this concern?
> 
> The issue isn't whether or not you have a network file system, it's
> whether or not you want users to be able to manage data. mtime isn't
> useful for the application (which knows whether or not it has changed
> the file) or for the filesystem (ditto). It exists, rather, in order
> to enable data management by users and other applications, letting
> them know whether or not the data contents of the file have changed,
> and when that change occurred.

Agreed.
 
> If you are able to guarantee that your users don't care about that,
> then fine, but that would be a very special case that doesn't fit the
> way that most data centres are run. Backups are one case where mtime
> matters, tiering and archiving is another.

This is true, although I argue it is becoming increasingly common for the 
data management (including backups and so forth) to be layered not on top 
of the POSIX file system but on something higher up in the stack. This is 
true of pretty much any distributed system (ceph, cassandra, mongo, etc., 
and I assume commercial databases like Oracle, too) where backups, 
replication, and any other DR strategies need to be orchestrated across 
nodes to be consistent--simply copying files out from underneath them is 
already insufficient and a recipe for disaster.

There is a growing category of applications that can benefit from this 
capability...

> Neither of these examples
> cases are under the control of the application that calls
> open(O_NOMTIME).

Wouldn't a mount option (e.g., allow_nomtime) address this concern?  Only 
nodes provisioned explicitly to run these systems would enable this 
option.

> >> I'm happy for it to be an ioctl interface - even an XFS specific
> >> interface if you want to go that route, Sage - and it probably
> >> should emit a warning to syslog first time it is used so there is
> >> trace for bug triage purposes. i.e. we know the app is not using
> >> mtime updates, so bug reports that are the result of mtime
> >> mishandling don't result in large amounts of wasted developer time
> >> trying to understand them...
> >
> > A warning on using the interface (or when mounting with user_nomtime)
> > sounds reasonable.
> >
> > I'd rather not make this XFS specific as other local filesystmes (ext4,
> > f2fs, possibly btrfs) would similarly benefit.  (And if we want to target
> > XFS specifically the existing XFS open-by-handle ioctl is sufficient as it
> > already does O_NOMTIME unconditionally.)
> 
> Lack of a namespace, doesn't imply that you don't want to manage the
> data. The whole point of using object storage instead of plain old
> block storage is to be able to provide whatever metadata you still
> need in order to manage the object.

Yeah, agreed--this is presumably why open_by_handle(2) (which is what we'd 
like to use) doesn't assume O_NOMTIME.

Thanks!
sage


Re: [PATCH RFC] vfs: add a O_NOMTIME flag

2015-05-11 Thread Sage Weil
On Mon, 11 May 2015, Dave Chinner wrote:
> On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote:
> > On Fri, May 8, 2015 at 6:24 PM, Sage Weil  wrote:
> > > I'm sure you realize what we're try to achieve is the same "invisible IO"
> > > that the XFS open by handle ioctls do by default.  Would you be more
> > > comfortable if this option where only available to the generic
> > > open_by_handle syscall, and not to open(2)?
> > 
> > It should be an ioctl(). It has no business being part of
> > open_by_handle either, since that is another generic interface.

Our use-case doesn't make sense on network file systems, but it does on 
any reasonably featureful local filesystem, and the goal is to be generic 
there.  If mtime is critical to a network file system's consistency it 
seems pretty reasonable to disallow/ignore it for just that file system 
(e.g., by masking off the flag at open time), as others won't have that 
same problem (cephfs doesn't, for example).

Perhaps making each fs opt-in instead of handling it in a generic path 
would alleviate this concern?
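
A rough sketch of the per-filesystem opt-in idea, in plain userspace C 
rather than kernel code.  Everything in it is hypothetical: 
SKETCH_O_NOMTIME, struct sketch_fs and sketch_effective_flags() are 
made-up names for illustration, not part of the posted patch or of any 
kernel API.  It only shows the "mask off the flag at open time" behaviour 
for a filesystem that has not opted in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit, for illustration only; not a real open(2) flag. */
#define SKETCH_O_NOMTIME 0x40000000u

/* Hypothetical per-filesystem description: has this fs opted in? */
struct sketch_fs {
    const char *name;
    bool allows_nomtime;
};

/*
 * Return the open flags the filesystem would actually honor.
 * A filesystem that has not opted in silently drops the request,
 * so its mtime handling is unchanged.
 */
static unsigned int sketch_effective_flags(const struct sketch_fs *fs,
                                           unsigned int flags)
{
    assert(fs != NULL);
    if (!fs->allows_nomtime)
        flags &= ~SKETCH_O_NOMTIME;
    return flags;
}
```

With this shape, a filesystem whose consistency depends on mtime simply 
leaves allows_nomtime false and never sees the flag; whether dropping it 
silently is better than returning an error is a separate question.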

> I'm happy for it to be an ioctl interface - even an XFS specific
> interface if you want to go that route, Sage - and it probably
> should emit a warning to syslog first time it is used so there is
> trace for bug triage purposes. i.e. we know the app is not using
> mtime updates, so bug reports that are the result of mtime
> mishandling don't result in large amounts of wasted developer time
> trying to understand them...

A warning on using the interface (or when mounting with user_nomtime) 
sounds reasonable.

I'd rather not make this XFS specific as other local filesystems (ext4, 
f2fs, possibly btrfs) would similarly benefit.  (And if we want to target 
XFS specifically the existing XFS open-by-handle ioctl is sufficient as it 
already does O_NOMTIME unconditionally.)

sage


Re: [PATCH RFC] vfs: add a O_NOMTIME flag

2015-05-11 Thread Sage Weil
On Mon, 11 May 2015, Theodore Ts'o wrote:
> On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote:
> > That makes it completely non-generic though. By putting this in the
> > VFS, you are giving applications a loaded gun that is pointed straight
> > at the application user's head.
> 
> Let me re-ask the question that I asked last week (and was apparently
> ignored).  Why not trying to use the lazytime feature instead of
> pointing a head straight at the application's --- and system
> administrators' --- heads?

Sorry Ted, I thought I responded already.

The goal is to avoid inode writeout entirely when we can, and 
as I understand it lazytime will still force writeout before the inode 
is dropped from the cache.  In systems like Ceph in particular, the 
IOs can be spread across lots of files, so simply deferring writeout 
doesn't always help.

sage


Re: [PATCH RFC] vfs: add a O_NOMTIME flag

2015-05-08 Thread Sage Weil
On Thu, 7 May 2015, Trond Myklebust wrote:
> On Thu, May 7, 2015 at 9:01 PM, Sage Weil  wrote:
> > On Thu, 7 May 2015, Zach Brown wrote:
> >> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
> >> > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
> >> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> >> > > owning the file or having the CAP_FOWNER capability.  If we're not
> >> > > comfortable allowing owners to prevent mtime/ctime updates then we
> >> > > should add a tunable to allow O_NOMTIME.  Maybe a mount option?
> >> >
> >> > I dislike "turn off safety for performance" options because Joe
> >> > SpeedRacer will always select performance over safety.
> >>
> >> Well, for ceph there's no safety concern.  They never use cmtime in
> >> these files.
> >>
> >> So are you suggesting not implementing this and making them rework their
> >> IO paths to avoid the fs maintaining mtime so that we don't give Joe
> >> Speedracer more rope?  Or are we talking about adding some speed bumps
> >> that ceph can flip on that might give Joe Speedracer pause?
> >
> > I think this is the fundamental question: who do we give the ammunition
> > to, the user or app writer, or the sysadmin?
> >
> > One might argue that we gave the user a similar power with O_NOATIME (the
> > power to break applications that assume atime is accurate).  Here we give
> > developers/users the power to not update mtime and suffer the consequences
> > (like, obviously, breaking mtime-based backups).  It should be pretty
> > obvious to anyone using the flag what the consequences are.
> >
> > Note that we can suffer similar lapses in mtime with fdatasync followed by
> > a system crash.  And as Andy points out it's semi-broken for writable
> > mmap.  The crash case is obviously a slightly different thing, but the
> > idea that mtime can't always be trusted certainly isn't crazy talk.
> >
> > Or, we can be conservative and require a mount option so that the admin
> > has to explicitly allow behavior that might break some existing
> > assumptions about mtime/ctime ('-o user_noatime' I guess?).
> >
> > I'm happy either way, so long as in the end an unprivileged ceph daemon
> > avoids the useless work.  In our case we always own the entire mount/disk,
> > so a mount option is just fine.
> >
> 
> So, what is the expectation here for filesystems that cannot support
> this flag? NFSv3 in particular would break pretty catastrophically if
> someone decided on a whim to turn off mtime: they will have turned off
> the client's ability to detect cache incoherencies.

Is this based on mtime or ctime?  If the former, could things also 
break if a user does, say, some stat(2), write(2), utimes(2) shenanigans?
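
A small userspace sketch of that stat/write/utimes sequence, as an 
illustration only.  It assumes a POSIX.1-2008 system and a file the 
caller owns; write_and_hide_mtime() is a made-up helper name.  It appends 
one byte and then puts the old timestamps back with futimens(2), so a 
check that looks only at mtime no longer sees the change.  ctime still 
moves, which is why the mtime-vs-ctime distinction matters here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Append one byte to the file at `path`, then restore its previous
 * atime and mtime.  Returns 1 if the file grew by one byte while its
 * mtime is unchanged, 0 if not, and -1 on error.
 */
static int write_and_hide_mtime(const char *path)
{
    struct stat before, after;
    struct timespec times[2];
    int fd, rc = -1;

    assert(path != NULL);
    fd = open(path, O_WRONLY | O_APPEND);
    if (fd < 0)
        return -1;
    if (fstat(fd, &before) != 0)
        goto out;
    if (write(fd, "x", 1) != 1)
        goto out;
    times[0] = before.st_atim;   /* old access time */
    times[1] = before.st_mtim;   /* old modification time */
    if (futimens(fd, times) != 0)
        goto out;
    if (fstat(fd, &after) != 0)
        goto out;
    rc = (after.st_size == before.st_size + 1 &&
          after.st_mtim.tv_sec == before.st_mtim.tv_sec &&
          after.st_mtim.tv_nsec == before.st_mtim.tv_nsec);
out:
    close(fd);
    return rc;
}
```

Calling it on a scratch file created with mkstemp(3) is enough to try it 
out; no special privileges are needed because the caller owns the file.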

So, my assumption is that if the mount option isn't there allowing this 
then O_NOMTIME would be a no-op (as opposed to EPERM or something)... but 
maybe that's not the right thing to do.  Whatever we do there, though, I 
suppose NFS would do the same thing?

sage


Re: [PATCH RFC] vfs: add a O_NOMTIME flag

2015-05-08 Thread Sage Weil
On Sat, 9 May 2015, Dave Chinner wrote:
> On Thu, May 07, 2015 at 09:23:24PM -0400, Trond Myklebust wrote:
> > On Thu, May 7, 2015 at 9:01 PM, Sage Weil <s...@newdream.net> wrote:
> > > On Thu, 7 May 2015, Zach Brown wrote:
> > >> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
> > >> > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
> > >> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> > >> > > owning the file or having the CAP_FOWNER capability.  If we're not
> > >> > > comfortable allowing owners to prevent mtime/ctime updates then we
> > >> > > should add a tunable to allow O_NOMTIME.  Maybe a mount option?
> > >> >
> > >> > I dislike "turn off safety for performance" options because Joe
> > >> > SpeedRacer will always select performance over safety.
> > >>
> > >> Well, for ceph there's no safety concern.  They never use cmtime in
> > >> these files.
> > >>
> > >> So are you suggesting not implementing this and making them rework their
> > >> IO paths to avoid the fs maintaining mtime so that we don't give Joe
> > >> Speedracer more rope?  Or are we talking about adding some speed bumps
> > >> that ceph can flip on that might give Joe Speedracer pause?
> > >
> > > I think this is the fundamental question: who do we give the ammunition
> > > to, the user or app writer, or the sysadmin?
> > >
> > > One might argue that we gave the user a similar power with O_NOATIME (the
> > > power to break applications that assume atime is accurate).  Here we give
> > > developers/users the power to not update mtime and suffer the consequences
> > > (like, obviously, breaking mtime-based backups).  It should be pretty
> > > obvious to anyone using the flag what the consequences are.
> > >
> > > Note that we can suffer similar lapses in mtime with fdatasync followed by
> > > a system crash.  And as Andy points out it's semi-broken for writable
> > > mmap.  The crash case is obviously a slightly different thing, but the
> > > idea that mtime can't always be trusted certainly isn't crazy talk.
> > >
> > > Or, we can be conservative and require a mount option so that the admin
> > > has to explicitly allow behavior that might break some existing
> > > assumptions about mtime/ctime ('-o user_noatime' I guess?).
> > >
> > > I'm happy either way, so long as in the end an unprivileged ceph daemon
> > > avoids the useless work.  In our case we always own the entire mount/disk,
> > > so a mount option is just fine.
> > >
> > 
> > So, what is the expectation here for filesystems that cannot support
> > this flag? NFSv3 in particular would break pretty catastrophically if
> > someone decided on a whim to turn off mtime: they will have turned off
> > the client's ability to detect cache incoherencies.
> 
> It's worse than that, now that I think about it. I think nomtime
> will break nfsv4 as the I_VERSION check is done *after* the
> NO[C]MTIME checks. e.g. the atomic change count used to detect file
> changes is only updated during the mtime update on write() calls in
> XFS. i.e. when the timestamp is changed, a transaction to change
> mtime is run, and that transaction commit bumps the change count.
> 
> So cutting out mtime updates at the VFS will prevent XFS and other
> I_VERSION aware filesystems from updating the change count that
> NFSv4 clients rely on to detect foreign data changes in a file.
> 
> Not sure what to do here, because the current NOCMTIME
> implementation intentionally cuts out the timestamp update because
> it's usage is fully invisible IO. i.e. it is used by utilities like
> xfs_fsr and HSMs to move data into and out of files without the
> application being able to detect the data movement in any way. These
> are not data modification operations, though - the file contents as
> read by the application do not change despite the fact we are moving
> data in and out of the file. In this case we don't want timestamps
> or change counters to change on the data movement, so I think we've
> actually got a difference in behaviour here between O_NOMTIME and
> O_NOCMTIME, right?
> 
> i.e. for nfsv4 sanity O_NOMTIME still needs to bump I_VERSION on
> write, just not modify the timestamp? In which case, not modifying
> the timestamps gains us nothing, because the inode is still dirtied?

Right: if we dirty the inode we've defeated the purpose of the patch.

> The list of caveats on O_NOMTIME seems to be growing...

...and remain consistent with our goals.  We couldn't care less if NFS or 
backup software or anything else doesn't notice these changes.  This is 
private data that is wholly managed by the ceph daemon.  The goal is to 
derive *some* value from the file system and avoid reimplementing it in 
userspace (without the bits we don't need).

I'm sure you realize what we're try to achieve is the same invisible IO 
that the XFS open by handle ioctls do by default.  Would you be more 
comfortable if this option where only available to the generic 
open_by_handle syscall, and not to open(2)?

sage
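
The ordering problem raised in the quoted text can be seen in a toy model.
This is only a userspace sketch, not kernel code: struct toy_inode,
toy_file_update_time() and the nomtime argument are names made up for
illustration, and the real file_update_time() does more than this.  What it
shows is that when the "skip timestamps" check sits in front of everything
else, the change counter is never bumped either.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the inode fields that matter here; not struct inode. */
struct toy_inode {
	int64_t  mtime;      /* last modification time, in seconds */
	uint64_t i_version;  /* change counter that NFSv4 clients compare */
	bool     dirty;      /* needs a metadata write on the next sync */
};

/*
 * Simplified model of the ordering discussed above: the early exit comes
 * before the change counter is touched.  Returns true if the inode was
 * dirtied by the call.
 */
static bool toy_file_update_time(struct toy_inode *inode, int64_t now,
				 bool nomtime)
{
	if (nomtime)
		return false;   /* no mtime update and no version bump */

	inode->mtime = now;
	inode->i_version++;     /* only reached as part of the time update */
	inode->dirty = true;
	return true;
}
```

Calling toy_file_update_time(&inode, now, true) leaves i_version where it
was, which is the NFSv4 concern; bumping the counter before the early exit
would dirty the inode again, which is the cost the flag is meant to avoid.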

Re: [PATCH RFC] vfs: add a O_NOMTIME flag

2015-05-07 Thread Sage Weil
On Thu, 7 May 2015, Zach Brown wrote:
> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
> > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> > > owning the file or having the CAP_FOWNER capability.  If we're not
> > > comfortable allowing owners to prevent mtime/ctime updates then we
> > > should add a tunable to allow O_NOMTIME.  Maybe a mount option?
> > 
> > I dislike "turn off safety for performance" options because Joe
> > SpeedRacer will always select performance over safety.
> 
> Well, for ceph there's no safety concern.  They never use cmtime in
> these files.
> 
> So are you suggesting not implementing this and making them rework their
> IO paths to avoid the fs maintaining mtime so that we don't give Joe
> Speedracer more rope?  Or are we talking about adding some speed bumps
> that ceph can flip on that might give Joe Speedracer pause?

I think this is the fundamental question: who do we give the ammunition 
to, the user or app writer, or the sysadmin?

One might argue that we gave the user a similar power with O_NOATIME (the 
power to break applications that assume atime is accurate).  Here we give 
developers/users the power to not update mtime and suffer the consequences 
(like, obviously, breaking mtime-based backups).  It should be pretty 
obvious to anyone using the flag what the consequences are.

Note that we can suffer similar lapses in mtime with fdatasync followed by 
a system crash.  And as Andy points out it's semi-broken for writable 
mmap.  The crash case is obviously a slightly different thing, but the 
idea that mtime can't always be trusted certainly isn't crazy talk.

Or, we can be conservative and require a mount option so that the admin 
has to explicitly allow behavior that might break some existing 
assumptions about mtime/ctime ('-o user_noatime' I guess?).

I'm happy either way, so long as in the end an unprivileged ceph daemon 
avoids the useless work.  In our case we always own the entire mount/disk, 
so a mount option is just fine.

Thanks!
sage
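
For comparison, the existing O_NOATIME flag mentioned above can be exercised
from userspace roughly like this.  A minimal sketch: the helper name
read_without_atime() is made up for illustration, and the open is expected to
fail with EPERM when the caller neither owns the file nor has CAP_FOWNER,
which is the same rule the patch proposes for O_NOMTIME.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE           /* O_NOATIME is a Linux extension */
#endif
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Read one byte from `path` with O_NOATIME and report whether atime moved.
 * Returns 1 if atime is unchanged, 0 if it changed, -1 on error.
 */
static int read_without_atime(const char *path)
{
	struct stat before, after;
	char byte;
	int fd;

	if (stat(path, &before) != 0)
		return -1;

	fd = open(path, O_RDONLY | O_NOATIME);
	if (fd < 0)
		return -1;      /* e.g. EPERM for a file we do not own */

	if (read(fd, &byte, 1) < 0) {
		close(fd);
		return -1;
	}
	close(fd);

	if (stat(path, &after) != 0)
		return -1;

	return before.st_atim.tv_sec == after.st_atim.tv_sec &&
	       before.st_atim.tv_nsec == after.st_atim.tv_nsec;
}
```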
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] vfs: add a O_NOMTIME flag

2015-05-06 Thread Sage Weil
On Wed, 6 May 2015, Zach Brown wrote:
> On Wed, May 06, 2015 at 03:19:13PM -0700, Sage Weil wrote:
> > On Wed, 6 May 2015, Trond Myklebust wrote:
> > > Hi Zach,
> > > 
> > > > On Wed, May 6, 2015 at 6:00 PM, Zach Brown <z...@redhat.com> wrote:
> > > >
> > > > Add the O_NOMTIME flag which prevents mtime from being updated which can
> > > > greatly reduce the IO overhead of writes to allocated and initialized
> > > > regions of files.
> > > >
> > > > ceph servers can have loads where they perform O_DIRECT overwrites of
> > > > allocated file data and then sync to make sure that the O_DIRECT writes
> > > > are flushed from write caches.  If the writes dirty the inode with mtime
> > > > updates then the syncs also write out the metadata needed to track the
> > > > inodes which can add significant iop and latency overhead.
> > > >
> > > > The ceph servers don't use mtime at all.  They're using the local file
> > > > system as a backing store and any backups would be driven by their upper
> > > > level ceph metadata.  For ceph, slow IO from mtime updates in the file
> > > > system is as daft as if we had block devices slowing down IO for
> > > > per-block write timestamps that file systems never use.
> > > >
> > > > In simple tests a O_DIRECT|O_NOMTIME overwriting write followed by a
> > > > sync went from 2 serial write round trips to 1 in XFS and from 4 serial
> > > > IO round trips to 1 in ext4.
> > > >
> > > > file_update_time() checks for O_NOMTIME and aborts the update if it's
> > > > set, just like the current check for the in-kernel inode flag
> > > > S_NOCMTIME.  I didn't update any other mtime update sites. They could be
> > > > added as we decide that it's appropriate to do so.
> > > >
> > > > I opted not to name the flag O_NOCMTIME because I didn't want the name
> > > > to imply that ctime updates would be prevented for other inode changes
> > > > like updating i_size in truncate.  Not updating ctime is a side-effect
> > > > of removing mtime updates when it's the only thing changing in the
> > > > inode.
> > > >
> > > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> > > > owning the file or having the CAP_FOWNER capability.  If we're not
> > > > comfortable allowing owners to prevent mtime/ctime updates then we
> > > > should add a tunable to allow O_NOMTIME.  Maybe a mount option?
> > > >
> > > 
> > > Just out of curiosity, if you need to modify the application anyway,
> > > why wouldn't use of fdatasync() when flushing be able to offer a
> > > similar performance boost?
> > 
> > Although fdatasync(2) doesn't have to update synchronously, it does 
> > eventually get written, and that can trigger lots of unwanted IO.
> 
> And the unwanted IO is per file.  Are there circumstances where the
> write:file ratio is small enough that dirty inode writes could start to
> add up to meaningful write amplification?

Yeah, exactly: in some not-so-uncommon workloads it's approaching 1:1.

sage


Re: [PATCH RFC] vfs: add a O_NOMTIME flag

2015-05-06 Thread Sage Weil
On Wed, 6 May 2015, Trond Myklebust wrote:
> Hi Zach,
> 
> > On Wed, May 6, 2015 at 6:00 PM, Zach Brown <z...@redhat.com> wrote:
> >
> > Add the O_NOMTIME flag which prevents mtime from being updated which can
> > greatly reduce the IO overhead of writes to allocated and initialized
> > regions of files.
> >
> > ceph servers can have loads where they perform O_DIRECT overwrites of
> > allocated file data and then sync to make sure that the O_DIRECT writes
> > are flushed from write caches.  If the writes dirty the inode with mtime
> > updates then the syncs also write out the metadata needed to track the
> > inodes which can add significant iop and latency overhead.
> >
> > The ceph servers don't use mtime at all.  They're using the local file
> > system as a backing store and any backups would be driven by their upper
> > level ceph metadata.  For ceph, slow IO from mtime updates in the file
> > system is as daft as if we had block devices slowing down IO for
> > per-block write timestamps that file systems never use.
> >
> > In simple tests a O_DIRECT|O_NOMTIME overwriting write followed by a
> > sync went from 2 serial write round trips to 1 in XFS and from 4 serial
> > IO round trips to 1 in ext4.
> >
> > file_update_time() checks for O_NOMTIME and aborts the update if it's
> > set, just like the current check for the in-kernel inode flag
> > S_NOCMTIME.  I didn't update any other mtime update sites. They could be
> > added as we decide that it's appropriate to do so.
> >
> > I opted not to name the flag O_NOCMTIME because I didn't want the name
> > to imply that ctime updates would be prevented for other inode changes
> > like updating i_size in truncate.  Not updating ctime is a side-effect
> > of removing mtime updates when it's the only thing changing in the
> > inode.
> >
> > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> > owning the file or having the CAP_FOWNER capability.  If we're not
> > comfortable allowing owners to prevent mtime/ctime updates then we
> > should add a tunable to allow O_NOMTIME.  Maybe a mount option?
> >
> 
> Just out of curiosity, if you need to modify the application anyway,
> why wouldn't use of fdatasync() when flushing be able to offer a
> similar performance boost?

Although fdatasync(2) doesn't have to update synchronously, it does 
eventually get written, and that can trigger lots of unwanted IO.

In practice we fsync(2) to avoid deferred IO that we can't control/bound, 
but that's a long and sad story.  O_NOMTIME would make for a much better 
ending!

sage
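
The workload under discussion looks roughly like the following from
userspace.  This is a sketch under assumptions, not the ceph code:
overwrite_and_flush() and DEMO_BLOCK_SIZE are made-up names, the 4096-byte
alignment is assumed rather than queried from the device, and the fallback
to a buffered open is only there so the example still runs on filesystems
that reject O_DIRECT.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE           /* O_DIRECT is exposed as a Linux extension */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DEMO_BLOCK_SIZE 4096  /* assumed alignment for O_DIRECT */

/*
 * Overwrite the first block of an existing, already allocated file and
 * flush it.  Returns 0 on success, -1 on error.
 */
static int overwrite_and_flush(const char *path, unsigned char fill)
{
	void *buf = NULL;
	int ret = -1;
	int fd;

	fd = open(path, O_WRONLY | O_DIRECT);
	if (fd < 0 && errno == EINVAL)
		fd = open(path, O_WRONLY);  /* filesystem without O_DIRECT */
	if (fd < 0)
		return -1;

	/* O_DIRECT needs an aligned buffer, offset and length. */
	if (posix_memalign(&buf, DEMO_BLOCK_SIZE, DEMO_BLOCK_SIZE) != 0)
		goto out;
	memset(buf, fill, DEMO_BLOCK_SIZE);

	if (pwrite(fd, buf, DEMO_BLOCK_SIZE, 0) != DEMO_BLOCK_SIZE)
		goto out;

	/*
	 * fdatasync() need not write a timestamp-only inode update now,
	 * but the inode stays dirty and is written back later; fsync()
	 * would write it immediately.
	 */
	if (fdatasync(fd) != 0)
		goto out;

	ret = 0;
out:
	free(buf);
	close(fd);
	return ret;
}
```

With the proposed flag the open would additionally pass O_NOMTIME, so the
write would not dirty the inode in the first place.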


[GIT PULL] Ceph RBD fix for -rc2

2015-05-01 Thread Sage Weil
Hi Linus,

Please pull the following RBD fix from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

Thanks!
sage


Ilya Dryomov (1):
  rbd: end I/O the entire obj_request on error

 drivers/block/rbd.c |5 +
 1 file changed, 5 insertions(+)


[GIT PULL] Ceph updates for 4.1-rc1

2015-04-22 Thread Sage Weil
Hi Linus,

Please pull the following Ceph updates from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

This time around we have a collection of CephFS fixes from Zheng around 
MDS failure handling and snapshots, support for a new CRUSH straw2 
algorithm (to sync up with userspace) and several RBD cleanups and fixes 
from Ilya, an error path leak fix from Taesoo, and then an assorted 
collection of cleanups from others.

Thanks!
sage



Fabian Frederick (1):
  ceph: remove redundant declaration

Ilya Dryomov (12):
  rbd: be more informative on -ENOENT failures
  libceph: don't overwrite specific con error msgs
  rbd: mark block queue as non-rotational
  libceph, ceph: split ceph_show_options()
  libceph: expose client options through debugfs
  ceph: show non-default options only
  libceph: simplify our debugfs attr macro
  crush: drop unnecessary include from mapper.c
  crush: ensuring at most num-rep osds are selected
  crush: straw2 bucket type with an efficient 64-bit crush_ln()
  libceph: announce support for straw2 buckets
  rbd: rbd_wq comment is obsolete

Joe Perches (1):
  libceph: osdmap.h: Add missing format newlines

Nicholas Mc Guire (2):
  ceph: use msecs_to_jiffies for time conversion
  ceph: match wait_for_completion_timeout return type

Sanidhya Kashyap (1):
  ceph: kstrdup() memory handling

Taesoo Kim (1):
  ceph: properly release page upon error

Yan, Zheng (10):
  ceph: drop cap releases in requests composed before cap reconnect
  ceph: fix dcache/nocache mount option
  ceph: keep i_snap_realm while there are writers
  ceph: don't mark dirty caps when there is no auth cap
  ceph: don't zero i_wrbuffer_ref when reconnecting is denied
  ceph: cleanup unsafe requests when reconnecting is denied
  ceph: hold on to exclusive caps on complete directories
  ceph: fix null pointer dereference in send_mds_reconnect()
  ceph: rename snapshot support
  ceph: fix uninline data function

 drivers/block/rbd.c|   26 --
 fs/ceph/addr.c |   38 ++---
 fs/ceph/caps.c |   51 ---
 fs/ceph/dir.c  |   48 ---
 fs/ceph/mds_client.c   |   61 +
 fs/ceph/strings.c  |1 +
 fs/ceph/super.c|   56 ++--
 fs/ceph/super.h|4 +-
 fs/ceph/xattr.c|   23 +++--
 include/linux/ceph/ceph_features.h |   16 +++-
 include/linux/ceph/ceph_fs.h   |1 +
 include/linux/ceph/debugfs.h   |8 +-
 include/linux/ceph/libceph.h   |2 +
 include/linux/ceph/osdmap.h|5 +-
 include/linux/crush/crush.h|   12 ++-
 net/ceph/ceph_common.c |   37 
 net/ceph/crush/crush.c |   14 +++
 net/ceph/crush/crush_ln_table.h|  166 
 net/ceph/crush/mapper.c|  118 +++--
 net/ceph/debugfs.c |   24 ++
 net/ceph/messenger.c   |   25 +++---
 net/ceph/osdmap.c  |   25 ++
 22 files changed, 633 insertions(+), 128 deletions(-)
 create mode 100644 net/ceph/crush/crush_ln_table.h


Re: randconfig build error with next-20150421, in net/ceph

2015-04-21 Thread Sage Weil
On Tue, 21 Apr 2015, Guenter Roeck wrote:
> On Tue, Apr 21, 2015 at 08:10:44AM -0700, Jim Davis wrote:
> > Building with the attached random configuration file,
> > 
> > ERROR: "__divdi3" [net/ceph/libceph.ko] undefined!
> 
> Commit 7321f19d ("crush: straw2 bucket type with an efficient 64-bit 
> crush_ln()").
> 
> +   draw = ln / w;
> 
> where 'ln' is 64 bit.
> 
> Some other oddities in that patch, such as
> 
> +#if defined(__linux__)
> +#include <linux/types.h>
> +#elif defined(__FreeBSD__)
> +#include <sys/types.h>
> +#endif
> 
> and lots of coding style violations.

Thanks for the report--we'll fix it up!

sage
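
For reference, the usual way to avoid this class of link error is to route
64-bit divisions through the kernel's div64_s64() helper from
<linux/math64.h> rather than using a bare '/', since 32-bit kernel builds do
not link libgcc's __divdi3.  The sketch below is a userspace stand-in so that
it can be compiled anywhere: the local div64_s64() is a plain division, and
straw2_draw() is a hypothetical reduction of the flagged line, not the code
from the patch.

```c
#include <stdint.h>

/*
 * Userspace stand-in for the kernel's div64_s64().  In the kernel the
 * helper exists so that 32-bit builds do not need libgcc's __divdi3;
 * here it is just a plain division so the sketch runs.
 */
static inline int64_t div64_s64(int64_t dividend, int64_t divisor)
{
	return dividend / divisor;
}

/* Hypothetical reduction of the flagged line: scale a 64-bit value. */
static int64_t straw2_draw(int64_t ln, uint32_t weight)
{
	/* was: draw = ln / w;  (needs __divdi3 on 32-bit kernels) */
	return div64_s64(ln, (int64_t)weight);
}
```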


[GIT PULL] Ceph fix for -rc8

2015-04-07 Thread Sage Weil
Hi Linus,

Please pull the following patch from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

This corrects a recent misadventure with __GFP_MEMALLOC and PF_MEMALLOC; 
it turns out it's not a good fit for RBD and we're better off relying on 
dirty page throttling.

Thanks!
sage



Ilya Dryomov (1):
  Revert "libceph: use memalloc flags for net IO"

 net/ceph/messenger.c |9 +
 1 file changed, 1 insertion(+), 8 deletions(-)


[GIT PULL] Ceph changes for {3.20,4.0}-rc1

2015-02-19 Thread Sage Weil
Hi Linus,

Please pull the following Ceph updates from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

On the RBD side, there is a conversion to blk-mq from Christoph, several 
long-standing bug fixes from Ilya, and some cleanup from Rickard 
Strandqvist.  On the CephFS side there is a long list of fixes from Zheng, 
including improved session handling, a few IO path fixes, some dcache 
management correctness fixes, and several blocking while !TASK_RUNNING 
fixes.  The core code gets a few cleanups and Chaitanya has added support 
for TCP_NODELAY (which has been used on the server side for ages but we 
somehow missed on the kernel client).
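
For readers unfamiliar with the option, this is roughly what enabling
TCP_NODELAY looks like on a userspace socket.  It is only an analogue of what
the kernel client now does internally, not the libceph code, and the helper
name tcp_socket_nodelay() is made up for illustration.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create a TCP socket with Nagle's algorithm disabled, so small writes
 * are sent immediately instead of being coalesced.
 * Returns the socket descriptor, or -1 on error.
 */
static int tcp_socket_nodelay(void)
{
	int one = 1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) != 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```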

There is also an update to MAINTAINERS to fix up some email addresses and 
reflect that Ilya and Zheng are doing most of the maintenance for RBD and 
CephFS these days.  Do not be surprised to see a pull request come from 
one of them in the future if I am unavailable for some reason.

Thanks!
sage


Chaitanya Huilgol (1):
  libceph: tcp_nodelay support

Christoph Hellwig (1):
  rbd: convert to blk-mq

Ilya Dryomov (7):
  libceph: nuke pool op infrastructure
  libceph: use mon_client.c/put_generic_request() more
  rbd: fix error paths in rbd_dev_refresh()
  rbd: do not treat standalone as flatten
  ceph: show nocephx_require_signatures and notcp_nodelay options
  libceph: fix double __remove_osd() problem
  libceph: kfree() in put_osd() shouldn't depend on authorizer

Rickard Strandqvist (2):
  rbd: nuke copy_token()
  ceph: acl: Remove unused function

Sage Weil (1):
  MAINTAINERS: update Ceph and RBD maintainers

Yan, Zheng (15):
  ceph: handle SESSION_FORCE_RO message
  ceph: properly zero data pages for file holes.
  ceph: improve reference tracking for snaprealm
  ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_sync)
  ceph: avoid block operation when !TASK_RUNNING (ceph_get_caps)
  ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_close_sessions)
  ceph: fix reading inline data when i_size > PAGE_SIZE
  ceph: fix request time stamp encoding
  ceph: provide seperate {inode,file}_operations for snapdir
  client: include kernel version in client metadata
  ceph: properly mark empty directory as complete
  ceph: fix atomic_open snapdir
  ceph: re-send requests when MDS enters reconnecting stage
  ceph: fix dentry leaks
  ceph: return error for traceless reply race

 MAINTAINERS                     |    7 +-
 drivers/block/rbd.c             |  193 +--
 fs/ceph/acl.c                   |   14 ---
 fs/ceph/addr.c                  |   19 ++--
 fs/ceph/caps.c                  |  127 +++---
 fs/ceph/dir.c                   |   33 +--
 fs/ceph/file.c                  |   37 +---
 fs/ceph/inode.c                 |   41 +
 fs/ceph/mds_client.c            |  127 +++---
 fs/ceph/mds_client.h            |    2 +
 fs/ceph/snap.c                  |   54 +++
 fs/ceph/super.c                 |    4 +
 fs/ceph/super.h                 |    5 +-
 include/linux/ceph/ceph_fs.h    |   37 +---
 include/linux/ceph/libceph.h    |    3 +-
 include/linux/ceph/messenger.h  |    4 +-
 include/linux/ceph/mon_client.h |    9 +-
 net/ceph/ceph_common.c          |   16 +++-
 net/ceph/ceph_strings.c         |   14 ---
 net/ceph/debugfs.c              |    2 -
 net/ceph/messenger.c            |   14 ++-
 net/ceph/mon_client.c           |  139 +---
 net/ceph/osd_client.c           |   31 +--
 23 files changed, 444 insertions(+), 488 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] Ceph fixes for -rc7

2015-01-28 Thread Sage Weil
Hi Linus,

Please pull the following two patches from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

These patches from Ilya finally squash a race condition with layered images 
that he's been chasing for a while.

Thanks!
sage


Ilya Dryomov (2):
  rbd: fix rbd_dev_parent_get() when parent_overlap == 0
  rbd: drop parent_ref in rbd_dev_unprobe() unconditionally

 drivers/block/rbd.c | 25 +++--
 1 file changed, 7 insertions(+), 18 deletions(-)


[GIT PULL] Ceph fixes for -rc4

2015-01-08 Thread Sage Weil
Hi Linus,

Please pull the following Ceph fixes from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client for-linus

These are both pretty trivial: a sparse warning fix and a size_t printk 
format fix.
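
For context on the size_t item, a small userspace sketch of the same idea, assuming plain snprintf() in place of the kernel's printk: size_t values take the %zu conversion, since %d or %lu only happen to work on some architectures. The function name format_len() is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format a length into `out`.  size_t is printed
 * with %zu so the format stays correct whether size_t is 32 or 64 bits.
 * Returns the number of characters snprintf() would have written.
 */
static int format_len(char *out, size_t outsz, size_t len)
{
	assert(out != NULL);
	return snprintf(out, outsz, "len %zu", len);
}
```

Compilers that check format strings (for example gcc with -Wformat) flag a mismatched conversion here, which is how this class of bug usually surfaces.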

Thanks!
sage


Ilya Dryomov (2):
  ceph: use %zu for len in ceph_fill_inline_data()
  libceph: fix sparse endianness warnings

 fs/ceph/addr.c                  | 2 +-
 include/linux/ceph/osd_client.h | 4 ++--
 net/ceph/auth_x.c               | 2 +-
 net/ceph/mon_client.c           | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)


[GIT PULL] Ceph updates for 3.19-rc1

2014-12-17 Thread Sage Weil
Hi Linus,

Please pull the following Ceph updates from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

The big item here is support for inline data for CephFS and for message 
signatures from Zheng.  There are also several bug fixes, including 
interrupted flock request handling, 0-length xattrs, mksnap, cached 
readdir results, and a message version compat field.  Finally there are 
several cleanups from Ilya, Dan, and Markus.

Note that there is another series coming soon that fixes some bugs in the 
RBD 'lingering' requests, but it isn't quite ready yet.

Thanks!
sage



Dan Carpenter (1):
  ceph: do_sync is never initialized

Ilya Dryomov (4):
  libceph: nuke ceph_kvfree()
  ceph: remove unused stringification macros
  rbd: don't treat CEPH_OSD_OP_DELETE as extent op
  libceph: fixup includes in pagelist.h

John Spray (2):
  libceph: update ceph_msg_header structure
  ceph: message versioning fixes

SF Markus Elfring (1):
  ceph, rbd: delete unnecessary checks before two function calls

Yan, Zheng (19):
  ceph: fix file lock interruption
  ceph: introduce a new inode flag indicating if cached dentries are ordered
  libceph: store session key in cephx authorizer
  libceph: message signature support
  ceph: introduce global empty snap context
  libceph: require cephx message signature by default
  libceph: add SETXATTR/CMPXATTR osd operations support
  libceph: add CREATE osd operation support
  libceph: specify position of extent operation
  ceph: parse inline data in MClientReply and MClientCaps
  ceph: add inline data to pagecache
  ceph: use getattr request to fetch inline data
  ceph: fetch inline data when getting Fcr cap refs
  ceph: sync read inline data
  ceph: convert inline data to normal data before data write
  ceph: flush inline version
  ceph: support inline data feature
  ceph: fix mksnap crash
  ceph: fix setting empty extended attribute

 drivers/block/rbd.c                |  11 +-
 fs/ceph/addr.c                     | 273 +++--
 fs/ceph/caps.c                     | 132 ++
 fs/ceph/dir.c                      |  27 ++--
 fs/ceph/file.c                     |  97 +++--
 fs/ceph/inode.c                    |  59 ++--
 fs/ceph/locks.c                    |  64 +++--
 fs/ceph/mds_client.c               |  41 +-
 fs/ceph/mds_client.h               |  10 ++
 fs/ceph/snap.c                     |  37 -
 fs/ceph/super.c                    |  16 ++-
 fs/ceph/super.h                    |  55 ++--
 fs/ceph/super.h.rej                |  10 ++
 fs/ceph/xattr.c                    |   7 +-
 include/linux/ceph/auth.h          |  26
 include/linux/ceph/buffer.h        |   3 +-
 include/linux/ceph/ceph_features.h |   1 +
 include/linux/ceph/ceph_fs.h       |  10 +-
 include/linux/ceph/libceph.h       |   2 +-
 include/linux/ceph/messenger.h     |   9 +-
 include/linux/ceph/msgr.h          |  11 +-
 include/linux/ceph/osd_client.h    |  13 +-
 include/linux/ceph/pagelist.h      |   4 +-
 net/ceph/auth_x.c                  |  76 ++-
 net/ceph/auth_x.h                  |   1 +
 net/ceph/buffer.c                  |   4 +-
 net/ceph/ceph_common.c             |  21 +--
 net/ceph/messenger.c               |  34 -
 net/ceph/osd_client.c              | 118
 29 files changed, 992 insertions(+), 180 deletions(-)
 create mode 100644 fs/ceph/super.h.rej


[GIT PULL] Ceph fixes for -rc5

2014-11-13 Thread Sage Weil
Hi Linus,

Please pull the following Ceph fixes from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

There is an overflow bug fix for cephfs from Zheng, a fix for handling 
large authentication ticket buffers in libceph from Ilya, and a few fixes 
for the request handling code from Ilya that affect RBD volumes.

Thanks!
sage


Ilya Dryomov (4):
  libceph: do not crash on large auth tickets
  libceph: unlink from o_linger_requests when clearing r_osd
  libceph: clear r_req_lru_item in __unregister_linger_request()
  libceph: change from BUG to WARN for __remove_osd() asserts

Yan, Zheng (1):
  ceph: fix flush tid comparision

 fs/ceph/caps.c        |    2 +-
 net/ceph/crypto.c     |  169 ++---
 net/ceph/osd_client.c |    7 +-
 3 files changed, 138 insertions(+), 40 deletions(-)


Re: [PATCH v5 7/7] fs: add a flag for per-operation O_DSYNC semantics

2014-11-10 Thread Sage Weil
On Wed, 5 Nov 2014, Milosz Tanski wrote:
> From: Christoph Hellwig 
> 
> With the new read/write with flags syscalls we can support a flag
> to enable O_DSYNC semantics on a per-operation basis.  This is
> useful to implement protocols like SMB, NFS or SCSI that have such
> per-operation flags.
> 
> Example program below:
> 
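> cat > pwritev2.c << EOF
> #include <fcntl.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/syscall.h>
> #include <sys/uio.h>
> 
> #define LO_HI_LONG(val) \
> (off_t) val,  \
> (off_t) ((((uint64_t) (val)) >> (sizeof (long) * 4)) >> (sizeof (long) * 4))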
> 
> static ssize_t
> pwritev2(int fd, const struct iovec *iov, int iovcnt, off_t offset, int flags)
> {
> return syscall(__NR_pwritev2, fd, iov, iovcnt, LO_HI_LONG(offset),
>flags);
> }
> 
> int main(int argc, char **argv)
> {
>   int fd = open(argv[1], O_WRONLY|O_CREAT|O_TRUNC, 0666);
>   char buf[1024];
>   struct iovec iov = { .iov_base = buf, .iov_len = 1024 };
>   int ret;
> 
> if (fd < 0) {
> perror("open");
> return 0;
> }
> 
>   memset(buf, 0xfe, sizeof(buf));
> 
>   ret = pwritev2(fd, &iov, 1, 0, RWF_DSYNC);
>   if (ret < 0)
>   perror("pwritev2");
>   else
>   printf("ret = %d\n", ret);
> 
>   return 0;
> }
> EOF
> 
> Signed-off-by: Christoph Hellwig 
> [mil...@adfin.com: added flag check to compat_do_readv_writev()]
> Signed-off-by: Milosz Tanski 

Ceph bits

Acked-by: Sage Weil 
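
For anyone who wants to try per-operation O_DSYNC semantics from userspace without a raw syscall wrapper, here is a minimal sketch. It assumes a C library that declares pwritev2() and RWF_DSYNC in <sys/uio.h>, which is an assumption about the reader's system and not something this thread establishes. The helper names write_dsync() and write_pattern() and the fallback to pwritev() plus fdatasync() are illustrative choices, not part of the patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `len` bytes at `offset` and request
 * data-sync durability for this one call.  If the per-call flag is not
 * available, degrade to an ordinary positional write followed by
 * fdatasync().  Returns bytes written, or -1 with errno set.
 */
static ssize_t write_dsync(int fd, const void *buf, size_t len, off_t offset)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	ssize_t ret;

	assert(buf != NULL);

#ifdef RWF_DSYNC
	ret = pwritev2(fd, &iov, 1, offset, RWF_DSYNC);
	if (ret >= 0)
		return ret;
	/* Only fall back when the flag or syscall itself is unsupported. */
	if (errno != ENOSYS && errno != EOPNOTSUPP && errno != EINVAL)
		return -1;
#endif
	ret = pwritev(fd, &iov, 1, offset);
	if (ret >= 0 && fdatasync(fd) < 0)
		return -1;
	return ret;
}

/* Usage modelled on the example program quoted above: 1024 bytes of 0xfe. */
static int write_pattern(const char *path)
{
	char buf[1024];
	ssize_t ret;
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0666);

	if (fd < 0)
		return -1;
	memset(buf, 0xfe, sizeof(buf));
	ret = write_dsync(fd, buf, sizeof(buf), 0);
	close(fd);
	return ret == (ssize_t)sizeof(buf) ? 0 : -1;
}
```

The fallback syncs all dirty data of the file rather than just the written range, so it is a coarser guarantee than the per-call flag; a caller that cares about the difference may prefer to fail instead.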

> ---
>  fs/ceph/file.c     |  4 +++-
>  fs/fuse/file.c     |  2 ++
>  fs/nfs/file.c      | 10 ++
>  fs/ocfs2/file.c    |  6 --
>  fs/read_write.c    | 20 +++-
>  include/linux/fs.h |  3 ++-
>  mm/filemap.c       |  4 +++-
>  7 files changed, 35 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index b798b5c..2d4e15a 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -983,7 +983,9 @@ retry_snap:
>   ceph_put_cap_refs(ci, got);
>  
>   if (written >= 0 &&
> - ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host) ||
> + ((file->f_flags & O_SYNC) ||
> +  IS_SYNC(file->f_mapping->host) ||
> +  (iocb->ki_rwflags & RWF_DSYNC) ||
>ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
>   err = vfs_fsync_range(file, pos, pos + written - 1, 1);
>   if (err < 0)
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index caa8d95..bb4fb23 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1248,6 +1248,8 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, 
> struct iov_iter *from)
>   written += written_buffered;
>   iocb->ki_pos = pos + written_buffered;
>   } else {
> + if (iocb->ki_rwflags & RWF_DSYNC)
> + return -EINVAL;
>   written = fuse_perform_write(file, mapping, from, pos);
>   if (written >= 0)
>   iocb->ki_pos = pos + written;
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index aa9046f..c59b0b7 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -652,13 +652,15 @@ static const struct vm_operations_struct 
> nfs_file_vm_ops = {
>   .remap_pages = generic_file_remap_pages,
>  };
>  
> -static int nfs_need_sync_write(struct file *filp, struct inode *inode)
> +static int nfs_need_sync_write(struct kiocb *iocb, struct inode *inode)
>  {
>   struct nfs_open_context *ctx;
>  
> - if (IS_SYNC(inode) || (filp->f_flags & O_DSYNC))
> + if (IS_SYNC(inode) ||
> + (iocb->ki_filp->f_flags & O_DSYNC) ||
> + (iocb->ki_rwflags & RWF_DSYNC))
>   return 1;
> - ctx = nfs_file_open_context(filp);
> + ctx = nfs_file_open_context(iocb->ki_filp);
>   if (test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags) ||
>   nfs_ctx_key_to_expire(ctx))
>   return 1;
> @@ -705,7 +707,7 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct 
> iov_iter *from)
>   written = result;
>  
>   /* Return error values for O_DSYNC and IS_SYNC() */
> - if (result >= 0 && nfs_need_sync_write(file, inode)) {
> + if (result >= 0 && nfs_need_sync_write(iocb, inode)) {
>   int err = vfs_fsync(file, 0);
>   if (err < 0)
>   result = err;
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index bb66ca4..8f9a86b 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
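> @@ -2374,8 +2374,10 @@ out_dio:
>   /* buffered aio wouldn't have proper lock coverage today */
>   BUG_ON(ret == -EIOCBQUEUED && !(file->f_flags & O_DIRECT));
>  
> - if (((file->f_flags & O_DSYNC) && !direct_io) || IS_SYNC(inode) ||
> - ((file->f_flags & O_DIRECT) && !direct_io)) {
> + if (((file->f_flags & O_DSYNC) && !direct_io) ||
> + IS_SYNC(inode) ||
> + ((file->f_flags & O_DIRECT) && !direct_io) ||
> + (iocb->ki_rwflags & RWF_DSYNC)) {
>   ret = filemap_fdatawrite_range(file->f_mapping, *ppos,
>  *ppos + count - 1);
>   if (ret < 0)
> diff --git a/fs/read_write.c b/fs/read_write.c
> index cba7d4c..3443265 100644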

Re: [PATCH v5 4/7] vfs: RWF_NONBLOCK flag for preadv2

2014-11-10 Thread Sage Weil
On Wed, 5 Nov 2014, Milosz Tanski wrote:
> generic_file_read_iter() supports a new flag RWF_NONBLOCK which says that we
> only want to read the data if it's already in the page cache.
> 
> Additionally, a few filesystems have to bail out early when RWF_NONBLOCK
> is set, because the operation would block. Christoph Hellwig
> contributed this code.
> 
> Signed-off-by: Milosz Tanski 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: Jeff Moyer 

Ceph bits

Acked-by: Sage Weil 
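
To illustrate the "only read what is already in the page cache" pattern from the caller's side, here is a userspace sketch. RWF_NONBLOCK is the name proposed in this series and cannot be assumed to exist in installed headers, so the sketch swaps in RWF_NOWAIT, a flag with similar intent that some C libraries expose through preadv2(). That substitution, the helper names read_cached_first() and read_head(), and the fallback policy are all assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: first try a read that must not block.  If the
 * kernel says it would block (EAGAIN) or the flag is unsupported, fall
 * back to an ordinary blocking read.  *was_cached is set to 1 only when
 * the non-blocking attempt itself returned (data or EOF).
 * Returns bytes read, or -1 with errno set.
 */
static ssize_t read_cached_first(int fd, void *buf, size_t len, off_t offset,
				 int *was_cached)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t ret;

	assert(buf != NULL && was_cached != NULL);
	*was_cached = 0;

#ifdef RWF_NOWAIT
	ret = preadv2(fd, &iov, 1, offset, RWF_NOWAIT);
	if (ret >= 0) {
		*was_cached = 1;
		return ret;
	}
	if (errno != EAGAIN && errno != EOPNOTSUPP &&
	    errno != ENOSYS && errno != EINVAL)
		return -1;
#endif
	/* Slow path: this call may sleep waiting for I/O. */
	return preadv(fd, &iov, 1, offset);
}

/* Hypothetical usage: read the first `len` bytes of a file by path. */
static ssize_t read_head(const char *path, void *buf, size_t len,
			 int *was_cached)
{
	ssize_t ret;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	ret = read_cached_first(fd, buf, len, 0, was_cached);
	close(fd);
	return ret;
}
```

In a real server the slow path would more likely be handed to a worker thread than executed inline; the inline fallback here just keeps the sketch short.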


> ---
>  fs/ceph/file.c     |  2 ++
>  fs/cifs/file.c     |  6 ++
>  fs/nfs/file.c      |  5 -
>  fs/ocfs2/file.c    |  6 ++
>  fs/pipe.c          |  3 ++-
>  fs/read_write.c    | 38 +-
>  fs/xfs/xfs_file.c  |  4
>  include/linux/fs.h |  3 +++
>  mm/filemap.c       | 18 ++
>  mm/shmem.c         |  4
>  10 files changed, 74 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index d7e0da8..b798b5c 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -822,6 +822,8 @@ again:
>   if ((got & (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) == 0 ||
>   (iocb->ki_filp->f_flags & O_DIRECT) ||
>   (fi->flags & CEPH_F_SYNC)) {
> + if (iocb->ki_rwflags & O_NONBLOCK)
> + return -EAGAIN;
>  
>   dout("aio_sync_read %p %llx.%llx %llu~%u got cap refs on %s\n",
>inode, ceph_vinop(inode), iocb->ki_pos, (unsigned)len,
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 3e4d00a..c485afa 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -3005,6 +3005,9 @@ ssize_t cifs_user_readv(struct kiocb *iocb, struct 
> iov_iter *to)
>   struct cifs_readdata *rdata, *tmp;
>   struct list_head rdata_list;
>  
> + if (iocb->ki_rwflags & RWF_NONBLOCK)
> + return -EAGAIN;
> +
>   len = iov_iter_count(to);
>   if (!len)
>   return 0;
> @@ -3123,6 +3126,9 @@ cifs_strict_readv(struct kiocb *iocb, struct iov_iter 
> *to)
>   ((cifs_sb->mnt_cifs_flags & CIFS_MOUNT_NOPOSIXBRL) == 0))
>   return generic_file_read_iter(iocb, to);
>  
> + if (iocb->ki_rwflags & RWF_NONBLOCK)
> + return -EAGAIN;
> +
>   /*
>* We need to hold the sem to be sure nobody modifies lock list
>* with a brlock that prevents reading.
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index 2ab6f00..aa9046f 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -171,8 +171,11 @@ nfs_file_read(struct kiocb *iocb, struct iov_iter *to)
>   struct inode *inode = file_inode(iocb->ki_filp);
>   ssize_t result;
>  
> - if (iocb->ki_filp->f_flags & O_DIRECT)
> + if (iocb->ki_filp->f_flags & O_DIRECT) {
> + if (iocb->ki_rwflags & O_NONBLOCK)
> + return -EAGAIN;
>   return nfs_file_direct_read(iocb, to, iocb->ki_pos);
> + }
>  
>   dprintk("NFS: read(%pD2, %zu@%lu)\n",
>   iocb->ki_filp,
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index 324dc93..bb66ca4 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -2472,6 +2472,12 @@ static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
>   filp->f_path.dentry->d_name.name,
>   to->nr_segs);   /* GR */
>  
> + /*
> +  * No non-blocking reads for ocfs2 for now.  Might be doable with
> +  * non-blocking cluster lock helpers.
> +  */
> + if (iocb->ki_rwflags & RWF_NONBLOCK)
> + return -EAGAIN;
>  
>   if (!inode) {
>   ret = -EINVAL;
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 21981e5..212bf68 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -302,7 +302,8 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
>*/
>   if (ret)
>   break;
> - if (filp->f_flags & O_NONBLOCK) {
> + if ((filp->f_flags & O_NONBLOCK) ||
> + (iocb->ki_rwflags & RWF_NONBLOCK)) {
>   ret = -EAGAIN;
>   break;
>   }
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 907735c..cba7d4c 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -835,14 +835,19 @@ static ssize_t do_readv_writev(int type, struct file 
> *file,
>   file_start_write(file);
>   }
>  
> - if (iter_fn)
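> + if (iter_fn) {
>   ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
>   pos, iter_fn, flags);
> - else if (fnv)
> - ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
> - pos, fnv);
> - else
> - ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
> + } else {
> + if (type == READ && (flags & RWF_NONBLOCK))
> + return -EAGAIN;
> +
> + if (fnv)
> + ret = do_sync_readv_writev(file, iov, nr_segs, tot_len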

Re: [PATCH v5 7/7] fs: add a flag for per-operation O_DSYNC semantics

2014-11-10 Thread Sage Weil
On Wed, 5 Nov 2014, Milosz Tanski wrote:
 From: Christoph Hellwig h...@lst.de
 
 With the new read/write with flags syscalls we can support a flag
 to enable O_DSYNC semantics on a per-operation basis.  This ?s
 useful to implement protocols like SMB, NFS or SCSI that have such
 per-operation flags.
 
 Example program below:
 
 cat  pwritev2.c  EOF
 
 (off_t) val,  \
 (off_t) uint64_t) (val))  (sizeof (long) * 4))  (sizeof 
 (long) * 4))
 
 static ssize_t
 pwritev2(int fd, const struct iovec *iov, int iovcnt, off_t offset, int flags)
 {
 return syscall(__NR_pwritev2, fd, iov, iovcnt, LO_HI_LONG(offset),
flags);
 }
 
 int main(int argc, char **argv)
 {
   int fd = open(argv[1], O_WRONLY|O_CREAT|O_TRUNC, 0666);
   char buf[1024];
   struct iovec iov = { .iov_base = buf, .iov_len = 1024 };
   int ret;
 
 if (fd  0) {
 perror(open);
 return 0;
 }
 
   memset(buf, 0xfe, sizeof(buf));
 
   ret = pwritev2(fd, iov, 1, 0, RWF_DSYNC);
   if (ret  0)
   perror(pwritev2);
   else
   printf(ret = %d\n, ret);
 
   return 0;
 }
 EOF
 
 Signed-off-by: Christoph Hellwig h...@lst.de
 [mil...@adfin.com: added flag check to compat_do_readv_writev()]
 Signed-off-by: Milosz Tanski mil...@adfin.com

Ceph bits

Acked-by: Sage Weil s...@redhat.com

 ---
  fs/ceph/file.c |  4 +++-
  fs/fuse/file.c |  2 ++
  fs/nfs/file.c  | 10 ++
  fs/ocfs2/file.c|  6 --
  fs/read_write.c| 20 +++-
  include/linux/fs.h |  3 ++-
  mm/filemap.c   |  4 +++-
  7 files changed, 35 insertions(+), 14 deletions(-)
 
 diff --git a/fs/ceph/file.c b/fs/ceph/file.c
 index b798b5c..2d4e15a 100644
 --- a/fs/ceph/file.c
 +++ b/fs/ceph/file.c
 @@ -983,7 +983,9 @@ retry_snap:
   ceph_put_cap_refs(ci, got);
  
   if (written = 0 
 - ((file-f_flags  O_SYNC) || IS_SYNC(file-f_mapping-host) ||
 + ((file-f_flags  O_SYNC) ||
 +  IS_SYNC(file-f_mapping-host) ||
 +  (iocb-ki_rwflags  RWF_DSYNC) ||
ceph_osdmap_flag(osdc-osdmap, CEPH_OSDMAP_NEARFULL))) {
   err = vfs_fsync_range(file, pos, pos + written - 1, 1);
   if (err  0)
 diff --git a/fs/fuse/file.c b/fs/fuse/file.c
 index caa8d95..bb4fb23 100644
 --- a/fs/fuse/file.c
 +++ b/fs/fuse/file.c
 @@ -1248,6 +1248,8 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, 
 struct iov_iter *from)
   written += written_buffered;
   iocb-ki_pos = pos + written_buffered;
   } else {
 + if (iocb-ki_rwflags  RWF_DSYNC)
 + return -EINVAL;
   written = fuse_perform_write(file, mapping, from, pos);
   if (written = 0)
   iocb-ki_pos = pos + written;
 diff --git a/fs/nfs/file.c b/fs/nfs/file.c
 index aa9046f..c59b0b7 100644
 --- a/fs/nfs/file.c
 +++ b/fs/nfs/file.c
 @@ -652,13 +652,15 @@ static const struct vm_operations_struct 
 nfs_file_vm_ops = {
   .remap_pages = generic_file_remap_pages,
  };
  
 -static int nfs_need_sync_write(struct file *filp, struct inode *inode)
 +static int nfs_need_sync_write(struct kiocb *iocb, struct inode *inode)
  {
   struct nfs_open_context *ctx;
  
 - if (IS_SYNC(inode) || (filp-f_flags  O_DSYNC))
 + if (IS_SYNC(inode) ||
 + (iocb-ki_filp-f_flags  O_DSYNC) ||
 + (iocb-ki_rwflags  RWF_DSYNC))
   return 1;
 - ctx = nfs_file_open_context(filp);
 + ctx = nfs_file_open_context(iocb-ki_filp);
   if (test_bit(NFS_CONTEXT_ERROR_WRITE, ctx-flags) ||
   nfs_ctx_key_to_expire(ctx))
   return 1;
 @@ -705,7 +707,7 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct 
 iov_iter *from)
   written = result;
  
   /* Return error values for O_DSYNC and IS_SYNC() */
 - if (result = 0  nfs_need_sync_write(file, inode)) {
 + if (result = 0  nfs_need_sync_write(iocb, inode)) {
   int err = vfs_fsync(file, 0);
   if (err  0)
   result = err;
 diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
 index bb66ca4..8f9a86b 100644
 --- a/fs/ocfs2/file.c
 +++ b/fs/ocfs2/file.c
 @@ -2374,8 +2374,10 @@ out_dio:
   /* buffered aio wouldn't have proper lock coverage today */
   BUG_ON(ret == -EIOCBQUEUED && !(file->f_flags & O_DIRECT));
  
 - if (((file->f_flags & O_DSYNC) && !direct_io) || IS_SYNC(inode) ||
 - ((file->f_flags & O_DIRECT) && !direct_io)) {
 + if (((file->f_flags & O_DSYNC) && !direct_io) ||
 + IS_SYNC(inode) ||
 + ((file->f_flags & O_DIRECT) && !direct_io) ||
 + (iocb->ki_rwflags & RWF_DSYNC)) {
   ret = filemap_fdatawrite_range(file->f_mapping, *ppos,
  *ppos + count - 1);
   if (ret < 0)
 diff --git a/fs/read_write.c b/fs/read_write.c
 index cba7d4c..3443265 100644

[GIT PULL] Ceph fixes for 3.18

2014-11-03 Thread Sage Weil
Hi Linus,

Please pull the following fixes for RBD from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

There is a GFP flag fix from Mike Christie, an error code fix from Jan, 
and fixes for two unnecessary allocations (kmalloc and workqueue) from 
Ilya. All are well tested.

Ilya has one other fix on the way but it didn't get tested in time.

Thanks!
sage



Ilya Dryomov (2):
  rbd: use a single workqueue for all devices
  libceph: eliminate unnecessary allocation in process_one_ticket()

Jan Kara (1):
  rbd: Fix error recovery in rbd_obj_read_sync()

Mike Christie (1):
  libceph: use memalloc flags for net IO

 drivers/block/rbd.c  | 35 +++
 net/ceph/auth_x.c    | 25 ++---
 net/ceph/messenger.c | 10 +-
 3 files changed, 38 insertions(+), 32 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

