Re: linux-next: new contact(s) for the ceph tree?
Jeff Layton

Thanks, Stephen!
sage

On Sat, 9 May 2020, Stephen Rothwell wrote:
> Hi all,
>
> I noticed commit
>
>   3a5ccecd9af7 ("MAINTAINERS: remove myself as ceph co-maintainer")
>
> appear recently.  So who should I now list as the contact(s) for the
> ceph tree?
>
> --
> Cheers,
> Stephen Rothwell
Re: [RFC PATCH] ceph: initialize superblock s_time_gran to 1
On Thu, 27 Jun 2019, Jeff Layton wrote:
> On Thu, 2019-06-27 at 14:51 +0100, Luis Henriques wrote:
> > Having granularity set to 1us results in having inode timestamps with an
> > accuracy different from the fuse client (i.e. atime, ctime and mtime will
> > always end with '000').  This patch normalizes this behaviour and sets the
> > granularity to 1.
> >
> > Signed-off-by: Luis Henriques
> > ---
> >  fs/ceph/super.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > Hi!
> >
> > As far as I could see there are no other side-effects of changing
> > s_time_gran but I'm really not sure why it was initially set to 1000 in
> > the first place so I may be missing something.
> >
> > diff --git a/fs/ceph/super.c b/fs/ceph/super.c
> > index d57fa60dcd43..35dd75bc9cd0 100644
> > --- a/fs/ceph/super.c
> > +++ b/fs/ceph/super.c
> > @@ -980,7 +980,7 @@ static int ceph_set_super(struct super_block *s, void *data)
> >  	s->s_d_op = &ceph_dentry_ops;
> >  	s->s_export_op = &ceph_export_ops;
> >
> > -	s->s_time_gran = 1000;  /* 1000 ns == 1 us */
> > +	s->s_time_gran = 1;
> >
> >  	ret = set_anon_super(s, NULL);  /* what is that second arg for? */
> >  	if (ret != 0)
> >
>
> Looks like it was set that way since the client code was originally
> merged. Was this an earlier limitation of ceph that is no longer
> applicable?
>
> In any case, I see no need at all to keep this at 1000, so:

As long as the encoded on-write time value is at ns resolution, I agree!
No recollection of why I did this :(

Reviewed-by: Sage Weil
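
[Editor's illustration] The effect described in the patch (timestamps that always end in '000' when the granularity is 1000 ns) can be sketched with a small userspace helper. This is only a rough model of rounding a nanosecond field down to a multiple of the granularity; `trunc_nsec` is a hypothetical name used for illustration and is not the kernel's own function.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not a kernel API): round a nanosecond value
 * down to a multiple of gran, in the spirit of truncating a timestamp
 * to a superblock's s_time_gran.
 */
static long trunc_nsec(long nsec, long gran)
{
	if (gran <= 1)
		return nsec;		/* 1 ns granularity keeps every digit */
	return nsec - (nsec % gran);	/* drop the sub-granularity remainder */
}

/* Print the truncated value zero-padded to nine digits. */
static void show_nsec(long nsec, long gran)
{
	printf("gran=%ld: %09ld\n", gran, trunc_nsec(nsec, gran));
}
```

Under these assumptions, a nanosecond field of 123456789 becomes 123456000 with a granularity of 1000 and stays 123456789 with a granularity of 1, which matches the difference from the fuse client that the patch description mentions.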
[GIT PULL] Ceph fixes for -rc2
Hi Linus,

Please pull the following Ceph fixes from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

We have a few follow-up fixes for the libceph refactor from Ilya, and then
some cephfs + fscache fixes from Zheng.  The first two FS-Cache patches are
acked by David Howells and deemed trivial enough to go through our tree.
The rest fix some issues with the ceph fscache handling (disable cache for
inodes opened for write, and simplify the revalidation logic accordingly,
dropping the now-unnecessary work queue).

Thanks!
sage


Ilya Dryomov (3):
      libceph: change ceph_osdmap_flag() to take osdc
      libceph: put request only if it's done in handle_reply()
      libceph: use %s instead of %pE in dout()s

Yan, Zheng (7):
      FS-Cache: wake write waiter after invalidating writes
      FS-Cache: make check_consistency callback return int
      ceph: call __fscache_uncache_page() if readpages fails
      ceph: avoid unnecessary fscache invalidation/revlidation
      ceph: disable fscache when inode is opened for write
      ceph: improve fscache revalidation
      ceph: use i_version to check validity of fscache

 fs/cachefiles/interface.c       |   2 +-
 fs/ceph/addr.c                  |   6 +-
 fs/ceph/cache.c                 | 141 +---
 fs/ceph/cache.h                 |  44 -
 fs/ceph/caps.c                  |  23 +++
 fs/ceph/file.c                  |  27 ++--
 fs/ceph/super.h                 |   4 +-
 fs/fscache/page.c               |   2 +
 include/linux/ceph/osd_client.h |   5 ++
 include/linux/ceph/osdmap.h     |   5 --
 include/linux/fscache-cache.h   |   2 +-
 net/ceph/osd_client.c           |  51 +++
 net/ceph/osdmap.c               |   4 +-
 13 files changed, 138 insertions(+), 178 deletions(-)
Re: [GIT PULL] Ceph updates for 4.7-rc1
On Thu, 26 May 2016, Linus Torvalds wrote:
> On Thu, May 26, 2016 at 11:31 AM, Linus Torvalds wrote:
> >
> > Pulled and then immediately unpulled again.
>
> .. and having thought it over, I ended up re-pulling again, so now
> it's going through my build test.
>
> Consider this discussion a strong encouragement to *not* do this in
> the future - sending me pull requests at the end of the merge window
> without them having been in linux-next is a no-no, unless those pull
> requests are small and trivial (or have fixes that I'd pull even
> outside the merge window, of course).

Thank you!  We'll be sure we include things in -next well beforehand next
time around, especially if it's a big diff like this one.

One point of clarification, though: in the past I've squashed down fixes
discovered during testing if the branch hasn't hit a stable tree yet
(e.g., your tree).  AIUI this is (was?) standard procedure for things in
-next.  Do you want us to avoid squashing if we are creeping up on pull
request time, or are you primarily interested in, say, seeing that what
has been in -next for a while is substantially the same as what you pull,
and has perhaps been there unmodified for at least a few days?  Or would
you rather see fixup patches if we identify issues in the last few days of
testing?

Thanks-
sage
Re: [GIT PULL] Ceph updates for 4.7-rc1
On Thu, 26 May 2016, Linus Torvalds wrote:
> On Thu, May 26, 2016 at 11:18 AM, Sage Weil <sw...@redhat.com> wrote:
> >
> > Please pull the following Ceph updates from
>
> Why was that branch rebased yesterday?
>
> What has been in next, if anything?
>
> And if something has been in next, why was _that_ not sent to me?

The branch was assembled in its current form yesterday and is included in
today's -next:

  https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=e536030934aebf049fe6aaebc58dd37aeee21840

The same commit went through our internal testing last night, and we've
been testing the code for the better part of a week internally.

If you want it to bake longer in -next first, let us know.  We're not
causing merge conflicts, and there isn't -next-based ceph testing that I'm
aware of going on outside of our own QA environment, so I'm not sure how
valuable it is, but I'm happy to delay before sending a pull request if
that's what you want to see.

Thanks-
sage
[GIT PULL] Ceph updates for 4.7-rc1
Hi Linus,

Please pull the following Ceph updates from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

This changeset has a few main parts:

 * Ilya has finished a huge refactoring effort to sync up the client-side
   logic in libceph with the user-space client code, which has evolved
   significantly over the last couple years, with lots of additional
   behaviors (e.g., how requests are handled when cluster is full and
   transitions from full to non-full).  This structure of the code is more
   closely aligned with userspace now such that it will be much easier to
   maintain going forward when behavior changes take place.  There are
   some locking improvements bundled in as well.
 * Zheng adds multi-filesystem support (multiple namespaces within the
   same Ceph cluster)
 * Zheng has changed the readdir offsets and directory enumeration so that
   dentry offsets are hash-based and therefore stable across directory
   fragmentation events on the MDS.
 * Zheng has a smorgasbord of bug fixes across fs/ceph.

Thanks!
sage


Ilya Dryomov (40):
      rbd: get/put img_request in rbd_img_request_submit()
      libceph: make ceph_osdc_put_request() accept NULL
      libceph: grab snapc in ceph_osdc_alloc_request()
      libceph: move message allocation out of ceph_osdc_alloc_request()
      libceph: change how osd_op_reply message size is calculated
      libceph: variable-sized ceph_object_id
      rbd: use header_oid instead of header_name
      libceph: nuke unused fields and functions
      libceph: open-code remove_{all,old}_osds()
      libceph: DEFINE_RB_FUNCS macro
      libceph: fix ceph_eversion encoding
      libceph: rename ceph_oloc_oid_to_pg()
      libceph: ceph_osds, ceph_pg_to_up_acting_osds()
      libceph: rename ceph_calc_pg_primary()
      libceph: make pgid_cmp() global
      libceph: pi->min_size, pi->last_force_request_resend
      libceph: introduce ceph_osd_request_target, calc_target()
      libceph: switch to calc_target(), part 1
      libceph: switch to calc_target(), part 2
      libceph: drop msg argument from ceph_osdc_callback_t
      libceph: redo callbacks and factor out MOSDOpReply decoding
      libceph: move schedule_delayed_work() in ceph_osdc_init()
      libceph: schedule tick from ceph_osdc_init()
      libceph: allocate dummy osdmap in ceph_osdc_init()
      libceph: handle_one_map()
      libceph: osd_init() and osd_cleanup()
      libceph: allocate ceph_osd with GFP_NOFAIL
      libceph: protect osdc->osd_lru list with a spinlock
      libceph: a major OSD client update
      libceph: request_init() and request_release_checks()
      libceph: wait_request_timeout()
      rbd: rbd_dev_header_unwatch_sync() variant
      libceph, rbd: ceph_osd_linger_request, watch/notify v2
      libceph: support for sending notifies
      libceph: support for checking on status of watch
      libceph: async MON client generic requests
      libceph: pool deletion detection
      libceph: take osdc->lock in osdmap_show() and dump flags in hex
      libceph: replace ceph_monc_request_next_osdmap()
      libceph: support for subscribing to "mdsmap.<id>" maps

Yan, Zheng (30):
      ceph: multiple filesystem support
      ceph: CEPH_FEATURE_MDSENC support
      ceph: renew caps for read/write if mds session got killed.
      ceph: don't call truncate_pagecache in ceph_writepages_start
      ceph: don't show symlink target in debugfs/mdsc
      ceph: report mount root in session metadata
      ceph: use CEPH_MDS_OP_RMXATTR request to remove xattr
      ceph: search cache postion for dcache readdir
      ceph: remove unnecessary checks in __dcache_readdir
      ceph: simplify 'offset in frag'
      ceph: define struct for dir entry in readdir reply
      ceph: define 'end/complete' in readdir reply as bit flags
      ceph: record 'offset' for each entry of readdir result
      ceph: don't forbid marking directory complete after forward seek
      ceph: using hash value to compose dentry offset
      ceph: fix inode reference leak
      ceph: don't assume frag tree splits in mds reply are sorted
      ceph: fix dir_auth check in ceph_fill_dirfrag()
      ceph: keep leaf frag when updating fragtree
      ceph: improve fragtree change detection
      ceph: tolerate bad i_size for symlink inode
      ceph: block non-fatal signals for fault/page_mkwrite
      ceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEM
      ceph: handle -EAGAIN returned by ceph_update_writeable_page()
      libceph: make ceph_osdc_wait_request() uninterruptible
      ceph: make ceph_update_writeable_page() uninterruptible
      ceph: handle interrupted ceph_writepage()
      ceph: SetPageError() for writeback pages if writepages fails
      ceph: don't use truncate_pagecache() to invalidate read cache
      ceph: fix wake_up_session_cb()

Zhang Zhuoyu (1):
      ceph: make logical calculation functions return bool
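
[Editor's illustration] The point about hash-based readdir offsets being stable across directory fragmentation can be sketched as follows. This is a toy model under stated assumptions, not the layout used in fs/ceph: `toy_make_pos`, `toy_frag_pos` and `TOY_OFFSET_BITS` are hypothetical names, and the bit split is chosen only for illustration.

```c
#include <stdint.h>

/* Assumption for this sketch: the low bits hold a small ordinal. */
#define TOY_OFFSET_BITS 28
#define TOY_OFFSET_MASK ((1u << TOY_OFFSET_BITS) - 1)

/*
 * Hypothetical hash-ordered position: the high part is the hash of the
 * entry name, the low part is the ordinal among names sharing that hash.
 * Nothing here depends on which directory fragment holds the entry.
 */
static uint64_t toy_make_pos(uint32_t name_hash, uint32_t ordinal)
{
	return ((uint64_t)name_hash << TOY_OFFSET_BITS) |
	       (ordinal & TOY_OFFSET_MASK);
}

/*
 * Hypothetical fragment-relative position, for contrast: it changes
 * whenever a split or merge moves the entry to another fragment or
 * shifts its index within the fragment.
 */
static uint64_t toy_frag_pos(uint32_t frag_id, uint32_t index_in_frag)
{
	return ((uint64_t)frag_id << TOY_OFFSET_BITS) |
	       (index_in_frag & TOY_OFFSET_MASK);
}
```

In this model a given name always maps to the same position, and positions sort by hash, so a reader that resumes enumeration after a fragment split or merge would continue from the same logical place. The fragment-relative variant gives the same entry a different position once its fragment or index changes.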
[GIT PULL] Ceph fixes for -rc6
Hi Linus,

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

There is a lifecycle fix in the auth code, a fix for a narrow race
condition on map, and a helpful message in the log when there is a feature
mismatch (which happens frequently now that the default server-side
options have changed).

Thanks!
sage


Ilya Dryomov (3):
      libceph: make authorizer destruction independent of ceph_auth_client
      rbd: fix rbd map vs notify races
      rbd: report unsupported features to syslog

 drivers/block/rbd.c             | 52 +++---
 fs/ceph/mds_client.c            |  6 ++--
 include/linux/ceph/auth.h       | 10 +++---
 include/linux/ceph/osd_client.h |  1 -
 net/ceph/auth.c                 |  8 ++---
 net/ceph/auth_none.c            | 71 ++---
 net/ceph/auth_none.h            |  3 +-
 net/ceph/auth_x.c               | 21 ++--
 net/ceph/auth_x.h               |  1 +
 net/ceph/osd_client.c           |  6 ++--
 10 files changed, 87 insertions(+), 92 deletions(-)
[GIT PULL] Ceph fix for -rc3
[This time with correct To: line :)]

Hi Linus,

Please pull the following Ceph RBD patch from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

This just fixes a few remaining memory allocations in RBD to use GFP_NOIO
instead of GFP_ATOMIC.

Thanks!
sage


David Disseldorp (1):
      rbd: use GFP_NOIO consistently for request allocations

 drivers/block/rbd.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
[GIT PULL] Ceph updates for 4.6-rc1
Hi Linus,

Please pull the following Ceph updates from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

There is quite a bit here, including some overdue refactoring and cleanup
on the mon_client and osd_client code from Ilya, scattered writeback
support for CephFS and a pile of bug fixes from Zheng, and a few random
cleanups and fixes from others.

This series is based on a recent merge of Al's tree to avoid conflicts
with his splice_dentry changes.

Thanks!
sage


Anton Protopopov (1):
      ceph: fix a wrong comparison

Deepa Dinamani (1):
      ceph: replace CURRENT_TIME by current_fs_time()

Geliang Tang (3):
      rbd: use KMEM_CACHE macro
      ceph: use kmem_cache_zalloc
      libceph: use KMEM_CACHE macro

Ilya Dryomov (15):
      libceph: move debugfs initialization into __ceph_open_session()
      libceph: decouple hunting and subs management
      libceph: revamp subs code, switch to SUBSCRIBE2 protocol
      libceph: pick a different monitor when reconnecting
      libceph: monc ping rate is 10s
      libceph: monc hunt rate is 3s with backoff up to 30s
      libceph: introduce and switch to reopen_session()
      libceph: reschedule tick in mon_fault()
      libceph: behave in mon_fault() if cur_mon < 0
      libceph: rename ceph_osd_req_op::payload_len to indata_len
      libceph: make r_request msg_size calculation clearer
      libceph: osdc->req_mempool should be backed by a slab pool
      libceph: enable large, variable-sized OSD requests
      ceph: kill ceph_empty_snapc
      libceph: use sizeof_footer() more

Yan, Zheng (14):
      ceph: encode ctime in cap message
      ceph: don't enable rbytes mount option by default
      ceph: remove useless BUG_ON
      libceph: move r_reply_op_{len,result} into struct ceph_osd_req_op
      libceph: add helper that duplicates last extent operation
      ceph: scattered page writeback
      ceph: fix race during filling readdir cache
      ceph: avoid updating directory inode's i_size accidentally
      ceph: remove unnecessary NULL check
      ceph: fix mounting same fs multiple times
      ceph: don't request vxattrs from MDS
      ceph: fix security xattr deadlock
      ceph: kill ceph_get_dentry_parent_inode()
      ceph: use lookup request to revalidate dentry

 drivers/block/rbd.c                |  14 +-
 fs/ceph/addr.c                     | 324 --
 fs/ceph/caps.c                     |  11 +-
 fs/ceph/dir.c                      |  69 --
 fs/ceph/export.c                   |  13 ++
 fs/ceph/file.c                     |  15 +-
 fs/ceph/inode.c                    |  34 ++-
 fs/ceph/mds_client.c               |   7 +-
 fs/ceph/snap.c                     |  16 --
 fs/ceph/super.c                    |  47 ++--
 fs/ceph/super.h                    |  23 +-
 fs/ceph/xattr.c                    |  78 ++-
 include/linux/ceph/ceph_features.h |   2 +
 include/linux/ceph/ceph_fs.h       |   7 +-
 include/linux/ceph/libceph.h       |   8 +-
 include/linux/ceph/mon_client.h    |  31 ++-
 include/linux/ceph/osd_client.h    |  15 +-
 net/ceph/ceph_common.c             |   4 +-
 net/ceph/debugfs.c                 |  17 +-
 net/ceph/messenger.c               |  29 +--
 net/ceph/mon_client.c              | 457 -
 net/ceph/osd_client.c              | 109 ++---
 22 files changed, 811 insertions(+), 519 deletions(-)
Re: [ceph] what's going on with d_rehash() in splice_dentry()?
On Mon, 7 Mar 2016, Al Viro wrote:
> On Wed, Mar 02, 2016 at 11:00:01AM +0800, Yan, Zheng wrote:
>
> > > This code dates back to when Ceph was originally upstreamed, so the
> > > history is murky, but I expect at that point I wanted to avoid hashing in
> > > the no-lease case.  But I don't think it matters.  We should just remove
> > > the prehash argument from splice_dentry entirely.
> > >
> > > Zheng, does that sound right?
> >
> > Yes. I think we can remove the d_rehash(dn) call and rehash parameter.
>
> Another question in the same general area:
>                 /* null dentry? */
>                 if (!rinfo->head->is_target) {
>                         dout("fill_trace null dentry\n");
>                         if (d_really_is_positive(dn)) {
>                                 ceph_dir_clear_ordered(dir);
>                                 dout("d_delete %p\n", dn);
>                                 d_delete(dn);
>                         } else {
>                                 dout("d_instantiate %p NULL\n", dn);
>                                 d_instantiate(dn, NULL);
>                                 if (have_lease && d_unhashed(dn))
>                                         d_rehash(dn);
>                                 update_dentry_lease(dn, rinfo->dlease,
>                                                     session,
>                                                     req->r_request_started);
>                         }
>                         goto done;
>                 }
> What's that d_instantiate() about?  We have just checked that it's
> negative; what's the point of setting ->d_inode to NULL again?  Would it
> be OK if we just do
>                         } else {
>                                 if (have_lease && d_unhashed(dn))
>                                         d_add(dn, NULL);
>                                 update_dentry_lease(dn, rinfo->dlease,
>                                                     session,
>                                                     req->r_request_started);
>                         }
> in there?

That looks okay, but changing d_rehash to d_add still means you're doing
the d_instantiate(dn, NULL) in the d_unhashed case; is there a reason you
changed that line?  Is the dentry_rcuwalk_invalidate in __d_instantiate
important before rehashing?

> As an aside, tracking back to the originating fs method is
> painful as hell ;-/  I _think_ that rehash can be hit during ->lookup()
> returning a negative, but I wouldn't bet a dime on it not happening from
> other methods...  AFAICS, the change should be OK regardless of what
> it's been called from, but... _ouch_.  Is it documented anywhere public?

It is a pain to follow, yes.  FWIW this whole block is predicated on
req->r_locked_dir being non-NULL (i.e., VFS holds dir->i_mutex), which is
only true for lookup, create operations (mkdir/mknod/symlink/etc.),
atomic_open, and the .get_name export op.

There's not much documentation beyond a description of the meaning of
fields (e.g. r_locked_dir) in fs/ceph/mds_client.h ...

sage
Re: [ceph] what's going on with d_rehash() in splice_dentry()?
On Mon, 7 Mar 2016, Al Viro wrote: > On Wed, Mar 02, 2016 at 11:00:01AM +0800, Yan, Zheng wrote: > > > > This code dates back to when Ceph was originally upstreamed, so the > > > history is murky, but I expect at that point I wanted to avoid hashing in > > > the no-lease case. But I don't think it matters. We should just remove > > > the prehash argument from splice_dentry entirely. > > > > > > Zheng, does that sound right? > > > > Yes. I think we can remove the d_rehash(dn) call and rehash parameter. > > Another question in the same general area: > /* null dentry? */ > if (!rinfo->head->is_target) { > dout("fill_trace null dentry\n"); > if (d_really_is_positive(dn)) { > ceph_dir_clear_ordered(dir); > dout("d_delete %p\n", dn); > d_delete(dn); > } else { > dout("d_instantiate %p NULL\n", dn); > d_instantiate(dn, NULL); > if (have_lease && d_unhashed(dn)) > d_rehash(dn); > update_dentry_lease(dn, rinfo->dlease, > session, > req->r_request_started); > } > goto done; > } > What's that d_instantiate() about? We have just checked that it's > negative; what's the point of setting ->d_inode to NULL again? Would it > be OK if we just do > } else { > if (have_lease && d_unhashed(dn)) > d_add(dn, NULL); > update_dentry_lease(dn, rinfo->dlease, > session, > req->r_request_started); > } > in there? That looks okay, but changing d_rehash to d_add still means you're doing te d_instantiate(dn, NULL) in the d_unhashed case; is there a reason you changed that line? Is the dentry_rcuwalk_invalidate in __d_instantiate is important before rehashing? > As an aside, tracking back to the originating fs method is > painful as hell ;-/ I _think_ that rehash can be hit during ->lookup() > returning a negative, but I wouldn't bet a dime on it not happening from > other methods... AFAICS, the change should be OK regardless of what > it's been called from, but... _ouch_. Is is documented anywhere public? It is a pain to follow, yes. 
FWIW this whole block is predicated on req->r_locked_dir being non-NULL (i.e., VFS holds dir->i_mutex), which is only true for lookup, create operations (mkdir/mknod/symlink/etc.), atomic_open, and the .get_name export op. There's not much documentation beyond a description of the meaning of fields (e.g. r_locked_dir) in fs/ceph/mds_client.h ... sage
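To make the d_rehash() vs d_add() question above easier to follow, here is a minimal userspace sketch in C. It is not kernel code: the toy_* types and helpers are hypothetical stand-ins, and the model assumes d_add() behaves as d_instantiate() followed by d_rehash(). It tracks only ->d_inode and the hashed state, so it says nothing about the dentry_rcuwalk_invalidate question raised in the reply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel types; NOT the real struct dentry/inode. */
struct toy_inode { int ino; };
struct toy_dentry {
    struct toy_inode *inode;   /* NULL means the dentry is negative */
    bool hashed;
};

static void toy_d_instantiate(struct toy_dentry *dn, struct toy_inode *in)
{
    dn->inode = in;
}

static void toy_d_rehash(struct toy_dentry *dn)
{
    dn->hashed = true;
}

/* Assumption: d_add() == d_instantiate() + d_rehash(). */
static void toy_d_add(struct toy_dentry *dn, struct toy_inode *in)
{
    toy_d_instantiate(dn, in);
    toy_d_rehash(dn);
}

/* Existing branch: always instantiate as negative, rehash only with a lease. */
static void null_dentry_old(struct toy_dentry *dn, bool have_lease)
{
    assert(dn->inode == NULL);          /* caller checked it is negative */
    toy_d_instantiate(dn, NULL);
    if (have_lease && !dn->hashed)
        toy_d_rehash(dn);
}

/* Proposed branch: only touch the dentry when it needs hashing. */
static void null_dentry_new(struct toy_dentry *dn, bool have_lease)
{
    assert(dn->inode == NULL);
    if (have_lease && !dn->hashed)
        toy_d_add(dn, NULL);
}

/* Returns true if both variants leave a negative dentry in the same state. */
bool variants_agree(bool start_hashed, bool have_lease)
{
    struct toy_dentry a = { NULL, start_hashed };
    struct toy_dentry b = { NULL, start_hashed };

    null_dentry_old(&a, have_lease);
    null_dentry_new(&b, have_lease);
    return a.inode == b.inode && a.hashed == b.hashed;
}
```

In this model the two variants agree for every combination of start_hashed and have_lease, which matches the "AFAICS, the change should be OK" reading. Any difference would have to come from side effects the model leaves out.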
[GIT PULL] Ceph fixes for -rc7
Hi Linus, Please pull the following Ceph patch from git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus This is a final commit we missed to align the protocol compatibility with the feature bits. It decodes a few extra fields in two different messages and reports EIO when they are used (not yet supported). Thanks! sage Yan, Zheng (1): ceph: initial CEPH_FEATURE_FS_FILE_LAYOUT_V2 support fs/ceph/addr.c | 4 fs/ceph/caps.c | 27 --- fs/ceph/inode.c| 2 ++ fs/ceph/mds_client.c | 16 fs/ceph/mds_client.h | 1 + fs/ceph/super.h| 1 + include/linux/ceph/ceph_features.h | 1 + 7 files changed, 49 insertions(+), 3 deletions(-)
Re: [ceph] what's going on with d_rehash() in splice_dentry()?
Hi Al, On Fri, 26 Feb 2016, Al Viro wrote: > You have, modulo printks and BUG_ON(), > { > struct dentry *realdn; > /* dn must be unhashed */ > if (!d_unhashed(dn)) > d_drop(dn); > realdn = d_splice_alias(in, dn); > if (IS_ERR(realdn)) { > if (prehash) > *prehash = false; /* don't rehash on error */ > dn = realdn; /* note realdn contains the error */ > goto out; > } else if (realdn) { > dput(dn); > dn = realdn; > } > if ((!prehash || *prehash) && d_unhashed(dn)) > d_rehash(dn); > > When d_splice_alias() returns NULL it has hashed the dentry you'd given it; > when it returns a different dentry, that dentry is also returned hashed. > IOW, d_rehash(dn) in there should never be called. > > If you have a case when it _is_ called, you've found a bug somewhere and > I'd like to see details. AFAICS, the whole prehash thing appears to be > pointless - even the place where we modify *prehash, since in that case > we return ERR_PTR() and the only caller passing non-NULL prehash (&have_lease) > buggers off on such return value past all code that would look at have_lease > value. Right. > One possible reading is that you want to prevent hashing in !have_lease > case of > dn = splice_dentry(dn, in, &have_lease); > If that's the case, you might have a problem, since it will be hashed no > matter what... In this case it doesn't actually matter if it is hashed or not, since we will look at the lease state on the dentry before trusting it... This code dates back to when Ceph was originally upstreamed, so the history is murky, but I expect at that point I wanted to avoid hashing in the no-lease case. But I don't think it matters. We should just remove the prehash argument from splice_dentry entirely. Zheng, does that sound right? Thanks! sage
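The argument that the d_rehash() call is dead code can be sketched with a small userspace model in C. Everything named toy_* is hypothetical. The model simply encodes the behaviour described in the quoted text (d_splice_alias() either hashes the dentry it was given and returns NULL, or returns a different, already hashed dentry) and leaves out the error path and reference counting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy dentry: only the hashed state matters for this argument. */
struct toy_dentry { bool hashed; };

/* Counts how often the rehash branch is actually taken. */
int rehash_calls;

/*
 * Toy d_splice_alias(): if an alias exists it is returned hashed,
 * otherwise dn itself is hashed and NULL is returned.
 * ERR_PTR() returns are not modelled.
 */
struct toy_dentry *toy_d_splice_alias(struct toy_dentry *alias,
                                      struct toy_dentry *dn)
{
    if (alias) {
        alias->hashed = true;
        return alias;
    }
    dn->hashed = true;
    return NULL;
}

/* Mirrors the shape of the quoted function, minus errors and dput(). */
struct toy_dentry *toy_splice_dentry(struct toy_dentry *dn,
                                     struct toy_dentry *alias,
                                     bool *prehash)
{
    struct toy_dentry *realdn;

    assert(dn != NULL);
    dn->hashed = false;                     /* d_drop(): dn must be unhashed */
    realdn = toy_d_splice_alias(alias, dn);
    if (realdn)
        dn = realdn;
    if ((!prehash || *prehash) && !dn->hashed) {
        rehash_calls++;                     /* never reached in this model */
        dn->hashed = true;
    }
    return dn;
}
```

Whichever dentry comes back is already hashed, so rehash_calls stays at zero regardless of what prehash points to. That is why dropping the argument looks safe, at least under the assumptions of this model.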
[GIT PULL] Ceph fixes for -rc6
Hi Linus, Please pull the following Ceph fixes for -rc6 from git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus There are two small messenger bug fixes and a log spam regression fix. Thanks! sage Ilya Dryomov (3): libceph: don't bail early from try_read() when skipping a message libceph: use the right footer size when skipping a message libceph: don't spam dmesg with stray reply warnings net/ceph/messenger.c | 15 +++ net/ceph/osd_client.c | 4 ++-- 2 files changed, 13 insertions(+), 6 deletions(-)
[GIT PULL] Ceph fixes for -rc3
Hi Linus, Please pull the following Ceph fixes from git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus We have a few wire protocol compatibility fixes, ports of a few recent CRUSH mapping changes, and a couple error path fixes. Thanks! sage Dan Carpenter (1): ceph: checking for IS_ERR instead of NULL Ilya Dryomov (6): crush: ensure bucket id is valid before indexing buckets array crush: ensure take bucket value is valid crush: add chooseleaf_stable tunable crush: decode and initialize chooseleaf_stable libceph: advertise support for TUNABLES5 libceph: MOSDOpReply v7 encoding Yan, Zheng (1): ceph: fix snap context leak in error path fs/ceph/file.c | 6 +++--- include/linux/ceph/ceph_features.h | 16 +++- include/linux/crush/crush.h| 8 +++- net/ceph/crush/mapper.c| 33 ++--- net/ceph/osd_client.c | 10 ++ net/ceph/osdmap.c | 19 ++- 6 files changed, 75 insertions(+), 17 deletions(-)
[GIT PULL] Ceph updates for -rc1
Hi Linus, Please pull the following Ceph updates for 4.5-rc1 from git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus The two main changes are aio support in CephFS, and a series that fixes several issues in the authentication key timeout/renewal code. On top of that are a variety of cleanups and minor bug fixes. Thanks! sage Geliang Tang (2): libceph: use list_next_entry instead of list_entry_next libceph: use list_for_each_entry_safe Ilya Dryomov (6): libceph: fix ceph_msg_revoke() libceph: clear messenger auth_retry flag if we fault libceph: fix authorizer invalidation, take 2 libceph: invalidate AUTH in addition to a service ticket libceph: kill off ceph_x_ticket_handler::validity libceph: remove outdated comment Markus Elfring (1): rbd: delete an unnecessary check before rbd_dev_destroy() Minfei Huang (1): ceph: Avoid to propagate the invalid page point Yan, Zheng (4): ceph: fix double page_unlock() in page_mkwrite() ceph: Asynchronous IO support ceph: re-send AIO write request when getting -EOLDSNAP error ceph: use i_size_{read,write} to get/set i_size Yaowei Bai (2): ceph: remove unused functions in ceph_frag.h ceph: ceph_frag_contains_value can be boolean drivers/block/rbd.c| 3 +- fs/ceph/addr.c | 14 +- fs/ceph/cache.c| 8 +- fs/ceph/file.c | 509 ++--- fs/ceph/inode.c| 8 +- include/linux/ceph/ceph_frag.h | 37 +-- include/linux/ceph/messenger.h | 2 +- net/ceph/auth_x.c | 49 +++- net/ceph/auth_x.h | 2 +- net/ceph/messenger.c | 105 ++--- net/ceph/mon_client.c | 4 - 11 files changed, 501 insertions(+), 240 deletions(-)
[GIT PULL] Ceph update for -rc4
Hi Linus, Please pull the following fix from git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus This addresses a refcounting bug that leads to a use-after-free. Thanks! sage Ilya Dryomov (1): rbd: don't put snap_context twice in rbd_queue_workfn() drivers/block/rbd.c | 1 + 1 file changed, 1 insertion(+) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[GIT PULL] Ceph changes for -rc1
Hi Linus, Please pull the following Ceph updates from git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus There are several patches from Ilya fixing RBD allocation lifecycle issues, a series adding a nocephx_sign_messages option (and associated bug fixes/cleanups), several patches from Zheng improving the (directory) fsync behavior, a big improvement in IO for direct-io requests when striping is enabled from Caifeng, and several other small fixes and cleanups. Thanks! sage Arnd Bergmann (1): ceph: fix message length computation Geliang Tang (1): ceph: fix a comment typo Ilya Dryomov (10): rbd: return -ENOMEM instead of pool id if rbd_dev_create() fails rbd: don't free rbd_dev outside of the release callback rbd: set device_type::release instead of device::release rbd: remove duplicate calls to rbd_dev_mapping_clear() libceph: introduce ceph_x_authorizer_cleanup() libceph: msg signing callouts don't need con argument libceph: drop authorizer check from cephx msg signing routines libceph: stop duplicating client fields in messenger libceph: add nocephx_sign_messages option libceph: clear msg->con in ceph_msg_release() only Ioana Ciornei (1): libceph: evaluate osd_req_op_data() arguments only once Julia Lawall (1): rbd: drop null test before destroy functions Shraddha Barke (2): libceph: remove con argument in handle_reply() libceph: use local variable cursor instead of &msg->cursor Yan, Zheng (3): ceph: don't invalidate page cache when inode is no longer used ceph: add request to i_unsafe_dirops when getting unsafe reply ceph: make fsync() wait unsafe requests that created/modified inode Zhu, Caifeng (1): ceph: combine as many iovec as possile into one OSD request drivers/block/rbd.c| 109 - fs/ceph/cache.c| 2 +- fs/ceph/caps.c | 76 ++-- fs/ceph/file.c | 87 fs/ceph/inode.c| 1 + fs/ceph/mds_client.c | 57 +++-- fs/ceph/mds_client.h | 3 ++ fs/ceph/super.h| 1 + include/linux/ceph/libceph.h | 4 +- include/linux/ceph/messenger.h | 16 ++ net/ceph/auth_x.c 
| 36 +- net/ceph/ceph_common.c | 18 +-- net/ceph/crypto.h | 4 +- net/ceph/messenger.c | 88 ++--- net/ceph/osd_client.c | 34 ++--- 15 files changed, 314 insertions(+), 222 deletions(-)
[GIT PULL] Ceph fix for 4.3
Hi Linus, Please pull the following RBD fix from git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus This sets the stable pages flag on the RBD block device when we have CRCs enabled. (This is necessary since the default assumption for block devices changed in 3.9.) Thanks! sage Ronny Hegewald (1): rbd: require stable pages if message data CRCs are enabled drivers/block/rbd.c | 3 +++ 1 file changed, 3 insertions(+)
[GIT PULL] Ceph updates for -rc7
Hi Linus, Please pull the following two patches from git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus One is a stopgap to prevent a stack blowout when users have a deep chain of image clones. (We'll rewrite this code to be non-recursive for the next window, but in the meantime this is a simple fix that avoids a crash.) The second fixes a refcount underflow. Thanks! sage Ilya Dryomov (2): rbd: don't leak parent_spec in rbd_dev_probe_parent() rbd: prevent kernel stack blow up on rbd map drivers/block/rbd.c | 69 ++--- 1 file changed, 39 insertions(+), 30 deletions(-)
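The stopgap described above, keeping the recursion but refusing to follow an overly deep clone chain, can be illustrated with a small userspace sketch in C. This is not the rbd code: the toy_* names, the cap of 16 and the -EINVAL return value are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_MAX_PARENT_CHAIN_LEN 16    /* assumed cap, for illustration only */

/* Toy image: each clone points at its parent, a base image has none. */
struct toy_image {
    struct toy_image *parent;
};

/*
 * Walks the clone chain recursively, but gives up once more than
 * TOY_MAX_PARENT_CHAIN_LEN parents have been followed, so stack usage
 * stays bounded. Returns 0 on success, -EINVAL if the chain is too deep.
 */
int toy_probe_parent(const struct toy_image *img, int depth)
{
    assert(img != NULL);
    if (!img->parent)
        return 0;                       /* reached the base image */
    if (++depth > TOY_MAX_PARENT_CHAIN_LEN)
        return -EINVAL;                 /* chain too long, bail out */
    return toy_probe_parent(img->parent, depth);
}

/*
 * Links imgs[0..n_parents] into a chain with n_parents ancestors and
 * probes it from the newest clone. imgs must hold n_parents + 1 entries.
 */
int toy_probe_chain(struct toy_image *imgs, int n_parents)
{
    int i;

    for (i = 0; i < n_parents; i++)
        imgs[i].parent = &imgs[i + 1];
    imgs[n_parents].parent = NULL;
    return toy_probe_parent(&imgs[0], 0);
}
```

A chain with up to 16 parents probes successfully in this sketch, and a 17th parent makes the probe fail instead of recursing further. A non-recursive rewrite, as promised for the next window, would replace the recursion with a loop over the parent pointers.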
[GIT PULL] Ceph fixes for -rc6
Hi Linus, The following changes since commit 25cb62b76430a91cc6195f902e61c2cb84ade622: Linux 4.3-rc5 (2015-10-11 11:09:45 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus for you to fetch changes up to e30b7577bf1d338ca8a273bd2f881de5a41572b7: rbd: use writefull op for object size writes (2015-10-16 16:49:01 +0200) Just two small items from Ilya: The first patch fixes the RBD readahead to grab full objects. The second fixes the write ops to prevent undue promotion when a cache tier is configured on the server side. Thanks! sage Ilya Dryomov (2): rbd: set max_sectors explicitly rbd: use writefull op for object size writes drivers/block/rbd.c | 10 -- net/ceph/osd_client.c | 13 + 2 files changed, 17 insertions(+), 6 deletions(-)
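As a rough illustration of the "writefull op for object size writes" idea, the sketch below picks the operation from the offset and length of a write within one object. The enum and function names are hypothetical and are not the libceph API. The reasoning that a whole-object write does not need the old object contents, so a cache tier presumably has nothing to promote, is an inference from the summary above rather than something the message spells out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical op codes, standing in for the real OSD opcodes. */
enum toy_osd_op { TOY_OP_WRITE, TOY_OP_WRITEFULL };

/*
 * Chooses the write op for a request that lies within a single object.
 * A write that starts at offset 0 and covers the whole object replaces
 * it entirely, so it can be sent as a "write full"; anything else is a
 * partial write.
 */
enum toy_osd_op toy_pick_write_op(uint64_t offset, uint64_t length,
                                  uint64_t object_size)
{
    assert(object_size > 0);
    assert(offset < object_size && length <= object_size - offset);

    if (offset == 0 && length == object_size)
        return TOY_OP_WRITEFULL;
    return TOY_OP_WRITE;
}
```

With a 4 MiB object size, a 4 MiB write at offset 0 maps to TOY_OP_WRITEFULL, while a 4 KiB write at offset 0 or any write starting at a non-zero offset stays a plain TOY_OP_WRITE.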
Re: Linux Foundation Technical Advisory Board Elections and Nomination process
On Tue, 13 Oct 2015, Grant Likely wrote: > On 11 Oct 2015 05:20, "Ric Wheeler" wrote: > > > > I would like to nominate Sage Weil with his consent. > > > > Sage has led the ceph project since its inception, contributed to the > kernel as well as had an influence on projects like openstack. > > Sage, what say you? Do you accept your nomination? I do! Thanks- sage > > g. > > > > > thanks! > > > > Ric > > > > > > > > > > On 10/06/2015 01:06 PM, Grant Likely wrote: > >> > >> [Resending because I messed up the first one] > >> > >> The elections for five of the ten members of the Linux Foundation > >> Technical Advisory Board (TAB) are held every year[1]. This year the > >> election will be at the 2015 Kernel Summit in Seoul, South Korea > >> (probably on the Monday, 26 October) and will be open to all attendees > >> of both Kernel Summit and Korea Linux Forum. > >> > >> Anyone is eligible to stand for election, simply send your nomination to: > >> > >> tech-board-disc...@lists.linux-foundation.org > >> > >> We currently have 3 nominees for five places: > >> Thomas Gleixner > >> Greg Kroah-Hartman > >> Stephen Hemminger > >> > >> The deadline for receiving nominations is up until the beginning of > >> the event where the election is held. Although, please remember if > >> you're not going to be present that things go wrong with both networks > >> and mailing lists, so get your nomination in early). > >> > >> Grant Likely, TAB Chair > >> > >> [1] TAB members sit for a term of 2 years, and half of the board is up > >> for election every year. Five of the seats are up for election now. > >> The other five are half way through their term and will be up for > >> election next year. 
The history of the TAB elections can be found > >> here: > >> > >>https://docs.google.com/spreadsheets/d/1jGLQtul0taSRq_opYzJFALI7_34cS4RMS1_ > YQoTNCKA/edit#gid=0 > >> -- > >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" > in > >> the body of a message to majord...@vger.kernel.org > >> More majordomo info at http://vger.kernel.org/majordomo-info.html > >> Please read the FAQ at http://www.tux.org/lkml/ > > > > > > >
[GIT PULL] Ceph fixes for -rc2
Hi Linus, The following changes since commit 6ff33f3902c3b1c5d0db6b1e2c70b6d76fba357f: Linux 4.3-rc1 (2015-09-12 16:35:56 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus for you to fetch changes up to 335c25858218e76ef47f92ecb9d22e919d36140d: libceph: advertise support for keepalive2 (2015-09-17 20:14:27 +0300) These are both fixes to the new and improved keepalive2 behavior. Thanks! sage Ilya Dryomov (2): libceph: don't access invalid memory in keepalive2 path libceph: advertise support for keepalive2 include/linux/ceph/ceph_features.h | 1 + include/linux/ceph/messenger.h | 4 +++- net/ceph/messenger.c | 9 + 3 files changed, 9 insertions(+), 5 deletions(-)
[GIT PULL] Ceph changes for 4.3-rc1
Hi Linus, Please pull the following Ceph updates from git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus There are a few fixes for snapshot behavior with CephFS and support for the new keepalive protocol from Zheng, a libceph fix that affects both RBD and CephFS, a few bug fixes and cleanups for RBD from Ilya, and several small fixes and cleanups from Jianpeng and others. Thanks! sage Benoît Canet (1): libceph: Avoid holding the zero page on ceph_msgr_slab_init errors Brad Hubbard (1): ceph: remove redundant test of head->safe and silence static analysis warnings Ilya Dryomov (4): libceph: rename con_work() to ceph_con_workfn() rbd: fix double free on rbd_dev->header_name rbd: plug rbd_dev->header.object_prefix memory leak libceph: check data_len in ->alloc_msg() Jianpeng Ma (3): ceph: remove the useless judgement ceph: no need to get parent inode in ceph_open ceph: cleanup use of ceph_msg_get Nicholas Krause (1): libceph: remove the unused macro AES_KEY_SIZE Yan, Zheng (7): ceph: EIO all operations after forced umount ceph: invalidate dirty pages after forced umount ceph: fix queuing inode to mdsdir's snaprealm libceph: set 'exists' flag for newly up osd libceph: use keepalive2 to verify the mon session is alive ceph: get inode size for each append write ceph: improve readahead for file holes drivers/block/rbd.c| 6 ++-- fs/ceph/addr.c | 6 ++-- fs/ceph/caps.c | 8 + fs/ceph/file.c | 14 fs/ceph/mds_client.c | 59 ++ fs/ceph/mds_client.h | 1 + fs/ceph/snap.c | 7 fs/ceph/super.c| 1 + include/linux/ceph/libceph.h | 2 ++ include/linux/ceph/messenger.h | 4 +++ include/linux/ceph/msgr.h | 4 ++- net/ceph/ceph_common.c | 1 + net/ceph/crypto.c | 4 --- net/ceph/messenger.c | 82 +++--- net/ceph/mon_client.c | 37 ++- net/ceph/osd_client.c | 51 ++ net/ceph/osdmap.c | 2 +- 17 files changed, 191 insertions(+), 98 deletions(-)
[GIT PULL] Ceph fixes for -rc6
Hi Linus,

Please pull the following Ceph fixes from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

There are two critical regression fixes for CephFS from Zheng, and an RBD
completion fix for layered images from Ilya.

(Note: git request-pull is complaining that the for-linus branch isn't
referencing the right commit even though it is... hopefully I'm not doing
something stupid. The right commit is
2761713d35e370fd640b5781109f753066b746c4.)

Thanks!
sage

Ilya Dryomov (1):
      rbd: fix copyup completion race

Yan, Zheng (2):
      ceph: fix ceph_encode_locks_to_buffer()
      ceph: always re-send cap flushes when MDS recovers

 drivers/block/rbd.c | 22 +-
 fs/ceph/caps.c      | 22 +-
 fs/ceph/locks.c     |  2 +-
 fs/ceph/super.h     |  1 -
 4 files changed, 23 insertions(+), 24 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[GIT PULL] Ceph fixes for -rc2
Hi Linus,

Please pull the following Ceph fixes from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

There is a fix for CephFS and RBD when used within containers/namespaces,
and a fix for the address learning the client is supposed to do when
initially talking to the Ceph cluster.

There are also two patches updating MAINTAINERS. One breaks out the
common Ceph code shared by fs/ceph and drivers/block/rbd.c into a separate
entry with the appropriate maintainers listed. The second adds a second
reference to the github tree where the Ceph client development takes place
(before it is pushed to korg and then to you).

The goal here is to move closer to a situation where Ilya Dryomov or one
of the other maintainers can push things to you if I am unavailable. Ilya
has done most of the work preparing branches for upstream recently; you
should not be surprised to hear from him if I am trapped in some
internet-less wasteland or hit by a bus or something. In the meantime,
we'll work on getting him added to the kernel web of trust.

Thanks-
sage

Ilya Dryomov (2):
      libceph: enable ceph in a non-default network namespace
      libceph: treat sockaddr_storage with uninitialized family as blank

Sage Weil (2):
      MAINTAINERS: update ceph entries
      MAINTAINERS: add secondary tree for ceph modules

 MAINTAINERS                    | 22 ++
 include/linux/ceph/messenger.h |  3 +++
 net/ceph/ceph_common.c         | 16 ++--
 net/ceph/messenger.c           | 24
 4 files changed, 47 insertions(+), 18 deletions(-)
[GIT PULL] Ceph updates for -rc1
Hi Linus,

Please pull the following Ceph updates from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

We have a pile of bug fixes from Ilya, including a few patches that sync
up the CRUSH code with the latest from userspace.

There is also a long series from Zheng that fixes various issues with
snapshots, inline data, and directory fsync, some simplification and
improvement in the cap release code, and a rework of the caching of
directory contents.

To top it off there are a few small fixes and cleanups from Benoit and
Hong.

Thanks!
sage

Benoît Canet (2):
      libceph: Remove spurious kunmap() of the zero page
      libceph: Fix ceph_tcp_sendpage()'s more boolean usage

Hong Zhiguo (1):
      libceph: fix wrong name "Ceph filesystem for Linux"

Ilya Dryomov (14):
      libceph: use kvfree() instead of open-coding it
      libceph: nuke time_sub()
      libceph: store timeouts in jiffies, verify user input
      libceph: a couple tweaks for wait loops
      ceph: simplify two mount_timeout sites
      rbd: timeout watch teardown on unmap with mount_timeout
      crush: fix crash from invalid 'take' argument
      crush: sync up with userspace
      rbd: bump queue_max_segments
      rbd: terminate rbd_opts_tokens with Opt_err
      rbd: store rbd_options in rbd_device
      rbd: queue_depth map option
      crush: fix a bug in tree bucket decode
      rbd: use GFP_NOIO in rbd_obj_request_create()

Yan, Zheng (23):
      libceph: properly release STAT request's raw_data_in
      libceph: allow setting osd_req_op's flags
      ceph: check OSD caps before read/write
      ceph: use empty snap context for uninline_data and get_pool_perm
      ceph: set i_head_snapc when getting CEPH_CAP_FILE_WR reference
      ceph: avoid sending unnessesary FLUSHSNAP message
      ceph: take snap_rwsem when accessing snap realm's cached_context
      ceph: don't trim auth cap when there are cap snaps
      ceph: make sure syncfs flushes all cap snaps
      ceph: don't pre-allocate space for cap release messages
      ceph: exclude setfilelock requests when calculating oldest tid
      ceph: ratelimit warn messages for MDS closes session
      ceph: don't include used caps in cap_wanted
      ceph: fix flushing caps
      ceph: fix directory fsync
      ceph: track pending caps flushing accurately
      ceph: track pending caps flushing globally
      ceph: send TID of the oldest pending caps flush to MDS
      ceph: re-send flushing caps (which are revoked) in reconnect stage
      ceph: pre-allocate data structure that tracks caps flushing
      ceph: switch some GFP_NOFS memory allocation to GFP_KERNEL
      ceph: rework dcache readdir
      ceph: fix ceph_writepages_start()

 drivers/block/rbd.c             | 111 --
 fs/ceph/acl.c                   |   4 +-
 fs/ceph/addr.c                  | 308 ---
 fs/ceph/caps.c                  | 836 +++-
 fs/ceph/dir.c                   | 383 --
 fs/ceph/file.c                  |  61 ++-
 fs/ceph/inode.c                 | 155 ++--
 fs/ceph/mds_client.c            | 425 +++-
 fs/ceph/mds_client.h            |  23 +-
 fs/ceph/snap.c                  | 173 +
 fs/ceph/super.c                 |  25 +-
 fs/ceph/super.h                 | 125 +++---
 fs/ceph/xattr.c                 |  65 +++-
 include/linux/ceph/libceph.h    |  21 +-
 include/linux/ceph/osd_client.h |   2 +-
 include/linux/crush/crush.h     |  40 +-
 include/linux/crush/hash.h      |   6 +
 include/linux/crush/mapper.h    |   2 +-
 net/ceph/ceph_common.c          |  50 ++-
 net/ceph/crush/crush.c          |  13 +-
 net/ceph/crush/crush_ln_table.h |  32 +-
 net/ceph/crush/hash.c           |   8 +-
 net/ceph/crush/mapper.c         | 148 ---
 net/ceph/messenger.c            |   3 +-
 net/ceph/mon_client.c           |  13 +-
 net/ceph/osd_client.c           |  42 +-
 net/ceph/osdmap.c               |   2 +-
 net/ceph/pagevec.c              |   5 +-
 28 files changed, 2010 insertions(+), 1071 deletions(-)
Re: [GIT PULL] Ceph fixes for -rc5
On Fri, 22 May 2015, Linus Torvalds wrote:
> On Fri, May 22, 2015 at 5:13 PM, Sage Weil wrote:
> > Hi Linus,
> >
> > Please pull the following fixes from
> >
> >   git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git
>
> Nothing there.
>
> Did you perhaps mean the "for-linus" branch?
>
> Please fix whatever script it is you use that generates bad pull requests.

Bah, I forgot to push the for-linus branch--it's there now. Sorry!

(BTW, git://git.kernel.org is going really slowly today... :/)

Thanks-
sage
[GIT PULL] Ceph fixes for -rc5
Hi Linus,

Please pull the following fixes from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git

These fix an issue with the RBD notifications when there are topology
changes in the cluster.

Thanks!
sage

Ilya Dryomov (2):
      libceph: request a new osdmap if lingering request maps to no osd
      Revert "libceph: clear r_req_lru_item in __unregister_linger_request()"

 net/ceph/osd_client.c | 33 -
 1 file changed, 20 insertions(+), 13 deletions(-)
Re: [PATCH RFC] vfs: add a O_NOMTIME flag
On Tue, 12 May 2015, Dave Chinner wrote:
> > > Neither of these example cases are under the control of the
> > > application that calls open(O_NOMTIME).
> >
> > Wouldn't a mount option (e.g., allow_nomtime) address this concern? Only
> > nodes provisioned explicitly to run these systems would enable this
> > option.
>
> Back to my Joe Speedracer comments.
>
> I'm not sure what the right answer is - mount options are simply too
> easy to add without understanding the full implications of them.
> e.g. we didn't merge FALLOC_FL_NO_HIDE_STALE simply because it was
> too dangerous for unsuspecting users. This isn't at that same level
> of concern, but it's still a landmine we want to avoid users
> arming without realising it...
>
> > > >> I'm happy for it to be an ioctl interface - even an XFS specific
> > > >> interface if you want to go that route, Sage - and it probably
> > > >> should emit a warning to syslog first time it is used so there is
> > > >> trace for bug triage purposes. i.e. we know the app is not using
> > > >> mtime updates, so bug reports that are the result of mtime
> > > >> mishandling don't result in large amounts of wasted developer time
> > > >> trying to understand them...
> > > >
> > > > A warning on using the interface (or when mounting with user_nomtime)
> > > > sounds reasonable.
> > > >
> > > > I'd rather not make this XFS specific as other local filesystems
> > > > (ext4, f2fs, possibly btrfs) would similarly benefit. (And if we
> > > > want to target XFS specifically the existing XFS open-by-handle
> > > > ioctl is sufficient as it already does O_NOMTIME unconditionally.)
> > >
> > > Lack of a namespace doesn't imply that you don't want to manage the
> > > data. The whole point of using object storage instead of plain old
> > > block storage is to be able to provide whatever metadata you still
> > > need in order to manage the object.
> >
> > Yeah, agreed--this is presumably why open_by_handle_at(2) (which is
> > what we'd like to use) doesn't assume O_NOMTIME.
>
> Right - the XFS ioctls were designed specifically for applications
> that interacted directly with the structure of XFS filesystems and
> so needed invisible IO (e.g. online defragmenter). IOWs, they are
> not interfaces intended for general usage. They are also only
> available to root, so a typical user application won't be making use
> of them, either.

I understand that's what they're intended for, but I'm having a hard time
parsing out the difference between what they *do* and what O_NOMTIME +
-o allow_nomtime does. The open-by-handle ioctls have nothing to do with
the online XFS format--they simply allow you to open a file via an opaque
handle (albeit a differently formatted one than the generic
open_by_handle_at(2)). They also force you into an O_NOMTIME-equivalent
mode.

AFAICS the only differences are that

1) the ioctl is XFS specific. (As open_by_handle_at(2) demonstrates, this
   needn't be the case.)
2) the NOMTIME mode is only available via the open-by-handle interface,
   not open(2).
3) it is an ioctl interface, and thus more obscure. (Well, there is a
   libhandle library, but it doesn't seem to be widely used.)

Would you object less if

1) the O_NOMTIME flag were only available via open_by_handle_at(2)?
2) an equivalent ioctl were implemented for each file system of interest
   that (say) called into open_by_handle_at(2) code, adding in the
   O_NOMTIME flag?
3) O_NOMTIME required root (vs a mount option that requires root and
   unprivileged O_NOMTIME)?

Just trying to tease apart which part is problematic...

Thanks!
sage
Re: [PATCH RFC] vfs: add a O_NOMTIME flag
On Tue, 12 May 2015, Kevin Easton wrote:
> On Mon, May 11, 2015 at 07:10:21PM -0400, Theodore Ts'o wrote:
> > On Mon, May 11, 2015 at 09:24:09AM -0700, Sage Weil wrote:
> > > > Let me re-ask the question that I asked last week (and was apparently
> > > > ignored). Why not try to use the lazytime feature instead of
> > > > pointing a head straight at the application's --- and system
> > > > administrators' --- heads?
> > >
> > > Sorry Ted, I thought I responded already.
> > >
> > > The goal is to avoid inode writeout entirely when we can, and
> > > as I understand it lazytime will still force writeout before the inode
> > > is dropped from the cache. In systems like Ceph in particular, the
> > > IOs can be spread across lots of files, so simply deferring writeout
> > > doesn't always help.
> >
> > Sure, but it would reduce the writeout by orders of magnitude. I can
> > understand if you want to reduce it further, but it might be good
> > enough for your purposes.
> >
> > I considered doing the equivalent of O_NOMTIME for our purposes at
> > $WORK, and our use case is actually not that different from Ceph's
> > (i.e., using a local disk file system to support a cluster file
> > system), and lazytime was (a) something I figured I could upstream
> > in good conscience, and (b) was more than good enough for us.
>
> A safer alternative might be a chattr file attribute that if set, the
> mtime is not updated on writes, and stat() on the file always shows the
> mtime as "right now". At least that way, the file won't accidentally
> get left out of backups that rely on the mtime.
>
> (If the file attribute is unset, you immediately update the mtime then
> too, and from then on the file is back to normal).

Interesting! I didn't realize there was already a chattr +A that disabled
atime (although I suspect it doesn't do the "right now" for stat thing).
This makes the nomtime-ness a bit more obscure (I don't think most users
would think to check these file attributes), but it's a safer failure
condition for backups at least.

The fact that chattr +A (and hopefully +M) will work for non-root is a
bonus, as we're also trying to get ceph daemons to drop most privileges.

sage
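For anyone who wants to poke at the existing attribute, here is a rough C
sketch of what chattr +A does underneath: it reads the inode flags with
FS_IOC_GETFLAGS and sets FS_NOATIME_FL with FS_IOC_SETFLAGS. The helper
names (flags_have_noatime, get_inode_flags, set_noatime) are made up for
illustration. Only the atime bit is shown; the "+M" counterpart mentioned
above is a wish in this thread, not something this sketch assumes exists.
Not every filesystem implements these ioctls, so callers should expect
errors such as -ENOTTY.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

/* Nonzero if the flag word carries FS_NOATIME_FL (the bit chattr +A sets). */
int flags_have_noatime(int flags)
{
	return (flags & FS_NOATIME_FL) != 0;
}

/* Read the inode flags of `path`. Returns 0 and fills *flags, or -errno. */
int get_inode_flags(const char *path, int *flags)
{
	int fd, err = 0;

	assert(path && flags);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;
	if (ioctl(fd, FS_IOC_GETFLAGS, flags) < 0)
		err = -errno;
	close(fd);
	return err;
}

/*
 * Set (enable != 0) or clear FS_NOATIME_FL on `path`, leaving the other
 * flags untouched. Returns 0 or -errno. The caller must own the file or
 * hold CAP_FOWNER.
 */
int set_noatime(const char *path, int enable)
{
	int fd, flags, err = 0;

	assert(path);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;
	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
		err = -errno;
	} else {
		if (enable)
			flags |= FS_NOATIME_FL;
		else
			flags &= ~FS_NOATIME_FL;
		if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0)
			err = -errno;
	}
	close(fd);
	return err;
}
```

Something like set_noatime("/var/lib/ceph/some-object", 1) would be the
programmatic equivalent of chattr +A on that (hypothetical) path.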
Re: [PATCH RFC] vfs: add a O_NOMTIME flag
On Mon, 11 May 2015, Trond Myklebust wrote:
> On Mon, May 11, 2015 at 12:39 PM, Sage Weil wrote:
> > On Mon, 11 May 2015, Dave Chinner wrote:
> >> On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote:
> >> > On Fri, May 8, 2015 at 6:24 PM, Sage Weil wrote:
> >> > > I'm sure you realize what we're trying to achieve is the same
> >> > > "invisible IO" that the XFS open by handle ioctls do by default.
> >> > > Would you be more comfortable if this option were only available
> >> > > to the generic open_by_handle syscall, and not to open(2)?
> >> >
> >> > It should be an ioctl(). It has no business being part of
> >> > open_by_handle either, since that is another generic interface.
> >
> > Our use-case doesn't make sense on network file systems, but it does on
> > any reasonably featureful local filesystem, and the goal is to be generic
> > there. If mtime is critical to a network file system's consistency it
> > seems pretty reasonable to disallow/ignore it for just that file system
> > (e.g., by masking off the flag at open time), as others won't have that
> > same problem (cephfs doesn't, for example).
> >
> > Perhaps making each fs opt-in instead of handling it in a generic path
> > would alleviate this concern?
>
> The issue isn't whether or not you have a network file system, it's
> whether or not you want users to be able to manage data. mtime isn't
> useful for the application (which knows whether or not it has changed
> the file) or for the filesystem (ditto). It exists, rather, in order
> to enable data management by users and other applications, letting
> them know whether or not the data contents of the file have changed,
> and when that change occurred.

Agreed.

> If you are able to guarantee that your users don't care about that,
> then fine, but that would be a very special case that doesn't fit the
> way that most data centres are run. Backups are one case where mtime
> matters, tiering and archiving is another.

This is true, although I argue it is becoming increasingly common for the
data management (including backups and so forth) to be layered not on top
of the POSIX file system but on something higher up in the stack. This is
true of pretty much any distributed system (ceph, cassandra, mongo, etc.,
and I assume commercial databases like Oracle, too) where backups,
replication, and any other DR strategies need to be orchestrated across
nodes to be consistent--simply copying files out from underneath them is
already insufficient and a recipe for disaster.

There is a growing category of applications that can benefit from this
capability...

> Neither of these example
> cases are under the control of the application that calls
> open(O_NOMTIME).

Wouldn't a mount option (e.g., allow_nomtime) address this concern? Only
nodes provisioned explicitly to run these systems would enable this
option.

> >> I'm happy for it to be an ioctl interface - even an XFS specific
> >> interface if you want to go that route, Sage - and it probably
> >> should emit a warning to syslog first time it is used so there is
> >> trace for bug triage purposes. i.e. we know the app is not using
> >> mtime updates, so bug reports that are the result of mtime
> >> mishandling don't result in large amounts of wasted developer time
> >> trying to understand them...
> >
> > A warning on using the interface (or when mounting with user_nomtime)
> > sounds reasonable.
> >
> > I'd rather not make this XFS specific as other local filesystems (ext4,
> > f2fs, possibly btrfs) would similarly benefit. (And if we want to target
> > XFS specifically the existing XFS open-by-handle ioctl is sufficient as it
> > already does O_NOMTIME unconditionally.)
>
> Lack of a namespace doesn't imply that you don't want to manage the
> data. The whole point of using object storage instead of plain old
> block storage is to be able to provide whatever metadata you still
> need in order to manage the object.

Yeah, agreed--this is presumably why open_by_handle_at(2) (which is what
we'd like to use) doesn't assume O_NOMTIME.

Thanks!
sage
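For readers who haven't used the generic handle interface mentioned
above, here is a rough C sketch of the round trip: name_to_handle_at(2)
encodes an opaque handle for a path, and open_by_handle_at(2) opens a file
from that handle. The function name reopen_by_handle is invented for
illustration. Note that open_by_handle_at(2) requires CAP_DAC_READ_SEARCH,
so an unprivileged caller should expect -EPERM, and some filesystems do
not support handles at all (-EOPNOTSUPP). The sketch passes ordinary
open(2) flags only; O_NOMTIME is the flag proposed in this thread and is
not assumed to exist.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Encode an opaque handle for `path`, then reopen the file through it.
 * Returns an open fd (>= 0) on success or -errno on failure.
 */
int reopen_by_handle(const char *path, int flags)
{
	struct file_handle *fh;
	int mount_id, mount_fd, fd, ret;

	assert(path);

	/* The handle is variable-sized; MAX_HANDLE_SZ is the upper bound. */
	fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
	if (!fh)
		return -ENOMEM;
	fh->handle_bytes = MAX_HANDLE_SZ;

	if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) < 0) {
		ret = -errno;
		free(fh);
		return ret;
	}

	/*
	 * open_by_handle_at() needs an fd for any object on the same
	 * mounted filesystem; opening the file itself is the simplest way
	 * to get one in a sketch.
	 */
	mount_fd = open(path, O_RDONLY);
	if (mount_fd < 0) {
		ret = -errno;
		free(fh);
		return ret;
	}

	fd = open_by_handle_at(mount_fd, fh, flags);
	ret = fd < 0 ? -errno : fd;

	close(mount_fd);
	free(fh);
	return ret;
}
```

A real object store would persist the handle bytes and skip the path
lookup on later opens; the sketch reopens immediately only to keep the
example short.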
Re: [PATCH RFC] vfs: add a O_NOMTIME flag
On Mon, 11 May 2015, Dave Chinner wrote:
> On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote:
> > On Fri, May 8, 2015 at 6:24 PM, Sage Weil wrote:
> > > I'm sure you realize what we're trying to achieve is the same "invisible IO"
> > > that the XFS open by handle ioctls do by default. Would you be more
> > > comfortable if this option were only available to the generic
> > > open_by_handle syscall, and not to open(2)?
> >
> > It should be an ioctl(). It has no business being part of
> > open_by_handle either, since that is another generic interface.

Our use-case doesn't make sense on network file systems, but it does on any reasonably featureful local filesystem, and the goal is to be generic there. If mtime is critical to a network file system's consistency it seems pretty reasonable to disallow/ignore it for just that file system (e.g., by masking off the flag at open time), as others won't have that same problem (cephfs doesn't, for example). Perhaps making each fs opt-in instead of handling it in a generic path would alleviate this concern?

> I'm happy for it to be an ioctl interface - even an XFS specific
> interface if you want to go that route, Sage - and it probably
> should emit a warning to syslog first time it is used so there is
> trace for bug triage purposes. i.e. we know the app is not using
> mtime updates, so bug reports that are the result of mtime
> mishandling don't result in large amounts of wasted developer time
> trying to understand them...

A warning on using the interface (or when mounting with user_nomtime) sounds reasonable.

I'd rather not make this XFS specific as other local filesystems (ext4, f2fs, possibly btrfs) would similarly benefit. (And if we want to target XFS specifically the existing XFS open-by-handle ioctl is sufficient as it already does O_NOMTIME unconditionally.)

sage
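The per-filesystem opt-in and the warn-on-first-use idea from the message above can be sketched in userspace C. This is only a toy model, not kernel code: `TOY_O_NOMTIME` is a made-up value (O_NOMTIME exists only in this RFC), and `struct toy_fs` and `toy_open_flags()` are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Made-up flag value; O_NOMTIME is only proposed in this thread. */
#define TOY_O_NOMTIME 0x1000000u

/* Hypothetical stand-in for a filesystem's per-mount state. */
struct toy_fs {
    const char *name;
    bool honours_nomtime;  /* the filesystem has opted in */
    bool warned;           /* first-use warning already emitted */
};

/*
 * Return the open flags the filesystem will actually honour.
 * A filesystem that has not opted in silently masks the flag off;
 * one that has opted in keeps it and bumps *warnings on first use.
 */
static unsigned int toy_open_flags(struct toy_fs *fs, unsigned int flags,
                                   int *warnings)
{
    assert(fs != NULL && warnings != NULL);

    if (!(flags & TOY_O_NOMTIME))
        return flags;
    if (!fs->honours_nomtime)
        return flags & ~TOY_O_NOMTIME;
    if (!fs->warned) {
        fs->warned = true;
        (*warnings)++;  /* stands in for a once-only syslog message */
    }
    return flags;
}
```

With this shape a network filesystem simply leaves `honours_nomtime` unset and never sees the flag, while a local filesystem that opts in leaves a single trace for bug triage.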
Re: [PATCH RFC] vfs: add a O_NOMTIME flag
On Mon, 11 May 2015, Theodore Ts'o wrote:
> On Sun, May 10, 2015 at 07:13:24PM -0400, Trond Myklebust wrote:
> > That makes it completely non-generic though. By putting this in the
> > VFS, you are giving applications a loaded gun that is pointed straight
> > at the application user's head.
>
> Let me re-ask the question that I asked last week (and was apparently
> ignored). Why not try to use the lazytime feature instead of
> pointing a gun straight at the application's --- and system
> administrators' --- heads?

Sorry Ted, I thought I responded already. The goal is to avoid inode writeout entirely when we can, and as I understand it lazytime will still force writeout before the inode is dropped from the cache. In systems like Ceph in particular, the IOs can be spread across lots of files, so simply deferring writeout doesn't always help.

sage
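The arithmetic behind that reply can be made concrete with a deliberately crude count of inode writeouts. The model below is an assumption-laden sketch, not a description of real kernel behaviour: it assumes every overwrite changes the timestamp and is followed by a sync in the default case, and that lazytime costs exactly one deferred writeout per touched inode (as the message describes it). The names `toy_policy` and `toy_inode_writeouts()` are hypothetical.

```c
#include <assert.h>

enum toy_policy { TOY_DEFAULT, TOY_LAZYTIME, TOY_NOMTIME };

/*
 * Count inode (metadata) writeouts for nfiles files, each overwritten
 * and synced writes_per_file times, under the simplified assumptions
 * stated in the lead-in.
 */
static unsigned long toy_inode_writeouts(enum toy_policy policy,
                                         unsigned long nfiles,
                                         unsigned long writes_per_file)
{
    if (nfiles == 0 || writes_per_file == 0)
        return 0;

    switch (policy) {
    case TOY_DEFAULT:
        return nfiles * writes_per_file; /* one per write+sync */
    case TOY_LAZYTIME:
        return nfiles;                   /* one per inode, deferred */
    case TOY_NOMTIME:
        return 0;                        /* inode never dirtied */
    }
    assert(0 && "unknown policy");
    return 0;
}
```

When a few files absorb many writes, deferring helps a great deal (10 files x 100 writes drops from 1000 to 10 writeouts in this model). When writes are spread one per file, deferring saves nothing, which is the case the reply is concerned with.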
Re: [PATCH RFC] vfs: add a O_NOMTIME flag
On Thu, 7 May 2015, Trond Myklebust wrote:
> On Thu, May 7, 2015 at 9:01 PM, Sage Weil wrote:
> > On Thu, 7 May 2015, Zach Brown wrote:
> >> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
> >> > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
> >> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> >> > > owning the file or having the CAP_FOWNER capability. If we're not
> >> > > comfortable allowing owners to prevent mtime/ctime updates then we
> >> > > should add a tunable to allow O_NOMTIME. Maybe a mount option?
> >> >
> >> > I dislike "turn off safety for performance" options because Joe
> >> > SpeedRacer will always select performance over safety.
> >>
> >> Well, for ceph there's no safety concern. They never use cmtime in
> >> these files.
> >>
> >> So are you suggesting not implementing this and making them rework their
> >> IO paths to avoid the fs maintaining mtime so that we don't give Joe
> >> Speedracer more rope? Or are we talking about adding some speed bumps
> >> that ceph can flip on that might give Joe Speedracer pause?
> >
> > I think this is the fundamental question: who do we give the ammunition
> > to, the user or app writer, or the sysadmin?
> >
> > One might argue that we gave the user a similar power with O_NOATIME (the
> > power to break applications that assume atime is accurate). Here we give
> > developers/users the power to not update mtime and suffer the consequences
> > (like, obviously, breaking mtime-based backups). It should be pretty
> > obvious to anyone using the flag what the consequences are.
> >
> > Note that we can suffer similar lapses in mtime with fdatasync followed by
> > a system crash. And as Andy points out it's semi-broken for writable
> > mmap. The crash case is obviously a slightly different thing, but the
> > idea that mtime can't always be trusted certainly isn't crazy talk.
> >
> > Or, we can be conservative and require a mount option so that the admin
> > has to explicitly allow behavior that might break some existing
> > assumptions about mtime/ctime ('-o user_nomtime' I guess?).
> >
> > I'm happy either way, so long as in the end an unprivileged ceph daemon
> > avoids the useless work. In our case we always own the entire mount/disk,
> > so a mount option is just fine.
> >
>
> So, what is the expectation here for filesystems that cannot support
> this flag? NFSv3 in particular would break pretty catastrophically if
> someone decided on a whim to turn off mtime: they will have turned off
> the client's ability to detect cache incoherencies.

Is this based on mtime or ctime? If the former, could things also break if a user does, say, some stat(2), write(2), utimes(2) shenanigans?

So, my assumption is that if the mount option isn't there allowing this then O_NOMTIME would be a no-op (as opposed to EPERM or something)... but maybe that's not the right thing to do. Whatever we do there, though, I suppose NFS would do the same thing?

sage
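The open question at the end of that message (silently ignore the flag, or fail with EPERM, when the mount does not allow it) is easier to compare side by side. The sketch below is a userspace toy, not kernel code: `TOY_O_NOMTIME`, `struct toy_mount`, `enum toy_deny_mode` and `toy_may_open_nomtime()` are all hypothetical, and the ownership check is reduced to a boolean that stands for "owns the file or has CAP_FOWNER", the criterion quoted from the patch description.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Made-up flag value; O_NOMTIME is only proposed in this thread. */
#define TOY_O_NOMTIME 0x1000000u

/* What to do when the mount has not enabled the feature. */
enum toy_deny_mode { TOY_DENY_IGNORE, TOY_DENY_EPERM };

/* Hypothetical per-mount state: was the allow option given? */
struct toy_mount {
    bool allow_nomtime;
};

/*
 * Decide how an open() carrying TOY_O_NOMTIME is treated.
 * Returns 0 on success or -EPERM; in TOY_DENY_IGNORE mode the flag is
 * cleared in *flags so the open proceeds with normal mtime updates.
 */
static int toy_may_open_nomtime(const struct toy_mount *mnt,
                                bool owner_or_fowner,
                                enum toy_deny_mode mode,
                                unsigned int *flags)
{
    assert(mnt != NULL && flags != NULL);

    if (!(*flags & TOY_O_NOMTIME))
        return 0;
    if (!owner_or_fowner)
        return -EPERM;          /* same criterion as O_NOATIME */
    if (mnt->allow_nomtime)
        return 0;
    if (mode == TOY_DENY_EPERM)
        return -EPERM;
    *flags &= ~TOY_O_NOMTIME;   /* no-op variant: drop the flag */
    return 0;
}
```

The no-op variant keeps unmodified applications working on any mount, at the cost of hiding from the caller that the optimization is not in effect; the EPERM variant makes that visible but forces callers to retry without the flag.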
Re: [PATCH RFC] vfs: add a O_NOMTIME flag
On Sat, 9 May 2015, Dave Chinner wrote:
> On Thu, May 07, 2015 at 09:23:24PM -0400, Trond Myklebust wrote:
> > On Thu, May 7, 2015 at 9:01 PM, Sage Weil wrote:
> > > On Thu, 7 May 2015, Zach Brown wrote:
> > >> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
> > >> > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
> > >> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> > >> > > owning the file or having the CAP_FOWNER capability. If we're not
> > >> > > comfortable allowing owners to prevent mtime/ctime updates then we
> > >> > > should add a tunable to allow O_NOMTIME. Maybe a mount option?
> > >> >
> > >> > I dislike "turn off safety for performance" options because Joe
> > >> > SpeedRacer will always select performance over safety.
> > >>
> > >> Well, for ceph there's no safety concern. They never use cmtime in
> > >> these files.
> > >>
> > >> So are you suggesting not implementing this and making them rework their
> > >> IO paths to avoid the fs maintaining mtime so that we don't give Joe
> > >> Speedracer more rope? Or are we talking about adding some speed bumps
> > >> that ceph can flip on that might give Joe Speedracer pause?
> > >
> > > I think this is the fundamental question: who do we give the ammunition
> > > to, the user or app writer, or the sysadmin?
> > >
> > > One might argue that we gave the user a similar power with O_NOATIME (the
> > > power to break applications that assume atime is accurate). Here we give
> > > developers/users the power to not update mtime and suffer the consequences
> > > (like, obviously, breaking mtime-based backups). It should be pretty
> > > obvious to anyone using the flag what the consequences are.
> > >
> > > Note that we can suffer similar lapses in mtime with fdatasync followed by
> > > a system crash. And as Andy points out it's semi-broken for writable
> > > mmap. The crash case is obviously a slightly different thing, but the
> > > idea that mtime can't always be trusted certainly isn't crazy talk.
> > >
> > > Or, we can be conservative and require a mount option so that the admin
> > > has to explicitly allow behavior that might break some existing
> > > assumptions about mtime/ctime ('-o user_nomtime' I guess?).
> > >
> > > I'm happy either way, so long as in the end an unprivileged ceph daemon
> > > avoids the useless work. In our case we always own the entire mount/disk,
> > > so a mount option is just fine.
> > >
> >
> > So, what is the expectation here for filesystems that cannot support
> > this flag? NFSv3 in particular would break pretty catastrophically if
> > someone decided on a whim to turn off mtime: they will have turned off
> > the client's ability to detect cache incoherencies.
>
> It's worse than that, now that I think about it. I think nomtime
> will break nfsv4 as the I_VERSION check is done *after* the
> NO[C]MTIME checks. e.g. the atomic change count used to detect file
> changes is only updated during the mtime update on write() calls in
> XFS. i.e. when the timestamp is changed, a transaction to change
> mtime is run, and that transaction commit bumps the change count.
>
> So cutting out mtime updates at the VFS will prevent XFS and other
> I_VERSION aware filesystems from updating the change count that
> NFSv4 clients rely on to detect foreign data changes in a file.
>
> Not sure what to do here, because the current NOCMTIME
> implementation intentionally cuts out the timestamp update because
> its usage is fully invisible IO. i.e. it is used by utilities like
> xfs_fsr and HSMs to move data into and out of files without the
> application being able to detect the data movement in any way. These
> are not data modification operations, though - the file contents as
> read by the application do not change despite the fact we are moving
> data in and out of the file. In this case we don't want timestamps
> or change counters to change on the data movement, so I think we've
> actually got a difference in behaviour here between O_NOMTIME and
> O_NOCMTIME, right?
>
> i.e. for nfsv4 sanity O_NOMTIME still needs to bump I_VERSION on
> write, just not modify the timestamp? In which case, not modifying
> the timestamps gains us nothing, because the inode is still dirtied?

Right: if we dirty the inode we've defeated the purpose of the patch.

> The list of caveats on O_NOMTIME seems to be growing...

...and remain consistent with our goals. We couldn't care less if NFS or backup software or anything else doesn't notice these changes. This is private data that is wholly managed by the ceph daemon. The goal is to derive *some* value from the file system and avoid reimplementing it in userspace (without the bits we don't need).

I'm sure you realize what we're trying to achieve is the same "invisible IO" that the XFS open by handle ioctls do by default. Would you be more comfortable if this option were only available to the generic open_by_handle syscall, and not to open(2)?

sage
Re: [PATCH RFC] vfs: add a O_NOMTIME flag
On Thu, 7 May 2015, Zach Brown wrote:
> On Thu, May 07, 2015 at 10:26:17AM +1000, Dave Chinner wrote:
> > On Wed, May 06, 2015 at 03:00:12PM -0700, Zach Brown wrote:
> > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> > > owning the file or having the CAP_FOWNER capability. If we're not
> > > comfortable allowing owners to prevent mtime/ctime updates then we
> > > should add a tunable to allow O_NOMTIME. Maybe a mount option?
> >
> > I dislike "turn off safety for performance" options because Joe
> > SpeedRacer will always select performance over safety.
>
> Well, for ceph there's no safety concern. They never use cmtime in
> these files.
>
> So are you suggesting not implementing this and making them rework their
> IO paths to avoid the fs maintaining mtime so that we don't give Joe
> Speedracer more rope? Or are we talking about adding some speed bumps
> that ceph can flip on that might give Joe Speedracer pause?

I think this is the fundamental question: who do we give the ammunition to, the user or app writer, or the sysadmin?

One might argue that we gave the user a similar power with O_NOATIME (the power to break applications that assume atime is accurate). Here we give developers/users the power to not update mtime and suffer the consequences (like, obviously, breaking mtime-based backups). It should be pretty obvious to anyone using the flag what the consequences are.

Note that we can suffer similar lapses in mtime with fdatasync followed by a system crash. And as Andy points out it's semi-broken for writable mmap. The crash case is obviously a slightly different thing, but the idea that mtime can't always be trusted certainly isn't crazy talk.

Or, we can be conservative and require a mount option so that the admin has to explicitly allow behavior that might break some existing assumptions about mtime/ctime ('-o user_nomtime' I guess?).

I'm happy either way, so long as in the end an unprivileged ceph daemon avoids the useless work. In our case we always own the entire mount/disk, so a mount option is just fine.

Thanks!
sage
Re: [PATCH RFC] vfs: add a O_NOMTIME flag
On Wed, 6 May 2015, Zach Brown wrote:
> On Wed, May 06, 2015 at 03:19:13PM -0700, Sage Weil wrote:
> > On Wed, 6 May 2015, Trond Myklebust wrote:
> > > Hi Zach,
> > >
> > > On Wed, May 6, 2015 at 6:00 PM, Zach Brown wrote:
> > > >
> > > > Add the O_NOMTIME flag which prevents mtime from being updated which can
> > > > greatly reduce the IO overhead of writes to allocated and initialized
> > > > regions of files.
> > > >
> > > > ceph servers can have loads where they perform O_DIRECT overwrites of
> > > > allocated file data and then sync to make sure that the O_DIRECT writes
> > > > are flushed from write caches. If the writes dirty the inode with mtime
> > > > updates then the syncs also write out the metadata needed to track the
> > > > inodes which can add significant iop and latency overhead.
> > > >
> > > > The ceph servers don't use mtime at all. They're using the local file
> > > > system as a backing store and any backups would be driven by their upper
> > > > level ceph metadata. For ceph, slow IO from mtime updates in the file
> > > > system is as daft as if we had block devices slowing down IO for
> > > > per-block write timestamps that file systems never use.
> > > >
> > > > In simple tests an O_DIRECT|O_NOMTIME overwriting write followed by a
> > > > sync went from 2 serial write round trips to 1 in XFS and from 4 serial
> > > > IO round trips to 1 in ext4.
> > > >
> > > > file_update_time() checks for O_NOMTIME and aborts the update if it's
> > > > set, just like the current check for the in-kernel inode flag
> > > > S_NOCMTIME. I didn't update any other mtime update sites. They could be
> > > > added as we decide that it's appropriate to do so.
> > > >
> > > > I opted not to name the flag O_NOCMTIME because I didn't want the name
> > > > to imply that ctime updates would be prevented for other inode changes
> > > > like updating i_size in truncate. Not updating ctime is a side-effect
> > > > of removing mtime updates when it's the only thing changing in the
> > > > inode.
> > > >
> > > > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> > > > owning the file or having the CAP_FOWNER capability. If we're not
> > > > comfortable allowing owners to prevent mtime/ctime updates then we
> > > > should add a tunable to allow O_NOMTIME. Maybe a mount option?
> > > >
> > >
> > > Just out of curiosity, if you need to modify the application anyway,
> > > why wouldn't use of fdatasync() when flushing be able to offer a
> > > similar performance boost?
> >
> > Although fdatasync(2) doesn't have to update synchronously, it does
> > eventually get written, and that can trigger lots of unwanted IO.
>
> And the unwanted IO is per file. Are there circumstances where the
> write:file ratio is small enough that dirty inode writes could start to
> add up to meaningful write amplification?

Yeah, exactly: in some not-so-uncommon workloads it's approaching 1:1.

sage
Re: [PATCH RFC] vfs: add a O_NOMTIME flag
On Wed, 6 May 2015, Trond Myklebust wrote:
> Hi Zach,
>
> On Wed, May 6, 2015 at 6:00 PM, Zach Brown wrote:
> >
> > Add the O_NOMTIME flag which prevents mtime from being updated which can
> > greatly reduce the IO overhead of writes to allocated and initialized
> > regions of files.
> >
> > ceph servers can have loads where they perform O_DIRECT overwrites of
> > allocated file data and then sync to make sure that the O_DIRECT writes
> > are flushed from write caches. If the writes dirty the inode with mtime
> > updates then the syncs also write out the metadata needed to track the
> > inodes which can add significant iop and latency overhead.
> >
> > The ceph servers don't use mtime at all. They're using the local file
> > system as a backing store and any backups would be driven by their upper
> > level ceph metadata. For ceph, slow IO from mtime updates in the file
> > system is as daft as if we had block devices slowing down IO for
> > per-block write timestamps that file systems never use.
> >
> > In simple tests a O_DIRECT|O_NOMTIME overwriting write followed by a
> > sync went from 2 serial write round trips to 1 in XFS and from 4 serial
> > IO round trips to 1 in ext4.
> >
> > file_update_time() checks for O_NOMTIME and aborts the update if it's
> > set, just like the current check for the in-kernel inode flag
> > S_NOCMTIME. I didn't update any other mtime update sites. They could be
> > added as we decide that it's appropriate to do so.
> >
> > I opted not to name the flag O_NOCMTIME because I didn't want the name
> > to imply that ctime updates would be prevented for other inode changes
> > like updating i_size in truncate. Not updating ctime is a side-effect
> > of removing mtime updates when it's the only thing changing in the
> > inode.
> >
> > The criteria for using O_NOMTIME is the same as for using O_NOATIME:
> > owning the file or having the CAP_FOWNER capability. If we're not
> > comfortable allowing owners to prevent mtime/ctime updates then we
> > should add a tunable to allow O_NOMTIME. Maybe a mount option?
> >
>
> Just out of curiosity, if you need to modify the application anyway,
> why wouldn't use of fdatasync() when flushing be able to offer a
> similar performance boost?

Although fdatasync(2) doesn't have to update synchronously, it does
eventually get written, and that can trigger lots of unwanted IO. In
practice we fsync(2) to avoid deferred IO that we can't control/bound,
but that's a long and sad story.

O_NOMTIME would make for a much better ending!

sage
[GIT PULL] Ceph RBD fix for -rc2
Hi Linus,

Please pull the following RBD fix from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

Thanks!
sage


Ilya Dryomov (1):
      rbd: end I/O the entire obj_request on error

 drivers/block/rbd.c | 5 +
 1 file changed, 5 insertions(+)
[GIT PULL] Ceph updates for 4.1-rc1
Hi Linus,

Please pull the following Ceph updates from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

This time around we have a collection of CephFS fixes from Zheng around
MDS failure handling and snapshots, support for a new CRUSH straw2
algorithm (to sync up with userspace) and several RBD cleanups and fixes
from Ilya, an error path leak fix from Taesoo, and then an assorted
collection of cleanups from others.

Thanks!
sage


Fabian Frederick (1):
      ceph: remove redundant declaration

Ilya Dryomov (12):
      rbd: be more informative on -ENOENT failures
      libceph: don't overwrite specific con error msgs
      rbd: mark block queue as non-rotational
      libceph, ceph: split ceph_show_options()
      libceph: expose client options through debugfs
      ceph: show non-default options only
      libceph: simplify our debugfs attr macro
      crush: drop unnecessary include from mapper.c
      crush: ensuring at most num-rep osds are selected
      crush: straw2 bucket type with an efficient 64-bit crush_ln()
      libceph: announce support for straw2 buckets
      rbd: rbd_wq comment is obsolete

Joe Perches (1):
      libceph: osdmap.h: Add missing format newlines

Nicholas Mc Guire (2):
      ceph: use msecs_to_jiffies for time conversion
      ceph: match wait_for_completion_timeout return type

Sanidhya Kashyap (1):
      ceph: kstrdup() memory handling

Taesoo Kim (1):
      ceph: properly release page upon error

Yan, Zheng (10):
      ceph: drop cap releases in requests composed before cap reconnect
      ceph: fix dcache/nocache mount option
      ceph: keep i_snap_realm while there are writers
      ceph: don't mark dirty caps when there is no auth cap
      ceph: don't zero i_wrbuffer_ref when reconnecting is denied
      ceph: cleanup unsafe requests when reconnecting is denied
      ceph: hold on to exclusive caps on complete directories
      ceph: fix null pointer dereference in send_mds_reconnect()
      ceph: rename snapshot support
      ceph: fix uninline data function

 drivers/block/rbd.c                |  26 --
 fs/ceph/addr.c                     |  38 ++---
 fs/ceph/caps.c                     |  51 ---
 fs/ceph/dir.c                      |  48 ---
 fs/ceph/mds_client.c               |  61 +
 fs/ceph/strings.c                  |   1 +
 fs/ceph/super.c                    |  56 ++--
 fs/ceph/super.h                    |   4 +-
 fs/ceph/xattr.c                    |  23 +++--
 include/linux/ceph/ceph_features.h |  16 +++-
 include/linux/ceph/ceph_fs.h       |   1 +
 include/linux/ceph/debugfs.h       |   8 +-
 include/linux/ceph/libceph.h       |   2 +
 include/linux/ceph/osdmap.h        |   5 +-
 include/linux/crush/crush.h        |  12 ++-
 net/ceph/ceph_common.c             |  37
 net/ceph/crush/crush.c             |  14 +++
 net/ceph/crush/crush_ln_table.h    | 166
 net/ceph/crush/mapper.c            | 118 +++--
 net/ceph/debugfs.c                 |  24 ++
 net/ceph/messenger.c               |  25 +++---
 net/ceph/osdmap.c                  |  25 ++
 22 files changed, 633 insertions(+), 128 deletions(-)
 create mode 100644 net/ceph/crush/crush_ln_table.h
Re: randconfig build error with next-20150421, in net/ceph
On Tue, 21 Apr 2015, Guenter Roeck wrote:
> On Tue, Apr 21, 2015 at 08:10:44AM -0700, Jim Davis wrote:
> > Building with the attached random configuration file,
> >
> > ERROR: "__divdi3" [net/ceph/libceph.ko] undefined!
>
> Commit 7321f19d ("crush: straw2 bucket type with an efficient 64-bit
> crush_ln()").
>
> +	draw = ln / w;
>
> where 'ln' is 64 bit.
>
> Some other oddities in that patch, such as
>
> +#if defined(__linux__)
> +#include <linux/types.h>
> +#elif defined(__FreeBSD__)
> +#include <sys/types.h>
> +#endif
>
> and lots of coding style violations.

Thanks for the report--we'll fix it up!

sage
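For context on the error above: on 32-bit targets GCC lowers a signed
64-bit '/' into a call to the libgcc routine __divdi3, and the kernel
does not link against libgcc, hence the undefined symbol. In-kernel
code usually routes such divisions through div64_s64() from
<linux/math64.h>. The sketch below is a userspace model only, written
under stated assumptions: sketch_div64_s64() and straw2_draw() are
hypothetical names standing in for the kernel helper and for the
quoted 'draw = ln / w' line, and nothing here claims to be the fix
that was eventually applied.

```c
#include <stdint.h>

/*
 * Userspace stand-in for the kernel's div64_s64(). In a 32-bit kernel
 * build the real helper performs the division without emitting a call
 * to __divdi3; here it is plain C division so the sketch can run
 * anywhere.
 */
static inline int64_t sketch_div64_s64(int64_t dividend, int64_t divisor)
{
	return dividend / divisor;
}

/*
 * Hypothetical model of the quoted line: divide a 64-bit (negative)
 * log value by a bucket item's weight. The caller must pass a
 * non-zero weight.
 */
static int64_t straw2_draw(int64_t ln, int64_t weight)
{
	return sketch_div64_s64(ln, weight);
}
```

The point of the pattern is that every 64-bit division goes through a
single named helper, so a 32-bit build never depends on compiler
runtime support that the kernel does not provide.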
[GIT PULL] Ceph fix for -rc8
Hi Linus,

Please pull the following patch from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

This corrects a recent misadventure with __GFP_MEMALLOC and PF_MEMALLOC;
it turns out it's not a good fit for RBD and we're better off relying on
dirty page throttling.

Thanks!
sage


Ilya Dryomov (1):
      Revert "libceph: use memalloc flags for net IO"

 net/ceph/messenger.c | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)
[GIT PULL] Ceph changes for {3.20,4.0}-rc1
Hi Linus,

Please pull the following Ceph updates from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

On the RBD side, there is a conversion to blk-mq from Christoph, several
long-standing bug fixes from Ilya, and some cleanup from Rickard
Strandqvist. On the CephFS side there is a long list of fixes from
Zheng, including improved session handling, a few IO path fixes, some
dcache management correctness fixes, and several blocking while
!TASK_RUNNING fixes. The core code gets a few cleanups and Chaitanya has
added support for TCP_NODELAY (which has been used on the server side
for ages but we somehow missed on the kernel client).

There is also an update to MAINTAINERS to fix up some email addresses
and reflect that Ilya and Zheng are doing most of the maintenance for
RBD and CephFS these days. Do not be surprised to see a pull request
come from one of them in the future if I am unavailable for some reason.

Thanks!
sage


Chaitanya Huilgol (1):
      libceph: tcp_nodelay support

Christoph Hellwig (1):
      rbd: convert to blk-mq

Ilya Dryomov (7):
      libceph: nuke pool op infrastructure
      libceph: use mon_client.c/put_generic_request() more
      rbd: fix error paths in rbd_dev_refresh()
      rbd: do not treat standalone as flatten
      ceph: show nocephx_require_signatures and notcp_nodelay options
      libceph: fix double __remove_osd() problem
      libceph: kfree() in put_osd() shouldn't depend on authorizer

Rickard Strandqvist (2):
      rbd: nuke copy_token()
      ceph: acl: Remove unused function

Sage Weil (1):
      MAINTAINERS: update Ceph and RBD maintainers

Yan, Zheng (15):
      ceph: handle SESSION_FORCE_RO message
      ceph: properly zero data pages for file holes.
      ceph: improve reference tracking for snaprealm
      ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_sync)
      ceph: avoid block operation when !TASK_RUNNING (ceph_get_caps)
      ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_close_sessions)
      ceph: fix reading inline data when i_size > PAGE_SIZE
      ceph: fix request time stamp encoding
      ceph: provide seperate {inode,file}_operations for snapdir
      client: include kernel version in client metadata
      ceph: properly mark empty directory as complete
      ceph: fix atomic_open snapdir
      ceph: re-send requests when MDS enters reconnecting stage
      ceph: fix dentry leaks
      ceph: return error for traceless reply race

 MAINTAINERS                     |   7 +-
 drivers/block/rbd.c             | 193 +--
 fs/ceph/acl.c                   |  14 ---
 fs/ceph/addr.c                  |  19 ++--
 fs/ceph/caps.c                  | 127 +++---
 fs/ceph/dir.c                   |  33 +--
 fs/ceph/file.c                  |  37 +---
 fs/ceph/inode.c                 |  41 +
 fs/ceph/mds_client.c            | 127 +++---
 fs/ceph/mds_client.h            |   2 +
 fs/ceph/snap.c                  |  54 +++
 fs/ceph/super.c                 |   4 +
 fs/ceph/super.h                 |   5 +-
 include/linux/ceph/ceph_fs.h    |  37 +---
 include/linux/ceph/libceph.h    |   3 +-
 include/linux/ceph/messenger.h  |   4 +-
 include/linux/ceph/mon_client.h |   9 +-
 net/ceph/ceph_common.c          |  16 +++-
 net/ceph/ceph_strings.c         |  14 ---
 net/ceph/debugfs.c              |   2 -
 net/ceph/messenger.c            |  14 ++-
 net/ceph/mon_client.c           | 139 +---
 net/ceph/osd_client.c           |  31 +--
 23 files changed, 444 insertions(+), 488 deletions(-)
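For readers unfamiliar with TCP_NODELAY: it disables Nagle's algorithm
so that small messages are sent immediately instead of being held back
for coalescing. The kernel client sets it on its own in-kernel sockets;
the sketch below is only a userspace analogue using setsockopt(2), and
the helper names set_tcp_nodelay() and get_tcp_nodelay() are
hypothetical, not taken from libceph.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: enable (non-zero) or disable (zero) TCP_NODELAY
 * on a TCP socket. Returns 0 on success, -1 on error with errno set.
 */
static int set_tcp_nodelay(int sock, int enable)
{
	int flag = enable ? 1 : 0;

	return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY,
			  &flag, sizeof(flag));
}

/*
 * Hypothetical helper: read the option back. Returns 1 if TCP_NODELAY
 * is set, 0 if it is clear, -1 on error.
 */
static int get_tcp_nodelay(int sock)
{
	int flag = 0;
	socklen_t len = sizeof(flag);

	if (getsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &flag, &len) < 0)
		return -1;
	return flag != 0;
}
```

Judging by the shortlog entry that shows a notcp_nodelay option, the
kernel client enables the behavior by default and lets it be turned
off at mount time.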
[GIT PULL] Ceph fixes for -rc7
Hi Linus,

Please pull the following two patches from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

These patches from Ilya finally squash a race condition with layered
images that he's been chasing for a while.

Thanks!
sage


Ilya Dryomov (2):
      rbd: fix rbd_dev_parent_get() when parent_overlap == 0
      rbd: drop parent_ref in rbd_dev_unprobe() unconditionally

 drivers/block/rbd.c | 25 +++--
 1 file changed, 7 insertions(+), 18 deletions(-)
[GIT PULL] Ceph fixes for -rc4
Hi Linus,

Please pull the following Ceph fixes from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client for-linus

These are both pretty trivial: a sparse warning fix and size_t printk
thing.

Thanks!
sage


Ilya Dryomov (2):
      ceph: use %zu for len in ceph_fill_inline_data()
      libceph: fix sparse endianness warnings

 fs/ceph/addr.c                  | 2 +-
 include/linux/ceph/osd_client.h | 4 ++--
 net/ceph/auth_x.c               | 2 +-
 net/ceph/mon_client.c           | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)
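The "size_t printk thing" refers to the usual rule that a size_t
argument takes the %zu conversion, because size_t has a different
width on 32-bit and 64-bit builds and specifiers such as %lu or %u
match only one of them. The sketch below shows the same rule with
userspace snprintf(); format_len() is a hypothetical helper, not code
from ceph_fill_inline_data().

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a length for a log line. %zu matches
 * size_t on every architecture, so no cast is needed and the compiler
 * emits no format warning on either 32-bit or 64-bit builds.
 *
 * Returns the number of characters that the full string needs, as
 * snprintf() does.
 */
static int format_len(char *out, size_t outsz, size_t len)
{
	return snprintf(out, outsz, "len %zu", len);
}
```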
[GIT PULL] Ceph updates for 3.19-rc1
Hi Linus,

Please pull the following Ceph updates from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

The big item here is support for inline data for CephFS and for message
signatures from Zheng. There are also several bug fixes, including
interrupted flock request handling, 0-length xattrs, mksnap, cached
readdir results, and a message version compat field. Finally there are
several cleanups from Ilya, Dan, and Markus.

Note that there is another series coming soon that fixes some bugs in
the RBD 'lingering' requests, but it isn't quite ready yet.

Thanks!
sage


Dan Carpenter (1):
      ceph: do_sync is never initialized

Ilya Dryomov (4):
      libceph: nuke ceph_kvfree()
      ceph: remove unused stringification macros
      rbd: don't treat CEPH_OSD_OP_DELETE as extent op
      libceph: fixup includes in pagelist.h

John Spray (2):
      libceph: update ceph_msg_header structure
      ceph: message versioning fixes

SF Markus Elfring (1):
      ceph, rbd: delete unnecessary checks before two function calls

Yan, Zheng (19):
      ceph: fix file lock interruption
      ceph: introduce a new inode flag indicating if cached dentries are ordered
      libceph: store session key in cephx authorizer
      libceph: message signature support
      ceph: introduce global empty snap context
      libceph: require cephx message signature by default
      libceph: add SETXATTR/CMPXATTR osd operations support
      libceph: add CREATE osd operation support
      libceph: specify position of extent operation
      ceph: parse inline data in MClientReply and MClientCaps
      ceph: add inline data to pagecache
      ceph: use getattr request to fetch inline data
      ceph: fetch inline data when getting Fcr cap refs
      ceph: sync read inline data
      ceph: convert inline data to normal data before data write
      ceph: flush inline version
      ceph: support inline data feature
      ceph: fix mksnap crash
      ceph: fix setting empty extended attribute

 drivers/block/rbd.c                |  11 +-
 fs/ceph/addr.c                     | 273 +++--
 fs/ceph/caps.c                     | 132 ++
 fs/ceph/dir.c                      |  27 ++--
 fs/ceph/file.c                     |  97 +++--
 fs/ceph/inode.c                    |  59 ++--
 fs/ceph/locks.c                    |  64 +++--
 fs/ceph/mds_client.c               |  41 +-
 fs/ceph/mds_client.h               |  10 ++
 fs/ceph/snap.c                     |  37 -
 fs/ceph/super.c                    |  16 ++-
 fs/ceph/super.h                    |  55 ++--
 fs/ceph/super.h.rej                |  10 ++
 fs/ceph/xattr.c                    |   7 +-
 include/linux/ceph/auth.h          |  26
 include/linux/ceph/buffer.h        |   3 +-
 include/linux/ceph/ceph_features.h |   1 +
 include/linux/ceph/ceph_fs.h       |  10 +-
 include/linux/ceph/libceph.h       |   2 +-
 include/linux/ceph/messenger.h     |   9 +-
 include/linux/ceph/msgr.h          |  11 +-
 include/linux/ceph/osd_client.h    |  13 +-
 include/linux/ceph/pagelist.h      |   4 +-
 net/ceph/auth_x.c                  |  76 ++-
 net/ceph/auth_x.h                  |   1 +
 net/ceph/buffer.c                  |   4 +-
 net/ceph/ceph_common.c             |  21 +--
 net/ceph/messenger.c               |  34 -
 net/ceph/osd_client.c              | 118
 29 files changed, 992 insertions(+), 180 deletions(-)
 create mode 100644 fs/ceph/super.h.rej
[GIT PULL] Ceph fixes for -rc5
Hi Linus,

Please pull the following Ceph fixes from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

There is an overflow bug fix for cephfs from Zheng, a fix for handling
large authentication ticket buffers in libceph from Ilya, and a few fixes
for the request handling code from Ilya that affect RBD volumes.

Thanks!
sage

Ilya Dryomov (4):
      libceph: do not crash on large auth tickets
      libceph: unlink from o_linger_requests when clearing r_osd
      libceph: clear r_req_lru_item in __unregister_linger_request()
      libceph: change from BUG to WARN for __remove_osd() asserts

Yan, Zheng (1):
      ceph: fix flush tid comparision

 fs/ceph/caps.c        |   2 +-
 net/ceph/crypto.c     | 169 ++---
 net/ceph/osd_client.c |   7 +-
 3 files changed, 138 insertions(+), 40 deletions(-)
Re: [PATCH v5 7/7] fs: add a flag for per-operation O_DSYNC semantics
On Wed, 5 Nov 2014, Milosz Tanski wrote:
> From: Christoph Hellwig
>
> With the new read/write with flags syscalls we can support a flag
> to enable O_DSYNC semantics on a per-operation basis.  This is
> useful to implement protocols like SMB, NFS or SCSI that have such
> per-operation flags.
>
> Example program below:
>
> cat > pwritev2.c << EOF
>
> #define LO_HI_LONG(val) \
>   (off_t) val, \
>   (off_t) ((((uint64_t) (val)) >> (sizeof (long) * 4)) >> (sizeof (long) * 4))
>
> static ssize_t
> pwritev2(int fd, const struct iovec *iov, int iovcnt, off_t offset, int flags)
> {
>         return syscall(__NR_pwritev2, fd, iov, iovcnt, LO_HI_LONG(offset),
>                        flags);
> }
>
> int main(int argc, char **argv)
> {
>         int fd = open(argv[1], O_WRONLY|O_CREAT|O_TRUNC, 0666);
>         char buf[1024];
>         struct iovec iov = { .iov_base = buf, .iov_len = 1024 };
>         int ret;
>
>         if (fd < 0) {
>                 perror("open");
>                 return 0;
>         }
>
>         memset(buf, 0xfe, sizeof(buf));
>
>         ret = pwritev2(fd, &iov, 1, 0, RWF_DSYNC);
>         if (ret < 0)
>                 perror("pwritev2");
>         else
>                 printf("ret = %d\n", ret);
>
>         return 0;
> }
> EOF
>
> Signed-off-by: Christoph Hellwig
> [mil...@adfin.com: added flag check to compat_do_readv_writev()]
> Signed-off-by: Milosz Tanski

Ceph bits

Acked-by: Sage Weil

> ---
>  fs/ceph/file.c     |  4 +++-
>  fs/fuse/file.c     |  2 ++
>  fs/nfs/file.c      | 10 ++
>  fs/ocfs2/file.c    |  6 --
>  fs/read_write.c    | 20 +++-
>  include/linux/fs.h |  3 ++-
>  mm/filemap.c       |  4 +++-
>  7 files changed, 35 insertions(+), 14 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index b798b5c..2d4e15a 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -983,7 +983,9 @@ retry_snap:
>          ceph_put_cap_refs(ci, got);
>
>          if (written >= 0 &&
> -            ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host) ||
> +            ((file->f_flags & O_SYNC) ||
> +             IS_SYNC(file->f_mapping->host) ||
> +             (iocb->ki_rwflags & RWF_DSYNC) ||
>               ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
>                  err = vfs_fsync_range(file, pos, pos + written - 1, 1);
>                  if (err < 0)
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index caa8d95..bb4fb23 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1248,6 +1248,8 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>                  written += written_buffered;
>                  iocb->ki_pos = pos + written_buffered;
>          } else {
> +                if (iocb->ki_rwflags & RWF_DSYNC)
> +                        return -EINVAL;
>                  written = fuse_perform_write(file, mapping, from, pos);
>                  if (written >= 0)
>                          iocb->ki_pos = pos + written;
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index aa9046f..c59b0b7 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -652,13 +652,15 @@ static const struct vm_operations_struct nfs_file_vm_ops = {
>          .remap_pages = generic_file_remap_pages,
>  };
>
> -static int nfs_need_sync_write(struct file *filp, struct inode *inode)
> +static int nfs_need_sync_write(struct kiocb *iocb, struct inode *inode)
>  {
>          struct nfs_open_context *ctx;
>
> -        if (IS_SYNC(inode) || (filp->f_flags & O_DSYNC))
> +        if (IS_SYNC(inode) ||
> +            (iocb->ki_filp->f_flags & O_DSYNC) ||
> +            (iocb->ki_rwflags & RWF_DSYNC))
>                  return 1;
> -        ctx = nfs_file_open_context(filp);
> +        ctx = nfs_file_open_context(iocb->ki_filp);
>          if (test_bit(NFS_CONTEXT_ERROR_WRITE, &ctx->flags) ||
>              nfs_ctx_key_to_expire(ctx))
>                  return 1;
> @@ -705,7 +707,7 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
>          written = result;
>
>          /* Return error values for O_DSYNC and IS_SYNC() */
> -        if (result >= 0 && nfs_need_sync_write(file, inode)) {
> +        if (result >= 0 && nfs_need_sync_write(iocb, inode)) {
>                  int err = vfs_fsync(file, 0);
>                  if (err < 0)
>                          result = err;
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index bb66ca4..8f9a86b 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -23
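For readers who want to try the per-operation O_DSYNC idea from user space today, here is a minimal sketch. It assumes a C library that ships a pwritev2() wrapper and defines RWF_DSYNC in <sys/uio.h> (glibc 2.26 or later, as far as I know), rather than the raw syscall() wrapper from the quoted example. The helper names write_durable() and save_record() are made up for illustration, and the fallback to pwritev() plus fdatasync() is my own choice, not something from the patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes at offset and have the data on
 * stable storage before returning. Tries a per-call RWF_DSYNC first and
 * falls back to pwritev() + fdatasync() where the flag or the syscall
 * is unavailable. Returns the byte count, or -1 with errno set.
 */
static ssize_t write_durable(int fd, const void *buf, size_t len, off_t offset)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	ssize_t ret;

	assert(buf != NULL);

#ifdef RWF_DSYNC
	ret = pwritev2(fd, &iov, 1, offset, RWF_DSYNC);
	/* Only fall through when the flag itself was rejected. */
	if (ret >= 0 ||
	    (errno != ENOSYS && errno != EOPNOTSUPP && errno != EINVAL))
		return ret;
#endif
	ret = pwritev(fd, &iov, 1, offset);
	if (ret >= 0 && fdatasync(fd) < 0)
		return -1;
	return ret;
}

/* Usage: create path and durably write msg at offset 0; 0 on success. */
static int save_record(const char *path, const char *msg)
{
	size_t len = strlen(msg);
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write_durable(fd, msg, len, 0);
	close(fd);
	return n == (ssize_t)len ? 0 : -1;
}
```

The point of the per-call flag is that the file descriptor does not have to be opened with O_DSYNC, so only the writes that need durability pay for it.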
Re: [PATCH v5 4/7] vfs: RWF_NONBLOCK flag for preadv2
On Wed, 5 Nov 2014, Milosz Tanski wrote:
> generic_file_read_iter() supports a new flag RWF_NONBLOCK which says that we
> only want to read the data if it's already in the page cache.
>
> Additionally, there are a few filesystems that we have to specifically
> bail early if RWF_NONBLOCK because the op would block. Christoph Hellwig
> contributed this code.
>
> Signed-off-by: Milosz Tanski
> Reviewed-by: Christoph Hellwig
> Reviewed-by: Jeff Moyer

Ceph bits

Acked-by: Sage Weil

> ---
>  fs/ceph/file.c     |  2 ++
>  fs/cifs/file.c     |  6 ++
>  fs/nfs/file.c      |  5 -
>  fs/ocfs2/file.c    |  6 ++
>  fs/pipe.c          |  3 ++-
>  fs/read_write.c    | 38 +-
>  fs/xfs/xfs_file.c  |  4
>  include/linux/fs.h |  3 +++
>  mm/filemap.c       | 18 ++
>  mm/shmem.c         |  4
>  10 files changed, 74 insertions(+), 15 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index d7e0da8..b798b5c 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -822,6 +822,8 @@ again:
>          if ((got & (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) == 0 ||
>              (iocb->ki_filp->f_flags & O_DIRECT) ||
>              (fi->flags & CEPH_F_SYNC)) {
> +                if (iocb->ki_rwflags & O_NONBLOCK)
> +                        return -EAGAIN;
>
>                  dout("aio_sync_read %p %llx.%llx %llu~%u got cap refs on %s\n",
>                       inode, ceph_vinop(inode), iocb->ki_pos, (unsigned)len,
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 3e4d00a..c485afa 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -3005,6 +3005,9 @@ ssize_t cifs_user_readv(struct kiocb *iocb, struct iov_iter *to)
>          struct cifs_readdata *rdata, *tmp;
>          struct list_head rdata_list;
>
> +        if (iocb->ki_rwflags & RWF_NONBLOCK)
> +                return -EAGAIN;
> +
>          len = iov_iter_count(to);
>          if (!len)
>                  return 0;
> @@ -3123,6 +3126,9 @@ cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to)
>              ((cifs_sb->mnt_cifs_flags & CIFS_MOUNT_NOPOSIXBRL) == 0))
>                  return generic_file_read_iter(iocb, to);
>
> +        if (iocb->ki_rwflags & RWF_NONBLOCK)
> +                return -EAGAIN;
> +
>          /*
>           * We need to hold the sem to be sure nobody modifies lock list
>           * with a brlock that prevents reading.
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index 2ab6f00..aa9046f 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -171,8 +171,11 @@ nfs_file_read(struct kiocb *iocb, struct iov_iter *to)
>          struct inode *inode = file_inode(iocb->ki_filp);
>          ssize_t result;
>
> -        if (iocb->ki_filp->f_flags & O_DIRECT)
> +        if (iocb->ki_filp->f_flags & O_DIRECT) {
> +                if (iocb->ki_rwflags & O_NONBLOCK)
> +                        return -EAGAIN;
>                  return nfs_file_direct_read(iocb, to, iocb->ki_pos);
> +        }
>
>          dprintk("NFS: read(%pD2, %zu@%lu)\n",
>                  iocb->ki_filp,
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index 324dc93..bb66ca4 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -2472,6 +2472,12 @@ static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
>                          filp->f_path.dentry->d_name.name,
>                          to->nr_segs);        /* GR */
>
> +        /*
> +         * No non-blocking reads for ocfs2 for now.  Might be doable with
> +         * non-blocking cluster lock helpers.
> +         */
> +        if (iocb->ki_rwflags & RWF_NONBLOCK)
> +                return -EAGAIN;
>
>          if (!inode) {
>                  ret = -EINVAL;
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 21981e5..212bf68 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -302,7 +302,8 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
>                           */
>                          if (ret)
>                                  break;
> -                        if (filp->f_flags & O_NONBLOCK) {
> +                        if ((filp->f_flags & O_NONBLOCK) ||
> +                            (iocb->ki_rwflags & RWF_NONBLOCK)) {
>                                  ret = -EAGAIN;
>                                  break;
>                          }
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 907735c..cba7d4c 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -835,14 +835,19 @@ static ssize_t do_readv_writev(int type, struct file *file,
>                  file_start_write(file);
>          }
>
> -        if (iter_fn)
> +        i
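The "read only if it is already in the page cache" behaviour described above can be sketched from user space as follows. Note the assumption: the flag in this series is called RWF_NONBLOCK, whereas the sketch uses RWF_NOWAIT, which is the name I believe mainline kernels and glibc expose through preadv2(). The helper names read_if_cached() and read_fast_or_slow() are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: read up to len bytes at offset without waiting
 * for I/O. Returns the byte count, -EAGAIN when the data is not
 * immediately available, or another negative errno (for example
 * -EOPNOTSUPP where the file system does not implement the flag).
 */
static ssize_t read_if_cached(int fd, void *buf, size_t len, off_t offset)
{
	assert(buf != NULL);

#ifdef RWF_NOWAIT
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t ret = preadv2(fd, &iov, 1, offset, RWF_NOWAIT);

	return ret < 0 ? -errno : ret;
#else
	(void)fd; (void)len; (void)offset;
	return -EOPNOTSUPP;	/* headers too old to know the flag */
#endif
}

/*
 * Usage: try the cached fast path first, then fall back to an ordinary
 * blocking pread(). A real server would more likely hand the slow path
 * to a worker thread than block in place.
 */
static ssize_t read_fast_or_slow(int fd, void *buf, size_t len, off_t offset)
{
	ssize_t n = read_if_cached(fd, buf, len, offset);

	if (n >= 0)
		return n;
	return pread(fd, buf, len, offset);
}
```

The early "return -EAGAIN" hunks in the quoted patch play the same role inside the kernel: file systems that cannot answer from cache refuse immediately, and the caller decides where the blocking read should happen.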
[GIT PULL] Ceph fixes for 3.18
Hi Linus,

Please pull the following fixes for RBD from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

There is a GFP flag fix from Mike Christie, an error code fix from Jan,
and fixes for two unnecessary allocations (kmalloc and workqueue) from
Ilya.  All are well tested.

Ilya has one other fix on the way but it didn't get tested in time.

Thanks!
sage

Ilya Dryomov (2):
      rbd: use a single workqueue for all devices
      libceph: eliminate unnecessary allocation in process_one_ticket()

Jan Kara (1):
      rbd: Fix error recovery in rbd_obj_read_sync()

Mike Christie (1):
      libceph: use memalloc flags for net IO

 drivers/block/rbd.c  | 35 +++
 net/ceph/auth_x.c    | 25 ++---
 net/ceph/messenger.c | 10 +-
 3 files changed, 38 insertions(+), 32 deletions(-)