Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes
Hi, On Sun, Mar 19, 2017 at 07:04:13PM +0100, Jan Kara wrote: > How come? In current kernel the call looks like: > > vfs_mknod(d_inode(path.dentry), dentry, mode, dev->devt); > > so the path is available there... I've actually quickly checked all > vfs_mknod() callers and they all seem to have path available. terribly sorry, must have misremembered something. Been staring at the code long into the night. You are quite right. But it is an internal mount so userspace never gets the notifications. The same goes for the cloned upper mount in overlayfs. This might be considered ok, because the change is semantically "internal" and does not originate through any userspace-visible mountpoint. Superblock watches would solve this case. Otherwise it seems feasible to pass a path to all VFS functions. Filip
Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes
Hi, On Sun, Mar 19, 2017 at 07:04:13PM +0100, Jan Kara wrote: > How come? In current kernel the call looks like: > > vfs_mknod(d_inode(path.dentry), dentry, mode, dev->devt); > > so the path is available there... I've actually quickly checked all > vfs_mknod() callers and they all seem to have path available. terribly sorry, must have misremembered something. Been staring at the code long into the night. You are quite right. But it is an internal mount so userspace never gets the notifications. The same goes for the cloned upper mount in overlayfs. This might be considered ok, because the change is semantically "internal" and does not originate through any userspace-visible mountpoint. Superblock watches would solve this case. Otherwise it seems feasible to pass a path to all VFS functions. Filip
Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes
On Sun, Mar 19, 2017 at 11:19:43AM +0100, Jan Kara wrote: > However if you can really call fsnotify hooks with 'path' available in all > the places, it should be equally hard to just pass 'path' to > vfs_(create|mkdir|...) and that way we don't have to sprinkle fsnotify > calls into several call sites but keep them local to vfs_(create|mkdir|...) > helpers. Hmm? the problem is: not absolutely all. One illuminating example is the use of vfs_mknod in devtmpfs. There a struct path is not only unavailable but makes not semantic sense: the changes do not go thru any mountpoint. And in general I think there will be situations where you would need to call VFS functions without paths. Thus I suggested either (a) wrapping the VFS functions with path variants, or (b) giving them an optional vfsmount argument that can be set to NULL when it does not make sense Filip
Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes
On Sun, Mar 19, 2017 at 11:19:43AM +0100, Jan Kara wrote: > However if you can really call fsnotify hooks with 'path' available in all > the places, it should be equally hard to just pass 'path' to > vfs_(create|mkdir|...) and that way we don't have to sprinkle fsnotify > calls into several call sites but keep them local to vfs_(create|mkdir|...) > helpers. Hmm? the problem is: not absolutely all. One illuminating example is the use of vfs_mknod in devtmpfs. There a struct path is not only unavailable but makes not semantic sense: the changes do not go thru any mountpoint. And in general I think there will be situations where you would need to call VFS functions without paths. Thus I suggested either (a) wrapping the VFS functions with path variants, or (b) giving them an optional vfsmount argument that can be set to NULL when it does not make sense Filip
Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes
Hi, On Tue, Mar 14, 2017 at 01:18:01PM +0200, Amir Goldstein wrote: > I claim that fanotify filters event by mount not because it > was a requirement, but because it was an implementation challenge > to do otherwise. > > And I claim that what mount watchers are really interested in is > "all the changes that happen in the file system in the area > that is visible to me through this mount point". > > In other words, an indexer needs to know if files were modified\ > create/deleted if that indexer sits in container host namespace > regardless if those files were modified from within a container > namespace. > > It's not a matter of security/isolation. It's a matter of functionality. > I agree that for some event (e.g. permission events) it is possible > to argue both ways (i.e. that the namespace context should be used > as a filter for events). > But for the new proposed events (FS_MODIFY_DIR), I really don't > see the point in isolation by mount/namespace. there are basically two classes of uses for a fantotify-like interface: (1) Keeping an up-to-date representation of the file system. For this, superblock watches are clearly what you want. * You are interested to know the current state of the filesystem so you need to know about every change, regardless of where it came from. * As I mentioned earlier, in case of remote, ditributed and virtual filesystems, the change might come from within the filesystem itself (if the protocol supports reporting such changes). This can probably be implemented only with superblock-scoped watches because the change is fundamentally not related to any mount. * Some filesystems might also support change journalling and it might be concievable to extend the API in the future to report "past" events (for example by passing sequence number of last seen event or similar). * The argument about containers escaping change notification you mentioned earlier. All those factors speak greatly in favour of superblock watches. (2) Tracking filesystem *activity*. Now you are not building an image of current filesystem state but rather a log of what happened. Perhaps you are also interested in who (user/process/...) did what. Permission events also fit mostly in this category. For those it *might* make sense to have mount-scoped watches, for example if you want to monitor only one container or a subset of processes. We both concentrate on the first but we shouldn't forget about the second, which was one of the original motivations for fanotify. Thus I conclude that it might be desirable to implement mount-scoped filename events in the long run. Even though I agree that the sb-scoped events are more important because they cover more use cases and you can do additional filtering (e.g. by pid) if deemed necessary. This would require: (a) Sprinkling the callers of vfs_* with fanotify calls as I did, or (b) Creating wrapper functions like vfs_path_unlink & co. that would make the necessary fanotify call (and probably tell the lower function not to generate another notification), as I suggested earlier. (c) Give the vfs_* functions an *optional* vfsmount argument. In the end I probably find (c) the most elegant but this can be discussed later, even after your changes are merged. Filip
Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes
Hi, On Tue, Mar 14, 2017 at 01:18:01PM +0200, Amir Goldstein wrote: > I claim that fanotify filters event by mount not because it > was a requirement, but because it was an implementation challenge > to do otherwise. > > And I claim that what mount watchers are really interested in is > "all the changes that happen in the file system in the area > that is visible to me through this mount point". > > In other words, an indexer needs to know if files were modified\ > create/deleted if that indexer sits in container host namespace > regardless if those files were modified from within a container > namespace. > > It's not a matter of security/isolation. It's a matter of functionality. > I agree that for some event (e.g. permission events) it is possible > to argue both ways (i.e. that the namespace context should be used > as a filter for events). > But for the new proposed events (FS_MODIFY_DIR), I really don't > see the point in isolation by mount/namespace. there are basically two classes of uses for a fantotify-like interface: (1) Keeping an up-to-date representation of the file system. For this, superblock watches are clearly what you want. * You are interested to know the current state of the filesystem so you need to know about every change, regardless of where it came from. * As I mentioned earlier, in case of remote, ditributed and virtual filesystems, the change might come from within the filesystem itself (if the protocol supports reporting such changes). This can probably be implemented only with superblock-scoped watches because the change is fundamentally not related to any mount. * Some filesystems might also support change journalling and it might be concievable to extend the API in the future to report "past" events (for example by passing sequence number of last seen event or similar). * The argument about containers escaping change notification you mentioned earlier. All those factors speak greatly in favour of superblock watches. (2) Tracking filesystem *activity*. Now you are not building an image of current filesystem state but rather a log of what happened. Perhaps you are also interested in who (user/process/...) did what. Permission events also fit mostly in this category. For those it *might* make sense to have mount-scoped watches, for example if you want to monitor only one container or a subset of processes. We both concentrate on the first but we shouldn't forget about the second, which was one of the original motivations for fanotify. Thus I conclude that it might be desirable to implement mount-scoped filename events in the long run. Even though I agree that the sb-scoped events are more important because they cover more use cases and you can do additional filtering (e.g. by pid) if deemed necessary. This would require: (a) Sprinkling the callers of vfs_* with fanotify calls as I did, or (b) Creating wrapper functions like vfs_path_unlink & co. that would make the necessary fanotify call (and probably tell the lower function not to generate another notification), as I suggested earlier. (c) Give the vfs_* functions an *optional* vfsmount argument. In the end I probably find (c) the most elegant but this can be discussed later, even after your changes are merged. Filip
Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR
Hi, On Tue, Mar 14, 2017 at 03:55:19PM +0200, Amir Goldstein wrote: > Please let me know if that is sufficient for your needs > or if you need me to prepare a version that delivers filename events > without filename info, therefore allowing to merge filename events. > > Sounds to me like if you get an event FAN_DELETE, "aaa", > your implementation can update the db directly without having > to scan the directory, so it should be useful. > For your consideration. both approaches might be useful and there are tradeoffs. Direct updates save us from rescanning but give more events and more context switches. On the other hand, with an unlimited queue and well-mergable events I could e.g. sleep for five seconds each time after emptying the queue, thus giving the events more potential to merge and reducing context switches. But one risks blowing up the queue. However, I have some ideas. Basically we could implement some kind of "bulk read" mode for fanotify that would pass events to userspace only when (a) a given event count thresold is hit (e.g. half the queue limit if queue is limited), or (b) the oldest event is older than a specified maximum latency. That might vary from 1 second to 5 minutes depending on specific application needs (e.g. background backup does not need to be low-latency). This would greatly reduce extra context switches when watching a large portion (or whole) of the file system. All of this presumably could be implemented on top of your suggested patches. Either way, I suggest that implementing the nameless filename events is a good idea. I do not know whether they will be the best choice for my application but they probably will be useful for some applications. Filip
Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR
Hi, On Tue, Mar 14, 2017 at 03:55:19PM +0200, Amir Goldstein wrote: > Please let me know if that is sufficient for your needs > or if you need me to prepare a version that delivers filename events > without filename info, therefore allowing to merge filename events. > > Sounds to me like if you get an event FAN_DELETE, "aaa", > your implementation can update the db directly without having > to scan the directory, so it should be useful. > For your consideration. both approaches might be useful and there are tradeoffs. Direct updates save us from rescanning but give more events and more context switches. On the other hand, with an unlimited queue and well-mergable events I could e.g. sleep for five seconds each time after emptying the queue, thus giving the events more potential to merge and reducing context switches. But one risks blowing up the queue. However, I have some ideas. Basically we could implement some kind of "bulk read" mode for fanotify that would pass events to userspace only when (a) a given event count thresold is hit (e.g. half the queue limit if queue is limited), or (b) the oldest event is older than a specified maximum latency. That might vary from 1 second to 5 minutes depending on specific application needs (e.g. background backup does not need to be low-latency). This would greatly reduce extra context switches when watching a large portion (or whole) of the file system. All of this presumably could be implemented on top of your suggested patches. Either way, I suggest that implementing the nameless filename events is a good idea. I do not know whether they will be the best choice for my application but they probably will be useful for some applications. Filip
Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR
Hi, On Tue, Mar 14, 2017 at 12:40:56PM +0200, Amir Goldstein wrote: > An I am very happy that you used filehandles to keep track of objects > in your example, because it fits my proposal really well. you can read more about the rationales in the WIP draft of my thesis (especially the chapter Identifying Inodes and its surroundings): http://regnarg.cz/tmp/thesis.pdf https://github.com/regnarg/bc Either way, one can never use pathnames as identifiers because in a world with parallel renames on multiple levels of the hierarchy (between which no ordering is guaranteed unless they are cross-dir), pathnames are not even a well-defined concept. NB: I used a simple script for testing that does precisely this (a lot of parallel within-directory renames at many levels): https://github.com/regnarg/bc/blob/master/experiments/fanotify/parallel_renamist.sh It seems like a good stress test for any application that tries to do reliable filesystem monitoring. > See, if you used my proposed API, you would have had > > fan_fd = CHK(fanotify_init(FAN_UNLIMITED_QUEUE| \ > FAN_EVENT_INFO_PARENT | FAN_EVENT_INFO_FH, > O_RDONLY)); Sure, I'll definitely try to implement your interface as an optional backend and mention it in my thesis and definitely give it some heavy testing. > Furthermore, my proposal records the filehandle at the time of the event > and therefore, does not have to keep an elevated refcount of the object > in the events queue. That sounds good but from what I've understood, not all filesystems support handles. Is this a concern? (Maybe the right answer is to extend handle support...) Filip
Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR
Hi, On Tue, Mar 14, 2017 at 12:40:56PM +0200, Amir Goldstein wrote: > An I am very happy that you used filehandles to keep track of objects > in your example, because it fits my proposal really well. you can read more about the rationales in the WIP draft of my thesis (especially the chapter Identifying Inodes and its surroundings): http://regnarg.cz/tmp/thesis.pdf https://github.com/regnarg/bc Either way, one can never use pathnames as identifiers because in a world with parallel renames on multiple levels of the hierarchy (between which no ordering is guaranteed unless they are cross-dir), pathnames are not even a well-defined concept. NB: I used a simple script for testing that does precisely this (a lot of parallel within-directory renames at many levels): https://github.com/regnarg/bc/blob/master/experiments/fanotify/parallel_renamist.sh It seems like a good stress test for any application that tries to do reliable filesystem monitoring. > See, if you used my proposed API, you would have had > > fan_fd = CHK(fanotify_init(FAN_UNLIMITED_QUEUE| \ > FAN_EVENT_INFO_PARENT | FAN_EVENT_INFO_FH, > O_RDONLY)); Sure, I'll definitely try to implement your interface as an optional backend and mention it in my thesis and definitely give it some heavy testing. > Furthermore, my proposal records the filehandle at the time of the event > and therefore, does not have to keep an elevated refcount of the object > in the events queue. That sounds good but from what I've understood, not all filesystems support handles. Is this a concern? (Maybe the right answer is to extend handle support...) Filip
Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR
Hi, On Tue, Mar 14, 2017 at 12:11:40PM +0200, Amir Goldstein wrote: > > - file system indexers / desktop search tools > > - file synchronization tools (like Dropbox, Nextcloud, etc.), > > online backup tools > > This last one is the use case of my employer, Ctera Networks. > Out of curiosity, what is the use case that you are focusing on? I'm working on a file synchronization tool as part of my bachelor thesis at Charles University. When I started (now over a year ago ... long story), there were AFIK no attempts at solving the recursive inotify issue, only a lot of complaints, so I cobbled up together something simple (I'm not a kernel developer by trade, this was my first patch) that would allow me to work on the userspace parts, which are the main bulk. I try to focus on algorithmic and implementation efficiency as opposed to fancy GUIs and similar. I want it to be fast with 100k directories and 10M files in home dir. But it's very WIP so we'll see how that turns out. > I had the feeling that all recursive inotify users are hiding in the shadows, > but was missing more concrete evidence. Yes, even Dropbox uses inotify. They have articles in their help on how to increase inotify.max_user_watches: https://www.dropbox.com/help/145. That's not good PR. ;-) (I'm not affiliated with DB, just pointing that out.) > About the argument of not having to change in-kernel framework, > I don't think it should be a consideration at all. Understood. I tried to stay conservative and non-controversial because I imagined that radical framework changes would take months of discussions (look at the history of statx) and this issue seems to be pressing for quite a lot of people. But rushing is of course not the best strategy either, there are always compromises. > If you don't specify FAN_EVENT_INFO_NAME, you can get filename events > FAN_MOVE|FAN_CREATE|FAN_DELETE without the name. > > What you do get is the file descriptor of the parent. sounds familiar? ;-) I didn't notice this bit. That sounds like a win-win. Filip
Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR
Hi, On Tue, Mar 14, 2017 at 12:11:40PM +0200, Amir Goldstein wrote: > > - file system indexers / desktop search tools > > - file synchronization tools (like Dropbox, Nextcloud, etc.), > > online backup tools > > This last one is the use case of my employer, Ctera Networks. > Out of curiosity, what is the use case that you are focusing on? I'm working on a file synchronization tool as part of my bachelor thesis at Charles University. When I started (now over a year ago ... long story), there were AFIK no attempts at solving the recursive inotify issue, only a lot of complaints, so I cobbled up together something simple (I'm not a kernel developer by trade, this was my first patch) that would allow me to work on the userspace parts, which are the main bulk. I try to focus on algorithmic and implementation efficiency as opposed to fancy GUIs and similar. I want it to be fast with 100k directories and 10M files in home dir. But it's very WIP so we'll see how that turns out. > I had the feeling that all recursive inotify users are hiding in the shadows, > but was missing more concrete evidence. Yes, even Dropbox uses inotify. They have articles in their help on how to increase inotify.max_user_watches: https://www.dropbox.com/help/145. That's not good PR. ;-) (I'm not affiliated with DB, just pointing that out.) > About the argument of not having to change in-kernel framework, > I don't think it should be a consideration at all. Understood. I tried to stay conservative and non-controversial because I imagined that radical framework changes would take months of discussions (look at the history of statx) and this issue seems to be pressing for quite a lot of people. But rushing is of course not the best strategy either, there are always compromises. > If you don't specify FAN_EVENT_INFO_NAME, you can get filename events > FAN_MOVE|FAN_CREATE|FAN_DELETE without the name. > > What you do get is the file descriptor of the parent. sounds familiar? ;-) I didn't notice this bit. That sounds like a win-win. Filip
Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR
An example userspace program that uses FAN_MODIFY_DIR to reliably keep an up-to-date internal representation of the file system. It uses some filehandle trickery to identify inodes, other heuristics could be also used. --- //#define _GNU_SOURCE #include #include #include #include #include #include #include #include #include #include #include #include #include #include using namespace std; #ifndef FAN_MODIFY_DIR #define FAN_MODIFY_DIR 0x0004 #endif // die-on-error helpers #define CHK(x) ({ __typeof__(x) r = x; if (r == -1) { perror(#x); abort(); } r; }) #define CHKN(x) ({ __typeof__(x) r = x; if (r == NULL) { perror(#x); abort(); } r; }) struct inode_info; struct dentry_info; struct inode_info { ino_t ino; mode_t mode; char handle[MAX_HANDLE_SZ]; set links; mapchildren; // for directory inodes }; struct dentry_info { struct inode_info *parent, *inode; string name; }; map inodes; int root_fd; int fan_fd; bool compare_handles(const void *h1, const void *h2) { const struct file_handle *fh1 = (const struct file_handle*) h1; const struct file_handle *fh2 = (const struct file_handle*) h2; return (fh1->handle_bytes == fh2->handle_bytes && memcmp(h1, h2, fh1->handle_bytes) == 0); } bool handle_valid(void *handle) { int check_fd = open_by_handle_at(root_fd, (struct file_handle*)handle, O_PATH); if (check_fd >= 0) { CHK(close(check_fd)); return true; } else if (errno == ESTALE) { return false; } else { perror("open_by_handle_at"); exit(1); } } // Get the path corresponding to an inode (one of its paths, in the presence of // hardlinks). void inode_path(const struct inode_info *inode, char *buf, size_t bufsiz) { list components; while (true) { if (inode->links.empty()) break; struct dentry_info *dentry = *inode->links.begin(); components.push_front(dentry->name); inode = dentry->parent; } buf[0] = '\0'; for (auto name: components) { int len = snprintf(buf, bufsiz, "/%s", name.c_str()); buf += len; bufsiz -= len; } } void delete_dentry(struct dentry_info *dentry) { assert(dentry->parent->children[dentry->name] == dentry); char path_buf[4096]; inode_path(dentry->parent, path_buf, sizeof(path_buf)); printf("unlinked %s/%s (ino %lu, parent %lu)\n", path_buf, dentry->name.c_str(), dentry->inode->ino, dentry->parent->ino); dentry->parent->children.erase(dentry->name.c_str()); dentry->inode->links.erase(dentry); // TODO: If this was the last dentry pointing to an inode, schedule removing // the inode after a timeout (we cannot remove it immediately because // the zero-link situation might occur during a rename when the source // directory has been processed but the target directory hasn't). delete dentry; } struct dentry_info *add_dentry(struct inode_info *parent, const char *name, struct inode_info *child) { struct dentry_info *dentry = new dentry_info(); dentry->parent = parent; dentry->name = name; dentry->inode = child; parent->children[name] = dentry; child->links.insert(dentry); char path_buf[4096] = "\0"; inode_path(parent, path_buf, sizeof(path_buf)); printf("linked %s/%s (ino %lu, parent %lu)\n", path_buf, name, child->ino, parent->ino); return dentry; } void delete_inode(struct inode_info *inode) { for (auto dentry: inode->links) { delete_dentry(dentry); } delete inode; } // Given a file descriptor, find the corresponding inode object in our database, // or create a new one if it does not exist. An O_PATH fd suffices. struct inode_info *find_inode(int fd) { struct stat st; CHK(fstat(fd, )); char handle[sizeof(struct file_handle) + MAX_HANDLE_SZ]; struct file_handle *fh = (struct file_handle*)handle; fh->handle_bytes = sizeof(handle); int mntid; CHK(name_to_handle_at(fd, "", (struct file_handle*)handle, , AT_EMPTY_PATH)); struct inode_info *info = inodes[st.st_ino]; if (info) { // Handles can refer to the same file despite not being equal. // If the old handle can still be opened, we can be assured // that the inode number has not been recycled. if (compare_handles(handle, info->handle) || handle_valid(info->handle)) { return info; } else { delete_inode(info); info = NULL; } } inodes[st.st_ino] = info = new inode_info(); info->ino = st.st_ino; info->mode = st.st_mode; memcpy(info->handle, handle, fh->handle_bytes); return info; } // Scan directory and update internal filesystem representation accordingly. // Closes `dirfd`. void scan(int dirfd, bool recursive) { struct inode_info *dir =
Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR
An example userspace program that uses FAN_MODIFY_DIR to reliably keep an up-to-date internal representation of the file system. It uses some filehandle trickery to identify inodes, other heuristics could be also used. --- //#define _GNU_SOURCE #include #include #include #include #include #include #include #include #include #include #include #include #include #include using namespace std; #ifndef FAN_MODIFY_DIR #define FAN_MODIFY_DIR 0x0004 #endif // die-on-error helpers #define CHK(x) ({ __typeof__(x) r = x; if (r == -1) { perror(#x); abort(); } r; }) #define CHKN(x) ({ __typeof__(x) r = x; if (r == NULL) { perror(#x); abort(); } r; }) struct inode_info; struct dentry_info; struct inode_info { ino_t ino; mode_t mode; char handle[MAX_HANDLE_SZ]; set links; map children; // for directory inodes }; struct dentry_info { struct inode_info *parent, *inode; string name; }; map inodes; int root_fd; int fan_fd; bool compare_handles(const void *h1, const void *h2) { const struct file_handle *fh1 = (const struct file_handle*) h1; const struct file_handle *fh2 = (const struct file_handle*) h2; return (fh1->handle_bytes == fh2->handle_bytes && memcmp(h1, h2, fh1->handle_bytes) == 0); } bool handle_valid(void *handle) { int check_fd = open_by_handle_at(root_fd, (struct file_handle*)handle, O_PATH); if (check_fd >= 0) { CHK(close(check_fd)); return true; } else if (errno == ESTALE) { return false; } else { perror("open_by_handle_at"); exit(1); } } // Get the path corresponding to an inode (one of its paths, in the presence of // hardlinks). void inode_path(const struct inode_info *inode, char *buf, size_t bufsiz) { list components; while (true) { if (inode->links.empty()) break; struct dentry_info *dentry = *inode->links.begin(); components.push_front(dentry->name); inode = dentry->parent; } buf[0] = '\0'; for (auto name: components) { int len = snprintf(buf, bufsiz, "/%s", name.c_str()); buf += len; bufsiz -= len; } } void delete_dentry(struct dentry_info *dentry) { assert(dentry->parent->children[dentry->name] == dentry); char path_buf[4096]; inode_path(dentry->parent, path_buf, sizeof(path_buf)); printf("unlinked %s/%s (ino %lu, parent %lu)\n", path_buf, dentry->name.c_str(), dentry->inode->ino, dentry->parent->ino); dentry->parent->children.erase(dentry->name.c_str()); dentry->inode->links.erase(dentry); // TODO: If this was the last dentry pointing to an inode, schedule removing // the inode after a timeout (we cannot remove it immediately because // the zero-link situation might occur during a rename when the source // directory has been processed but the target directory hasn't). delete dentry; } struct dentry_info *add_dentry(struct inode_info *parent, const char *name, struct inode_info *child) { struct dentry_info *dentry = new dentry_info(); dentry->parent = parent; dentry->name = name; dentry->inode = child; parent->children[name] = dentry; child->links.insert(dentry); char path_buf[4096] = "\0"; inode_path(parent, path_buf, sizeof(path_buf)); printf("linked %s/%s (ino %lu, parent %lu)\n", path_buf, name, child->ino, parent->ino); return dentry; } void delete_inode(struct inode_info *inode) { for (auto dentry: inode->links) { delete_dentry(dentry); } delete inode; } // Given a file descriptor, find the corresponding inode object in our database, // or create a new one if it does not exist. An O_PATH fd suffices. struct inode_info *find_inode(int fd) { struct stat st; CHK(fstat(fd, )); char handle[sizeof(struct file_handle) + MAX_HANDLE_SZ]; struct file_handle *fh = (struct file_handle*)handle; fh->handle_bytes = sizeof(handle); int mntid; CHK(name_to_handle_at(fd, "", (struct file_handle*)handle, , AT_EMPTY_PATH)); struct inode_info *info = inodes[st.st_ino]; if (info) { // Handles can refer to the same file despite not being equal. // If the old handle can still be opened, we can be assured // that the inode number has not been recycled. if (compare_handles(handle, info->handle) || handle_valid(info->handle)) { return info; } else { delete_inode(info); info = NULL; } } inodes[st.st_ino] = info = new inode_info(); info->ino = st.st_ino; info->mode = st.st_mode; memcpy(info->handle, handle, fh->handle_bytes); return info; } // Scan directory and update internal filesystem representation accordingly. // Closes `dirfd`. void scan(int dirfd, bool recursive) { struct inode_info *dir = find_inode(dirfd); char path_buf[4096] = "\0";
[RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes
Besause fanotify requires `struct path`, the event cannot be generated directly in `fsnotify_move` and friends because they only get the inode (and their callers, `vfs_rename` cannot supply any better info). So instead it needs to be generated higher in the call chain, i.e. in the callers of functions like `vfs_rename`. This leads to some code duplication. Currently, there are several places whence functions like `vfs_rename` or `vfs_unlink` are called: * syscall handlers (done) * NFS server (done) * stacked filesystems - ecryptfs (done) - overlayfs (Currently doesn't report even ordinary fanotify events, because it internally clones the upper mount; not sure about the rationale. One can always watch the overlay mount instead.) * few rather minor things - devtmpfs (its internal changes are not tied to any vfsmount so it cannot emit mount-scoped events) - cachefiles (done) - ipc/mqueue.c (done) - fs/nfsd/nfs4recover.c (done) - kernel/bpf/inode.c (done) net/unix/af_unix.c (done) (grep -rE '\bvfs_(rename|unlink|mknod|whiteout|create|mkdir|rmdir|symlink|link)\(') Signed-off-by: Filip Štědronský <r.l...@regnarg.cz> --- An alternative might be to create wrapper functions like vfs_path_(rename|unlink|...). They could also take care of calling security_path_(rename|unlink|...), which is currently also up to the indvidual callers (possibly with a flag because it might not be always desired). --- fs/cachefiles/namei.c | 9 +++ fs/ecryptfs/inode.c | 67 +++ fs/namei.c| 23 +- fs/nfsd/nfs4recover.c | 7 ++ fs/nfsd/vfs.c | 24 -- ipc/mqueue.c | 9 +++ kernel/bpf/inode.c| 3 +++ net/unix/af_unix.c| 2 ++ 8 files changed, 141 insertions(+), 3 deletions(-) diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c index 41df8a27d7eb..8c86699424d1 100644 --- a/fs/cachefiles/namei.c +++ b/fs/cachefiles/namei.c @@ -313,6 +313,8 @@ static int cachefiles_bury_object(struct cachefiles_cache *cache, cachefiles_io_error(cache, "Unlink security error"); } else { ret = vfs_unlink(d_inode(dir), rep, NULL); + if (ret == 0) + fsnotify_modify_dir(); if (preemptive) cachefiles_mark_object_buried(cache, rep, why); @@ -418,6 +420,10 @@ static int cachefiles_bury_object(struct cachefiles_cache *cache, if (ret != 0 && ret != -ENOMEM) cachefiles_io_error(cache, "Rename failed with error %d", ret); + if (ret == 0) { + fsnotify_modify_dir(); + fsnotify_modify_dir(_to_graveyard); + } if (preemptive) cachefiles_mark_object_buried(cache, rep, why); @@ -560,6 +566,7 @@ int cachefiles_walk_to_object(struct cachefiles_object *parent, cachefiles_hist(cachefiles_mkdir_histogram, start); if (ret < 0) goto create_error; + fsnotify_modify_dir(); ASSERT(d_backing_inode(next)); @@ -589,6 +596,7 @@ int cachefiles_walk_to_object(struct cachefiles_object *parent, cachefiles_hist(cachefiles_create_histogram, start); if (ret < 0) goto create_error; + fsnotify_modify_dir(); ASSERT(d_backing_inode(next)); @@ -779,6 +787,7 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache, ret = vfs_mkdir(d_inode(dir), subdir, 0700); if (ret < 0) goto mkdir_error; + fsnotify_modify_dir(); ASSERT(d_backing_inode(subdir)); diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c index e7413f82d27b..88a41b270bcc 100644 --- a/fs/ecryptfs/inode.c +++ b/fs/ecryptfs/inode.c @@ -29,6 +29,8 @@ #include #include #include +#include +#include #include #include #include @@ -144,16 +146,22 @@ static int ecryptfs_do_unlink(struct inode *dir, struct dentry *dentry, { struct dentry *lower_dentry = ecryptfs_dentry_to_lower(dentry); struct inode *lower_dir_inode = ecryptfs_inode_to_lower(dir); + struct vfsmount *lower_mnt = ecryptfs_dentry_to_lower_mnt(dentry); + struct path lower_dir_path = {lower_mnt, NULL}; struct dentry *lower_dir_dentry; int rc; dget(lower_dentry); lower_dir_dentry = lock_parent(lower_dentry); + lower_dir_path.dentry = lower_dir_dentry; rc = vfs_unlink(lower_dir
[RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes
Besause fanotify requires `struct path`, the event cannot be generated directly in `fsnotify_move` and friends because they only get the inode (and their callers, `vfs_rename` cannot supply any better info). So instead it needs to be generated higher in the call chain, i.e. in the callers of functions like `vfs_rename`. This leads to some code duplication. Currently, there are several places whence functions like `vfs_rename` or `vfs_unlink` are called: * syscall handlers (done) * NFS server (done) * stacked filesystems - ecryptfs (done) - overlayfs (Currently doesn't report even ordinary fanotify events, because it internally clones the upper mount; not sure about the rationale. One can always watch the overlay mount instead.) * few rather minor things - devtmpfs (its internal changes are not tied to any vfsmount so it cannot emit mount-scoped events) - cachefiles (done) - ipc/mqueue.c (done) - fs/nfsd/nfs4recover.c (done) - kernel/bpf/inode.c (done) net/unix/af_unix.c (done) (grep -rE '\bvfs_(rename|unlink|mknod|whiteout|create|mkdir|rmdir|symlink|link)\(') Signed-off-by: Filip Štědronský --- An alternative might be to create wrapper functions like vfs_path_(rename|unlink|...). They could also take care of calling security_path_(rename|unlink|...), which is currently also up to the indvidual callers (possibly with a flag because it might not be always desired). --- fs/cachefiles/namei.c | 9 +++ fs/ecryptfs/inode.c | 67 +++ fs/namei.c| 23 +- fs/nfsd/nfs4recover.c | 7 ++ fs/nfsd/vfs.c | 24 -- ipc/mqueue.c | 9 +++ kernel/bpf/inode.c| 3 +++ net/unix/af_unix.c| 2 ++ 8 files changed, 141 insertions(+), 3 deletions(-) diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c index 41df8a27d7eb..8c86699424d1 100644 --- a/fs/cachefiles/namei.c +++ b/fs/cachefiles/namei.c @@ -313,6 +313,8 @@ static int cachefiles_bury_object(struct cachefiles_cache *cache, cachefiles_io_error(cache, "Unlink security error"); } else { ret = vfs_unlink(d_inode(dir), rep, NULL); + if (ret == 0) + fsnotify_modify_dir(); if (preemptive) cachefiles_mark_object_buried(cache, rep, why); @@ -418,6 +420,10 @@ static int cachefiles_bury_object(struct cachefiles_cache *cache, if (ret != 0 && ret != -ENOMEM) cachefiles_io_error(cache, "Rename failed with error %d", ret); + if (ret == 0) { + fsnotify_modify_dir(); + fsnotify_modify_dir(_to_graveyard); + } if (preemptive) cachefiles_mark_object_buried(cache, rep, why); @@ -560,6 +566,7 @@ int cachefiles_walk_to_object(struct cachefiles_object *parent, cachefiles_hist(cachefiles_mkdir_histogram, start); if (ret < 0) goto create_error; + fsnotify_modify_dir(); ASSERT(d_backing_inode(next)); @@ -589,6 +596,7 @@ int cachefiles_walk_to_object(struct cachefiles_object *parent, cachefiles_hist(cachefiles_create_histogram, start); if (ret < 0) goto create_error; + fsnotify_modify_dir(); ASSERT(d_backing_inode(next)); @@ -779,6 +787,7 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache, ret = vfs_mkdir(d_inode(dir), subdir, 0700); if (ret < 0) goto mkdir_error; + fsnotify_modify_dir(); ASSERT(d_backing_inode(subdir)); diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c index e7413f82d27b..88a41b270bcc 100644 --- a/fs/ecryptfs/inode.c +++ b/fs/ecryptfs/inode.c @@ -29,6 +29,8 @@ #include #include #include +#include +#include #include #include #include @@ -144,16 +146,22 @@ static int ecryptfs_do_unlink(struct inode *dir, struct dentry *dentry, { struct dentry *lower_dentry = ecryptfs_dentry_to_lower(dentry); struct inode *lower_dir_inode = ecryptfs_inode_to_lower(dir); + struct vfsmount *lower_mnt = ecryptfs_dentry_to_lower_mnt(dentry); + struct path lower_dir_path = {lower_mnt, NULL}; struct dentry *lower_dir_dentry; int rc; dget(lower_dentry); lower_dir_dentry = lock_parent(lower_dentry); + lower_dir_path.dentry = lower_dir_dentry; rc = vfs_unlink(lower_dir_inode
[RFC 1/2] fanotify: new event FAN_MODIFY_DIR
Fanotify currently does not report directory modification events (rename, unlink, etc.). But these events are essential for about half of concievable fanotify use cases, especially: - file system indexers / desktop search tools - file synchronization tools (like Dropbox, Nextcloud, etc.), online backup tools and pretty much any app that needs to maintain and up-to-date internal representation of current contents of the file system. All applications of the above classes that I'm aware of currently use recursive inotify watches, which do not scale (my home dir has ~200k directories, creating all the watches takes ~2min and eats several tens of MB of kernel memory). There have been many calls for such a feature, pretty much since the creation of fanotify in 2009: * By GNOME developers: https://wiki.gnome.org/BastienNocera/KernelWishlist#VFS.2C_filesystems * By KDE developers: http://lkml.kernel.org/r/201211011352.42476.Martin%40lichtvoll.de 'Better support for (desktop) file search / indexing applications' * And others: http://lkml.kernel.org/r/AANLkTi=owK=WZW4oNtpm5WpAZhqCQUdTR2K5gzJ_MqZ+%40mail.gmail.com 'Fanotify mv/rename?' http://lkml.kernel.org/r/1238158043.23703.20.camel%40fatty 'Issues with using fanotify for a filesystem indexer' Add a new fanotify event, FAN_MODIFY_DIR, that is emitted whenever the contents of a directory change (a directory entry is added, removed or renamed). This covers all the currently missing events: rename, unlink, mknod, mkdir, rmdir, etc. Included with the event is a file descriptor to the modified directory but no further info. The userspace application is expected to rescan the whole directory and update its model accordingly. This needs to be done carefully to prevent race conditions. A cross-directory rename generates two FAN_MODIFY_DIR events. This minimalistic approach has several advantages: * Does not require changes to the fanotify userspace API or the fsnotify in-kernel framework, apart from adding a new event. Especially doesn't complicate it by adding string fields. * Has simple and clear semantics, even with multiple renames occurring in parallel etc. In case of any inconsistencies, one can simply wait for a while and rescan again. * FAN_MODIFY_DIR events are easily merged in case of multiple operations on a directory (e.g. deleting all files). Signed-off-by: Filip Štědronský <r.l...@regnarg.cz> --- An alternative might be to create wrapper functions like vfs_path_(rename|unlink|...). They could also take care of calling security_path_(rename|unlink|...), which is currently also up to the indvidual callers (possibly with a flag because it might not be always desired). An alternative was proposed by Amir Goldstein in several long series of patches that add superblock-scoped (as opposed to vfsmount-scoped) fanotify watches and specific dentry manipulation events with filenames: http://lkml.kernel.org/r/1481984434-13293-1-git-send-email-amir73il%40gmail.com http://lkml.kernel.org/r/1482247207-4424-1-git-send-email-amir73il%40gmail.com http://lkml.kernel.org/r/1476126784-12520-1-git-send-email-amir73il%40gmail.com http://lkml.kernel.org/r/1489411223-12081-1-git-send-email-amir73il%40gmail.com There is large but not complete overlap between that proposal and mine (which is was originally created over a year ago, before Amir's work, but never posted). I think the superblock watch idea is very interesting because it might in the future allow reporing fsnotify events from remote and virtual filesystems. So I'm posting this more as a potential source of more ideas for discussion, or a fallback proposal in case Amir's patches don't make it. --- fs/notify/fanotify/fanotify.c| 1 + include/linux/fsnotify.h | 17 + include/linux/fsnotify_backend.h | 1 + include/uapi/linux/fanotify.h| 5 - 4 files changed, 23 insertions(+), 1 deletion(-) diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c index bbc175d4213d..5178b06c338c 100644 --- a/fs/notify/fanotify/fanotify.c +++ b/fs/notify/fanotify/fanotify.c @@ -186,6 +186,7 @@ static int fanotify_handle_event(struct fsnotify_group *group, BUILD_BUG_ON(FAN_ACCESS != FS_ACCESS); BUILD_BUG_ON(FAN_MODIFY != FS_MODIFY); + BUILD_BUG_ON(FAN_MODIFY_DIR != FS_MODIFY_DIR); BUILD_BUG_ON(FAN_CLOSE_NOWRITE != FS_CLOSE_NOWRITE); BUILD_BUG_ON(FAN_CLOSE_WRITE != FS_CLOSE_WRITE); BUILD_BUG_ON(FAN_OPEN != FS_OPEN); diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h index b43d3f5bd9ea..00fb87c975d6 100644 --- a/include/linux/fsnotify.h +++ b/include/linux/fsnotify.h @@ -208,6 +208,23 @@ static inline void fsnotify_modify(struct file *file) } /* + * fsnotify_modifydir - directory contents were changed + * (as a result of rename, creat, unlink, etc.) + */ +static inline void fsnotify_modify_dir(struct path
[RFC 1/2] fanotify: new event FAN_MODIFY_DIR
Fanotify currently does not report directory modification events (rename, unlink, etc.). But these events are essential for about half of concievable fanotify use cases, especially: - file system indexers / desktop search tools - file synchronization tools (like Dropbox, Nextcloud, etc.), online backup tools and pretty much any app that needs to maintain and up-to-date internal representation of current contents of the file system. All applications of the above classes that I'm aware of currently use recursive inotify watches, which do not scale (my home dir has ~200k directories, creating all the watches takes ~2min and eats several tens of MB of kernel memory). There have been many calls for such a feature, pretty much since the creation of fanotify in 2009: * By GNOME developers: https://wiki.gnome.org/BastienNocera/KernelWishlist#VFS.2C_filesystems * By KDE developers: http://lkml.kernel.org/r/201211011352.42476.Martin%40lichtvoll.de 'Better support for (desktop) file search / indexing applications' * And others: http://lkml.kernel.org/r/AANLkTi=owK=WZW4oNtpm5WpAZhqCQUdTR2K5gzJ_MqZ+%40mail.gmail.com 'Fanotify mv/rename?' http://lkml.kernel.org/r/1238158043.23703.20.camel%40fatty 'Issues with using fanotify for a filesystem indexer' Add a new fanotify event, FAN_MODIFY_DIR, that is emitted whenever the contents of a directory change (a directory entry is added, removed or renamed). This covers all the currently missing events: rename, unlink, mknod, mkdir, rmdir, etc. Included with the event is a file descriptor to the modified directory but no further info. The userspace application is expected to rescan the whole directory and update its model accordingly. This needs to be done carefully to prevent race conditions. A cross-directory rename generates two FAN_MODIFY_DIR events. This minimalistic approach has several advantages: * Does not require changes to the fanotify userspace API or the fsnotify in-kernel framework, apart from adding a new event. Especially doesn't complicate it by adding string fields. * Has simple and clear semantics, even with multiple renames occurring in parallel etc. In case of any inconsistencies, one can simply wait for a while and rescan again. * FAN_MODIFY_DIR events are easily merged in case of multiple operations on a directory (e.g. deleting all files). Signed-off-by: Filip Štědronský --- An alternative might be to create wrapper functions like vfs_path_(rename|unlink|...). They could also take care of calling security_path_(rename|unlink|...), which is currently also up to the indvidual callers (possibly with a flag because it might not be always desired). An alternative was proposed by Amir Goldstein in several long series of patches that add superblock-scoped (as opposed to vfsmount-scoped) fanotify watches and specific dentry manipulation events with filenames: http://lkml.kernel.org/r/1481984434-13293-1-git-send-email-amir73il%40gmail.com http://lkml.kernel.org/r/1482247207-4424-1-git-send-email-amir73il%40gmail.com http://lkml.kernel.org/r/1476126784-12520-1-git-send-email-amir73il%40gmail.com http://lkml.kernel.org/r/1489411223-12081-1-git-send-email-amir73il%40gmail.com There is large but not complete overlap between that proposal and mine (which is was originally created over a year ago, before Amir's work, but never posted). I think the superblock watch idea is very interesting because it might in the future allow reporing fsnotify events from remote and virtual filesystems. So I'm posting this more as a potential source of more ideas for discussion, or a fallback proposal in case Amir's patches don't make it. --- fs/notify/fanotify/fanotify.c| 1 + include/linux/fsnotify.h | 17 + include/linux/fsnotify_backend.h | 1 + include/uapi/linux/fanotify.h| 5 - 4 files changed, 23 insertions(+), 1 deletion(-) diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c index bbc175d4213d..5178b06c338c 100644 --- a/fs/notify/fanotify/fanotify.c +++ b/fs/notify/fanotify/fanotify.c @@ -186,6 +186,7 @@ static int fanotify_handle_event(struct fsnotify_group *group, BUILD_BUG_ON(FAN_ACCESS != FS_ACCESS); BUILD_BUG_ON(FAN_MODIFY != FS_MODIFY); + BUILD_BUG_ON(FAN_MODIFY_DIR != FS_MODIFY_DIR); BUILD_BUG_ON(FAN_CLOSE_NOWRITE != FS_CLOSE_NOWRITE); BUILD_BUG_ON(FAN_CLOSE_WRITE != FS_CLOSE_WRITE); BUILD_BUG_ON(FAN_OPEN != FS_OPEN); diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h index b43d3f5bd9ea..00fb87c975d6 100644 --- a/include/linux/fsnotify.h +++ b/include/linux/fsnotify.h @@ -208,6 +208,23 @@ static inline void fsnotify_modify(struct file *file) } /* + * fsnotify_modifydir - directory contents were changed + * (as a result of rename, creat, unlink, etc.) + */ +static inline void fsnotify_modify_dir(struct path *path) +{ + struct inode