Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes

2017-03-20 Thread Filip Štědronský
Hi,

On Sun, Mar 19, 2017 at 07:04:13PM +0100, Jan Kara wrote:
> How come? In current kernel the call looks like:
> 
> vfs_mknod(d_inode(path.dentry), dentry, mode, dev->devt);
> 
> so the path is available there... I've actually quickly checked all
> vfs_mknod() callers and they all seem to have path available.

terribly sorry, must have misremembered something. Been staring at the
code long into the night. You are quite right.

But it is an internal mount so userspace never gets the notifications.
The same goes for the cloned upper mount in overlayfs. This might be
considered ok, because the change is semantically "internal" and does
not originate through any userspace-visible mountpoint. Superblock
watches would solve this case.

Otherwise it seems feasible to pass a path to all VFS functions.

Filip


Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes

2017-03-20 Thread Filip Štědronský
Hi,

On Sun, Mar 19, 2017 at 07:04:13PM +0100, Jan Kara wrote:
> How come? In current kernel the call looks like:
> 
> vfs_mknod(d_inode(path.dentry), dentry, mode, dev->devt);
> 
> so the path is available there... I've actually quickly checked all
> vfs_mknod() callers and they all seem to have path available.

terribly sorry, must have misremembered something. Been staring at the
code long into the night. You are quite right.

But it is an internal mount so userspace never gets the notifications.
The same goes for the cloned upper mount in overlayfs. This might be
considered ok, because the change is semantically "internal" and does
not originate through any userspace-visible mountpoint. Superblock
watches would solve this case.

Otherwise it seems feasible to pass a path to all VFS functions.

Filip


Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes

2017-03-19 Thread Filip Štědronský
On Sun, Mar 19, 2017 at 11:19:43AM +0100, Jan Kara wrote:
> However if you can really call fsnotify hooks with 'path' available in all
> the places, it should be equally hard to just pass 'path' to
> vfs_(create|mkdir|...) and that way we don't have to sprinkle fsnotify
> calls into several call sites but keep them local to vfs_(create|mkdir|...)
> helpers. Hmm?

the problem is: not absolutely all. One illuminating example is the use
of vfs_mknod in devtmpfs. There a struct path is not only unavailable
but makes not semantic sense: the changes do not go thru any mountpoint.
And in general I think there will be situations where you would need
to call VFS functions without paths.

Thus I suggested either
(a) wrapping the VFS functions with path variants, or
(b) giving them an optional vfsmount argument that can be set to NULL
when it does not make sense

Filip


Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes

2017-03-19 Thread Filip Štědronský
On Sun, Mar 19, 2017 at 11:19:43AM +0100, Jan Kara wrote:
> However if you can really call fsnotify hooks with 'path' available in all
> the places, it should be equally hard to just pass 'path' to
> vfs_(create|mkdir|...) and that way we don't have to sprinkle fsnotify
> calls into several call sites but keep them local to vfs_(create|mkdir|...)
> helpers. Hmm?

the problem is: not absolutely all. One illuminating example is the use
of vfs_mknod in devtmpfs. There a struct path is not only unavailable
but makes not semantic sense: the changes do not go thru any mountpoint.
And in general I think there will be situations where you would need
to call VFS functions without paths.

Thus I suggested either
(a) wrapping the VFS functions with path variants, or
(b) giving them an optional vfsmount argument that can be set to NULL
when it does not make sense

Filip


Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes

2017-03-14 Thread Filip Štědronský
Hi,

On Tue, Mar 14, 2017 at 01:18:01PM +0200, Amir Goldstein wrote:
> I claim that fanotify filters event by mount not because it
> was a requirement, but because it was an implementation challenge
> to do otherwise.
>
> And I claim that what mount watchers are really interested in is
> "all the changes that happen in the file system in the area
>  that is visible to me through this mount point".
>
> In other words, an indexer needs to know if files were modified\
> create/deleted if that indexer sits in container host namespace
> regardless if those files were modified from within a container
> namespace.
> 
> It's not a matter of security/isolation. It's a matter of functionality.
> I agree that for some event (e.g. permission events) it is possible
> to argue both ways (i.e. that the namespace context should be used
> as a filter for events).
> But for the new proposed events (FS_MODIFY_DIR), I really don't
> see the point in isolation by mount/namespace.

there are basically two classes of uses for a fantotify-like
interface:

(1) Keeping an up-to-date representation of the file system.
For this, superblock watches are clearly what you want.

  * You are interested to know the current state of the
filesystem so you need to know about every change, 
regardless of where it came from.
  * As I mentioned earlier, in case of remote, ditributed
and virtual filesystems, the change might come from
within the filesystem itself (if the protocol supports
reporting such changes). This can probably be
implemented only with superblock-scoped watches because
the change is fundamentally not related to any mount.
  * Some filesystems might also support change journalling
and it might be concievable to extend the API in the
future to report "past" events (for example by passing
sequence number of last seen event or similar).
  * The argument about containers escaping change notification
you mentioned earlier.

All those factors speak greatly in favour of superblock
watches.

(2) Tracking filesystem *activity*. Now you are not building
an image of current filesystem state but rather a log of
what happened. Perhaps you are also interested in who
(user/process/...) did what. Permission events also fit
mostly in this category.

For those it *might* make sense to have mount-scoped
watches, for example if you want to monitor only one
container or a subset of processes.

We both concentrate on the first but we shouldn't forget about
the second, which was one of the original motivations for
fanotify.

Thus I conclude that it might be desirable to implement
mount-scoped filename events in the long run. Even though
I agree that the sb-scoped events are more important because
they cover more use cases and you can do additional filtering
(e.g. by pid) if deemed necessary.

This would require:

(a) Sprinkling the callers of vfs_* with fanotify calls
as I did, or
(b) Creating wrapper functions like vfs_path_unlink & co.
that would make the necessary fanotify call (and probably
tell the lower function not to generate another
notification), as I suggested earlier.
(c) Give the vfs_* functions an *optional* vfsmount argument.

In the end I probably find (c) the most elegant but this
can be discussed later, even after your changes are merged.

Filip


Re: [RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes

2017-03-14 Thread Filip Štědronský
Hi,

On Tue, Mar 14, 2017 at 01:18:01PM +0200, Amir Goldstein wrote:
> I claim that fanotify filters event by mount not because it
> was a requirement, but because it was an implementation challenge
> to do otherwise.
>
> And I claim that what mount watchers are really interested in is
> "all the changes that happen in the file system in the area
>  that is visible to me through this mount point".
>
> In other words, an indexer needs to know if files were modified\
> create/deleted if that indexer sits in container host namespace
> regardless if those files were modified from within a container
> namespace.
> 
> It's not a matter of security/isolation. It's a matter of functionality.
> I agree that for some event (e.g. permission events) it is possible
> to argue both ways (i.e. that the namespace context should be used
> as a filter for events).
> But for the new proposed events (FS_MODIFY_DIR), I really don't
> see the point in isolation by mount/namespace.

there are basically two classes of uses for a fantotify-like
interface:

(1) Keeping an up-to-date representation of the file system.
For this, superblock watches are clearly what you want.

  * You are interested to know the current state of the
filesystem so you need to know about every change, 
regardless of where it came from.
  * As I mentioned earlier, in case of remote, ditributed
and virtual filesystems, the change might come from
within the filesystem itself (if the protocol supports
reporting such changes). This can probably be
implemented only with superblock-scoped watches because
the change is fundamentally not related to any mount.
  * Some filesystems might also support change journalling
and it might be concievable to extend the API in the
future to report "past" events (for example by passing
sequence number of last seen event or similar).
  * The argument about containers escaping change notification
you mentioned earlier.

All those factors speak greatly in favour of superblock
watches.

(2) Tracking filesystem *activity*. Now you are not building
an image of current filesystem state but rather a log of
what happened. Perhaps you are also interested in who
(user/process/...) did what. Permission events also fit
mostly in this category.

For those it *might* make sense to have mount-scoped
watches, for example if you want to monitor only one
container or a subset of processes.

We both concentrate on the first but we shouldn't forget about
the second, which was one of the original motivations for
fanotify.

Thus I conclude that it might be desirable to implement
mount-scoped filename events in the long run. Even though
I agree that the sb-scoped events are more important because
they cover more use cases and you can do additional filtering
(e.g. by pid) if deemed necessary.

This would require:

(a) Sprinkling the callers of vfs_* with fanotify calls
as I did, or
(b) Creating wrapper functions like vfs_path_unlink & co.
that would make the necessary fanotify call (and probably
tell the lower function not to generate another
notification), as I suggested earlier.
(c) Give the vfs_* functions an *optional* vfsmount argument.

In the end I probably find (c) the most elegant but this
can be discussed later, even after your changes are merged.

Filip


Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR

2017-03-14 Thread Filip Štědronský
Hi,

On Tue, Mar 14, 2017 at 03:55:19PM +0200, Amir Goldstein wrote:
> Please let me know if that is sufficient for your needs
> or if you need me to prepare a version that delivers filename events
> without filename info, therefore allowing to merge filename events.
> 
> Sounds to me like if you get an event FAN_DELETE, "aaa",
> your implementation can update the db directly without having
> to scan the directory, so it should be useful.
> For your consideration.

both approaches might be useful and there are tradeoffs. Direct updates
save us from rescanning but give more events and more context switches.

On the other hand, with an unlimited queue and well-mergable events I
could e.g. sleep for five seconds each time after emptying the queue,
thus giving the events more potential to merge and reducing context
switches.

But one risks blowing up the queue. However, I have some ideas.
Basically we could implement some kind of "bulk read" mode for fanotify
that would pass events to userspace only when
(a) a given event count thresold is hit (e.g. half the queue limit
if queue is limited), or
(b) the oldest event is older than a specified maximum latency. That
might vary from 1 second to 5 minutes depending on specific
application needs (e.g. background backup does not need to be
low-latency).

This would greatly reduce extra context switches when watching
a large portion (or whole) of the file system. All of this presumably
could be implemented on top of your suggested patches.

Either way, I suggest that implementing the nameless filename events
is a good idea. I do not know whether they will be the best choice
for my application but they probably will be useful for some
applications.

Filip


Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR

2017-03-14 Thread Filip Štědronský
Hi,

On Tue, Mar 14, 2017 at 03:55:19PM +0200, Amir Goldstein wrote:
> Please let me know if that is sufficient for your needs
> or if you need me to prepare a version that delivers filename events
> without filename info, therefore allowing to merge filename events.
> 
> Sounds to me like if you get an event FAN_DELETE, "aaa",
> your implementation can update the db directly without having
> to scan the directory, so it should be useful.
> For your consideration.

both approaches might be useful and there are tradeoffs. Direct updates
save us from rescanning but give more events and more context switches.

On the other hand, with an unlimited queue and well-mergable events I
could e.g. sleep for five seconds each time after emptying the queue,
thus giving the events more potential to merge and reducing context
switches.

But one risks blowing up the queue. However, I have some ideas.
Basically we could implement some kind of "bulk read" mode for fanotify
that would pass events to userspace only when
(a) a given event count thresold is hit (e.g. half the queue limit
if queue is limited), or
(b) the oldest event is older than a specified maximum latency. That
might vary from 1 second to 5 minutes depending on specific
application needs (e.g. background backup does not need to be
low-latency).

This would greatly reduce extra context switches when watching
a large portion (or whole) of the file system. All of this presumably
could be implemented on top of your suggested patches.

Either way, I suggest that implementing the nameless filename events
is a good idea. I do not know whether they will be the best choice
for my application but they probably will be useful for some
applications.

Filip


Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR

2017-03-14 Thread Filip Štědronský
Hi,

On Tue, Mar 14, 2017 at 12:40:56PM +0200, Amir Goldstein wrote:
> An I am very happy that you used filehandles to keep track of objects
> in your example, because it fits my proposal really well.

you can read more about the rationales in the WIP draft of my thesis
(especially the chapter Identifying Inodes and its surroundings):

http://regnarg.cz/tmp/thesis.pdf
https://github.com/regnarg/bc

Either way, one can never use pathnames as identifiers because in
a world with parallel renames on multiple levels of the hierarchy
(between which no ordering is guaranteed unless they are cross-dir),
pathnames are not even a well-defined concept.

NB: I used a simple script for testing that does precisely this
(a lot of parallel within-directory renames at many levels):

https://github.com/regnarg/bc/blob/master/experiments/fanotify/parallel_renamist.sh
It seems like a good stress test for any application that tries to
do reliable filesystem monitoring.

> See, if you used my proposed API, you would have had
> 
>   fan_fd = CHK(fanotify_init(FAN_UNLIMITED_QUEUE| \
>  FAN_EVENT_INFO_PARENT | FAN_EVENT_INFO_FH,
>  O_RDONLY));

Sure, I'll definitely try to implement your interface as an
optional backend and mention it in my thesis and definitely
give it some heavy testing.

> Furthermore, my proposal records the filehandle at the time of the event
> and therefore, does not have to keep an elevated refcount of the object
> in the events queue.

That sounds good but from what I've understood,  not all
filesystems support handles. Is this a concern? (Maybe the
right answer is to extend handle support...)

Filip


Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR

2017-03-14 Thread Filip Štědronský
Hi,

On Tue, Mar 14, 2017 at 12:40:56PM +0200, Amir Goldstein wrote:
> An I am very happy that you used filehandles to keep track of objects
> in your example, because it fits my proposal really well.

you can read more about the rationales in the WIP draft of my thesis
(especially the chapter Identifying Inodes and its surroundings):

http://regnarg.cz/tmp/thesis.pdf
https://github.com/regnarg/bc

Either way, one can never use pathnames as identifiers because in
a world with parallel renames on multiple levels of the hierarchy
(between which no ordering is guaranteed unless they are cross-dir),
pathnames are not even a well-defined concept.

NB: I used a simple script for testing that does precisely this
(a lot of parallel within-directory renames at many levels):

https://github.com/regnarg/bc/blob/master/experiments/fanotify/parallel_renamist.sh
It seems like a good stress test for any application that tries to
do reliable filesystem monitoring.

> See, if you used my proposed API, you would have had
> 
>   fan_fd = CHK(fanotify_init(FAN_UNLIMITED_QUEUE| \
>  FAN_EVENT_INFO_PARENT | FAN_EVENT_INFO_FH,
>  O_RDONLY));

Sure, I'll definitely try to implement your interface as an
optional backend and mention it in my thesis and definitely
give it some heavy testing.

> Furthermore, my proposal records the filehandle at the time of the event
> and therefore, does not have to keep an elevated refcount of the object
> in the events queue.

That sounds good but from what I've understood,  not all
filesystems support handles. Is this a concern? (Maybe the
right answer is to extend handle support...)

Filip


Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR

2017-03-14 Thread Filip Štědronský
Hi,

On Tue, Mar 14, 2017 at 12:11:40PM +0200, Amir Goldstein wrote:
> >   - file system indexers / desktop search tools
> >   - file synchronization tools (like Dropbox, Nextcloud, etc.),
> > online backup tools
> 
> This last one is the use case of my employer, Ctera Networks.
> Out of curiosity, what is the use case that you are focusing on?

I'm working on a file synchronization tool as part of my bachelor
thesis at Charles University.

When I started (now over a year ago ... long story), there were AFIK no
attempts at solving the recursive inotify issue, only a lot of
complaints, so I cobbled up together something simple (I'm not a kernel
developer by trade, this was my first patch) that would allow me to
work on the userspace parts, which are the main bulk.

I try to focus on algorithmic and implementation efficiency as opposed
to fancy GUIs and similar. I want it to be fast with 100k directories
and 10M files in home dir. But it's very WIP so we'll see how that turns
out.

> I had the feeling that all recursive inotify users are hiding in the shadows,
> but was missing more concrete evidence.

Yes, even Dropbox uses inotify. They have articles in their help on how
to increase inotify.max_user_watches: https://www.dropbox.com/help/145.
That's not good PR. ;-) (I'm not affiliated with DB, just pointing that
out.)

> About the argument of not having to change in-kernel framework,
> I don't think it should be a consideration at all.

Understood. I tried to stay conservative and non-controversial because I
imagined that radical framework changes would take months of discussions
(look at the history of statx) and this issue seems to be pressing for
quite a lot of people.  But rushing is of course not the best strategy
either, there are always compromises.

> If you don't specify FAN_EVENT_INFO_NAME, you can get filename events
> FAN_MOVE|FAN_CREATE|FAN_DELETE without the name.
> 
> What you do get is the file descriptor of the parent. sounds familiar? ;-)

I didn't notice this bit. That sounds like a win-win.

Filip


Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR

2017-03-14 Thread Filip Štědronský
Hi,

On Tue, Mar 14, 2017 at 12:11:40PM +0200, Amir Goldstein wrote:
> >   - file system indexers / desktop search tools
> >   - file synchronization tools (like Dropbox, Nextcloud, etc.),
> > online backup tools
> 
> This last one is the use case of my employer, Ctera Networks.
> Out of curiosity, what is the use case that you are focusing on?

I'm working on a file synchronization tool as part of my bachelor
thesis at Charles University.

When I started (now over a year ago ... long story), there were AFIK no
attempts at solving the recursive inotify issue, only a lot of
complaints, so I cobbled up together something simple (I'm not a kernel
developer by trade, this was my first patch) that would allow me to
work on the userspace parts, which are the main bulk.

I try to focus on algorithmic and implementation efficiency as opposed
to fancy GUIs and similar. I want it to be fast with 100k directories
and 10M files in home dir. But it's very WIP so we'll see how that turns
out.

> I had the feeling that all recursive inotify users are hiding in the shadows,
> but was missing more concrete evidence.

Yes, even Dropbox uses inotify. They have articles in their help on how
to increase inotify.max_user_watches: https://www.dropbox.com/help/145.
That's not good PR. ;-) (I'm not affiliated with DB, just pointing that
out.)

> About the argument of not having to change in-kernel framework,
> I don't think it should be a consideration at all.

Understood. I tried to stay conservative and non-controversial because I
imagined that radical framework changes would take months of discussions
(look at the history of statx) and this issue seems to be pressing for
quite a lot of people.  But rushing is of course not the best strategy
either, there are always compromises.

> If you don't specify FAN_EVENT_INFO_NAME, you can get filename events
> FAN_MOVE|FAN_CREATE|FAN_DELETE without the name.
> 
> What you do get is the file descriptor of the parent. sounds familiar? ;-)

I didn't notice this bit. That sounds like a win-win.

Filip


Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR

2017-03-13 Thread Filip Štědronský
An example userspace program that uses FAN_MODIFY_DIR to reliably keep
an up-to-date internal representation of the file system. It uses some
filehandle trickery to identify inodes, other heuristics could be also
used.

---

//#define _GNU_SOURCE

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#include 
#include 
#include 
using namespace std;

#ifndef FAN_MODIFY_DIR
#define FAN_MODIFY_DIR 0x0004
#endif

// die-on-error helpers
#define CHK(x) ({ __typeof__(x) r = x; if (r == -1) { perror(#x); abort(); } r; 
})
#define CHKN(x) ({ __typeof__(x) r = x; if (r == NULL) { perror(#x); abort(); } 
r; })
struct inode_info;
struct dentry_info;

struct inode_info {
ino_t ino;
mode_t mode;
char handle[MAX_HANDLE_SZ];
set links;
map children; // for directory inodes
};

struct dentry_info {
struct inode_info *parent, *inode;
string name;
};


map inodes;

int root_fd;
int fan_fd;

bool compare_handles(const void *h1, const void *h2) {
const struct file_handle *fh1 = (const struct file_handle*) h1;
const struct file_handle *fh2 = (const struct file_handle*) h2;
return (fh1->handle_bytes == fh2->handle_bytes
&& memcmp(h1, h2, fh1->handle_bytes) == 0);
}

bool handle_valid(void *handle) {
int check_fd = open_by_handle_at(root_fd, (struct file_handle*)handle, 
O_PATH);
if (check_fd >= 0) {
CHK(close(check_fd));
return true;
} else if (errno == ESTALE) {
return false;
} else {
perror("open_by_handle_at");
exit(1);
}
}

// Get the path corresponding to an inode (one of its paths, in the presence of
// hardlinks).
void inode_path(const struct inode_info *inode, char *buf, size_t bufsiz) {
list components;
while (true) {
if (inode->links.empty()) break;
struct dentry_info *dentry = *inode->links.begin();
components.push_front(dentry->name);
inode = dentry->parent;
}
buf[0] = '\0';
for (auto name: components) {
int len = snprintf(buf, bufsiz, "/%s", name.c_str());
buf += len;
bufsiz -= len;
}
}


void delete_dentry(struct dentry_info *dentry) {
assert(dentry->parent->children[dentry->name] == dentry);

char path_buf[4096];
inode_path(dentry->parent, path_buf, sizeof(path_buf));
printf("unlinked %s/%s (ino %lu, parent %lu)\n", path_buf, 
dentry->name.c_str(),
   dentry->inode->ino, dentry->parent->ino);

dentry->parent->children.erase(dentry->name.c_str());
dentry->inode->links.erase(dentry);
// TODO: If this was the last dentry pointing to an inode, schedule removing
//   the inode after a timeout (we cannot remove it immediately because
//   the zero-link situation might occur during a rename when the source
//   directory has been processed but the target directory hasn't).
delete dentry;
}

struct dentry_info *add_dentry(struct inode_info *parent, const char *name,
struct inode_info *child) {
struct dentry_info *dentry = new dentry_info();
dentry->parent = parent;
dentry->name = name;
dentry->inode = child;
parent->children[name] = dentry;
child->links.insert(dentry);

char path_buf[4096] = "\0";
inode_path(parent, path_buf, sizeof(path_buf));
printf("linked %s/%s (ino %lu, parent %lu)\n", path_buf, name, child->ino, 
parent->ino);

return dentry;
}

void delete_inode(struct inode_info *inode) {
for (auto dentry: inode->links) {
delete_dentry(dentry);
}
delete inode;
}

// Given a file descriptor, find the corresponding inode object in our database,
// or create a new one if it does not exist. An O_PATH fd suffices.
struct inode_info *find_inode(int fd) {
struct stat st;
CHK(fstat(fd, ));
char handle[sizeof(struct file_handle) + MAX_HANDLE_SZ];
struct file_handle *fh = (struct file_handle*)handle;
fh->handle_bytes = sizeof(handle);
int mntid;
CHK(name_to_handle_at(fd, "", (struct file_handle*)handle, ,
AT_EMPTY_PATH));

struct inode_info *info = inodes[st.st_ino];
if (info) {
// Handles can refer to the same file despite not being equal.
// If the old handle can still be opened, we can be assured
// that the inode number has not been recycled.
if (compare_handles(handle, info->handle) || 
handle_valid(info->handle)) {
return info;
} else {
delete_inode(info);
info = NULL;
}
}

inodes[st.st_ino] = info = new inode_info();
info->ino = st.st_ino;
info->mode = st.st_mode;
memcpy(info->handle, handle, fh->handle_bytes);
return info;
}

// Scan directory and update internal filesystem representation accordingly.
// Closes `dirfd`.
void scan(int dirfd, bool recursive) {
struct inode_info *dir = 

Re: [RFC 1/2] fanotify: new event FAN_MODIFY_DIR

2017-03-13 Thread Filip Štědronský
An example userspace program that uses FAN_MODIFY_DIR to reliably keep
an up-to-date internal representation of the file system. It uses some
filehandle trickery to identify inodes, other heuristics could be also
used.

---

//#define _GNU_SOURCE

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#include 
#include 
#include 
using namespace std;

#ifndef FAN_MODIFY_DIR
#define FAN_MODIFY_DIR 0x0004
#endif

// die-on-error helpers
#define CHK(x) ({ __typeof__(x) r = x; if (r == -1) { perror(#x); abort(); } r; 
})
#define CHKN(x) ({ __typeof__(x) r = x; if (r == NULL) { perror(#x); abort(); } 
r; })
struct inode_info;
struct dentry_info;

struct inode_info {
ino_t ino;
mode_t mode;
char handle[MAX_HANDLE_SZ];
set links;
map children; // for directory inodes
};

struct dentry_info {
struct inode_info *parent, *inode;
string name;
};


map inodes;

int root_fd;
int fan_fd;

bool compare_handles(const void *h1, const void *h2) {
const struct file_handle *fh1 = (const struct file_handle*) h1;
const struct file_handle *fh2 = (const struct file_handle*) h2;
return (fh1->handle_bytes == fh2->handle_bytes
&& memcmp(h1, h2, fh1->handle_bytes) == 0);
}

bool handle_valid(void *handle) {
int check_fd = open_by_handle_at(root_fd, (struct file_handle*)handle, 
O_PATH);
if (check_fd >= 0) {
CHK(close(check_fd));
return true;
} else if (errno == ESTALE) {
return false;
} else {
perror("open_by_handle_at");
exit(1);
}
}

// Get the path corresponding to an inode (one of its paths, in the presence of
// hardlinks).
void inode_path(const struct inode_info *inode, char *buf, size_t bufsiz) {
list components;
while (true) {
if (inode->links.empty()) break;
struct dentry_info *dentry = *inode->links.begin();
components.push_front(dentry->name);
inode = dentry->parent;
}
buf[0] = '\0';
for (auto name: components) {
int len = snprintf(buf, bufsiz, "/%s", name.c_str());
buf += len;
bufsiz -= len;
}
}


void delete_dentry(struct dentry_info *dentry) {
assert(dentry->parent->children[dentry->name] == dentry);

char path_buf[4096];
inode_path(dentry->parent, path_buf, sizeof(path_buf));
printf("unlinked %s/%s (ino %lu, parent %lu)\n", path_buf, 
dentry->name.c_str(),
   dentry->inode->ino, dentry->parent->ino);

dentry->parent->children.erase(dentry->name.c_str());
dentry->inode->links.erase(dentry);
// TODO: If this was the last dentry pointing to an inode, schedule removing
//   the inode after a timeout (we cannot remove it immediately because
//   the zero-link situation might occur during a rename when the source
//   directory has been processed but the target directory hasn't).
delete dentry;
}

struct dentry_info *add_dentry(struct inode_info *parent, const char *name,
struct inode_info *child) {
struct dentry_info *dentry = new dentry_info();
dentry->parent = parent;
dentry->name = name;
dentry->inode = child;
parent->children[name] = dentry;
child->links.insert(dentry);

char path_buf[4096] = "\0";
inode_path(parent, path_buf, sizeof(path_buf));
printf("linked %s/%s (ino %lu, parent %lu)\n", path_buf, name, child->ino, 
parent->ino);

return dentry;
}

void delete_inode(struct inode_info *inode) {
for (auto dentry: inode->links) {
delete_dentry(dentry);
}
delete inode;
}

// Given a file descriptor, find the corresponding inode object in our database,
// or create a new one if it does not exist. An O_PATH fd suffices.
struct inode_info *find_inode(int fd) {
struct stat st;
CHK(fstat(fd, ));
char handle[sizeof(struct file_handle) + MAX_HANDLE_SZ];
struct file_handle *fh = (struct file_handle*)handle;
fh->handle_bytes = sizeof(handle);
int mntid;
CHK(name_to_handle_at(fd, "", (struct file_handle*)handle, ,
AT_EMPTY_PATH));

struct inode_info *info = inodes[st.st_ino];
if (info) {
// Handles can refer to the same file despite not being equal.
// If the old handle can still be opened, we can be assured
// that the inode number has not been recycled.
if (compare_handles(handle, info->handle) || 
handle_valid(info->handle)) {
return info;
} else {
delete_inode(info);
info = NULL;
}
}

inodes[st.st_ino] = info = new inode_info();
info->ino = st.st_ino;
info->mode = st.st_mode;
memcpy(info->handle, handle, fh->handle_bytes);
return info;
}

// Scan directory and update internal filesystem representation accordingly.
// Closes `dirfd`.
void scan(int dirfd, bool recursive) {
struct inode_info *dir = find_inode(dirfd);

char path_buf[4096] = "\0";
 

[RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes

2017-03-13 Thread Filip Štědronský
Besause fanotify requires `struct path`, the event cannot be generated
directly in `fsnotify_move` and friends because they only get the inode
(and their callers, `vfs_rename` cannot supply any better info).
So instead it needs to be generated higher in the call chain, i.e. in
the callers of functions like `vfs_rename`.

This leads to some code duplication. Currently, there are several places
whence functions like `vfs_rename` or `vfs_unlink` are called:

  * syscall handlers (done)
  * NFS server (done)
  * stacked filesystems
  - ecryptfs (done)
  - overlayfs
(Currently doesn't report even ordinary fanotify events, because
 it internally clones the upper mount; not sure about the
 rationale.  One can always watch the overlay mount instead.)
  * few rather minor things
  - devtmpfs
(its internal changes are not tied to any vfsmount so it cannot
 emit mount-scoped events)
  - cachefiles (done)
  - ipc/mqueue.c (done)
  - fs/nfsd/nfs4recover.c (done)
  - kernel/bpf/inode.c (done)
net/unix/af_unix.c (done)

(grep -rE 
'\bvfs_(rename|unlink|mknod|whiteout|create|mkdir|rmdir|symlink|link)\(')

Signed-off-by: Filip Štědronský <r.l...@regnarg.cz>

---

An alternative might be to create wrapper functions like
vfs_path_(rename|unlink|...). They could also take care of calling
security_path_(rename|unlink|...), which is currently also up to
the indvidual callers (possibly with a flag because it might not
be always desired).
---
 fs/cachefiles/namei.c |  9 +++
 fs/ecryptfs/inode.c   | 67 +++
 fs/namei.c| 23 +-
 fs/nfsd/nfs4recover.c |  7 ++
 fs/nfsd/vfs.c | 24 --
 ipc/mqueue.c  |  9 +++
 kernel/bpf/inode.c|  3 +++
 net/unix/af_unix.c|  2 ++
 8 files changed, 141 insertions(+), 3 deletions(-)

diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
index 41df8a27d7eb..8c86699424d1 100644
--- a/fs/cachefiles/namei.c
+++ b/fs/cachefiles/namei.c
@@ -313,6 +313,8 @@ static int cachefiles_bury_object(struct cachefiles_cache 
*cache,
cachefiles_io_error(cache, "Unlink security error");
} else {
ret = vfs_unlink(d_inode(dir), rep, NULL);
+   if (ret == 0)
+   fsnotify_modify_dir();
 
if (preemptive)
cachefiles_mark_object_buried(cache, rep, why);
@@ -418,6 +420,10 @@ static int cachefiles_bury_object(struct cachefiles_cache 
*cache,
if (ret != 0 && ret != -ENOMEM)
cachefiles_io_error(cache,
"Rename failed with error %d", ret);
+   if (ret == 0) {
+   fsnotify_modify_dir();
+   fsnotify_modify_dir(_to_graveyard);
+   }
 
if (preemptive)
cachefiles_mark_object_buried(cache, rep, why);
@@ -560,6 +566,7 @@ int cachefiles_walk_to_object(struct cachefiles_object 
*parent,
cachefiles_hist(cachefiles_mkdir_histogram, start);
if (ret < 0)
goto create_error;
+   fsnotify_modify_dir();
 
ASSERT(d_backing_inode(next));
 
@@ -589,6 +596,7 @@ int cachefiles_walk_to_object(struct cachefiles_object 
*parent,
cachefiles_hist(cachefiles_create_histogram, start);
if (ret < 0)
goto create_error;
+   fsnotify_modify_dir();
 
ASSERT(d_backing_inode(next));
 
@@ -779,6 +787,7 @@ struct dentry *cachefiles_get_directory(struct 
cachefiles_cache *cache,
ret = vfs_mkdir(d_inode(dir), subdir, 0700);
if (ret < 0)
goto mkdir_error;
+   fsnotify_modify_dir();
 
ASSERT(d_backing_inode(subdir));
 
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index e7413f82d27b..88a41b270bcc 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -29,6 +29,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -144,16 +146,22 @@ static int ecryptfs_do_unlink(struct inode *dir, struct 
dentry *dentry,
 {
struct dentry *lower_dentry = ecryptfs_dentry_to_lower(dentry);
struct inode *lower_dir_inode = ecryptfs_inode_to_lower(dir);
+   struct vfsmount *lower_mnt = ecryptfs_dentry_to_lower_mnt(dentry);
+   struct path lower_dir_path = {lower_mnt, NULL};
struct dentry *lower_dir_dentry;
int rc;
 
dget(lower_dentry);
lower_dir_dentry = lock_parent(lower_dentry);
+   lower_dir_path.dentry = lower_dir_dentry;
rc = vfs_unlink(lower_dir

[RFC 2/2] fanotify: emit FAN_MODIFY_DIR on filesystem changes

2017-03-13 Thread Filip Štědronský
Besause fanotify requires `struct path`, the event cannot be generated
directly in `fsnotify_move` and friends because they only get the inode
(and their callers, `vfs_rename` cannot supply any better info).
So instead it needs to be generated higher in the call chain, i.e. in
the callers of functions like `vfs_rename`.

This leads to some code duplication. Currently, there are several places
whence functions like `vfs_rename` or `vfs_unlink` are called:

  * syscall handlers (done)
  * NFS server (done)
  * stacked filesystems
  - ecryptfs (done)
  - overlayfs
(Currently doesn't report even ordinary fanotify events, because
 it internally clones the upper mount; not sure about the
 rationale.  One can always watch the overlay mount instead.)
  * few rather minor things
  - devtmpfs
(its internal changes are not tied to any vfsmount so it cannot
 emit mount-scoped events)
  - cachefiles (done)
  - ipc/mqueue.c (done)
  - fs/nfsd/nfs4recover.c (done)
  - kernel/bpf/inode.c (done)
net/unix/af_unix.c (done)

(grep -rE 
'\bvfs_(rename|unlink|mknod|whiteout|create|mkdir|rmdir|symlink|link)\(')

Signed-off-by: Filip Štědronský 

---

An alternative might be to create wrapper functions like
vfs_path_(rename|unlink|...). They could also take care of calling
security_path_(rename|unlink|...), which is currently also up to
the indvidual callers (possibly with a flag because it might not
be always desired).
---
 fs/cachefiles/namei.c |  9 +++
 fs/ecryptfs/inode.c   | 67 +++
 fs/namei.c| 23 +-
 fs/nfsd/nfs4recover.c |  7 ++
 fs/nfsd/vfs.c | 24 --
 ipc/mqueue.c  |  9 +++
 kernel/bpf/inode.c|  3 +++
 net/unix/af_unix.c|  2 ++
 8 files changed, 141 insertions(+), 3 deletions(-)

diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
index 41df8a27d7eb..8c86699424d1 100644
--- a/fs/cachefiles/namei.c
+++ b/fs/cachefiles/namei.c
@@ -313,6 +313,8 @@ static int cachefiles_bury_object(struct cachefiles_cache 
*cache,
cachefiles_io_error(cache, "Unlink security error");
} else {
ret = vfs_unlink(d_inode(dir), rep, NULL);
+   if (ret == 0)
+   fsnotify_modify_dir();
 
if (preemptive)
cachefiles_mark_object_buried(cache, rep, why);
@@ -418,6 +420,10 @@ static int cachefiles_bury_object(struct cachefiles_cache 
*cache,
if (ret != 0 && ret != -ENOMEM)
cachefiles_io_error(cache,
"Rename failed with error %d", ret);
+   if (ret == 0) {
+   fsnotify_modify_dir();
+   fsnotify_modify_dir(_to_graveyard);
+   }
 
if (preemptive)
cachefiles_mark_object_buried(cache, rep, why);
@@ -560,6 +566,7 @@ int cachefiles_walk_to_object(struct cachefiles_object 
*parent,
cachefiles_hist(cachefiles_mkdir_histogram, start);
if (ret < 0)
goto create_error;
+   fsnotify_modify_dir();
 
ASSERT(d_backing_inode(next));
 
@@ -589,6 +596,7 @@ int cachefiles_walk_to_object(struct cachefiles_object 
*parent,
cachefiles_hist(cachefiles_create_histogram, start);
if (ret < 0)
goto create_error;
+   fsnotify_modify_dir();
 
ASSERT(d_backing_inode(next));
 
@@ -779,6 +787,7 @@ struct dentry *cachefiles_get_directory(struct 
cachefiles_cache *cache,
ret = vfs_mkdir(d_inode(dir), subdir, 0700);
if (ret < 0)
goto mkdir_error;
+   fsnotify_modify_dir();
 
ASSERT(d_backing_inode(subdir));
 
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index e7413f82d27b..88a41b270bcc 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -29,6 +29,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -144,16 +146,22 @@ static int ecryptfs_do_unlink(struct inode *dir, struct 
dentry *dentry,
 {
struct dentry *lower_dentry = ecryptfs_dentry_to_lower(dentry);
struct inode *lower_dir_inode = ecryptfs_inode_to_lower(dir);
+   struct vfsmount *lower_mnt = ecryptfs_dentry_to_lower_mnt(dentry);
+   struct path lower_dir_path = {lower_mnt, NULL};
struct dentry *lower_dir_dentry;
int rc;
 
dget(lower_dentry);
lower_dir_dentry = lock_parent(lower_dentry);
+   lower_dir_path.dentry = lower_dir_dentry;
rc = vfs_unlink(lower_dir_inode

[RFC 1/2] fanotify: new event FAN_MODIFY_DIR

2017-03-13 Thread Filip Štědronský
Fanotify currently does not report directory modification events
(rename, unlink, etc.). But these events are essential for about half of
concievable fanotify use cases, especially:

  - file system indexers / desktop search tools
  - file synchronization tools (like Dropbox, Nextcloud, etc.),
online backup tools

and pretty much any app that needs to maintain and up-to-date internal
representation of current contents of the file system.

All applications of the above classes that I'm aware of currently use
recursive inotify watches, which do not scale (my home dir has ~200k
directories, creating all the watches takes ~2min and eats several tens
of MB of kernel memory).

There have been many calls for such a feature, pretty much since the
creation of fanotify in 2009:
  * By GNOME developers:
https://wiki.gnome.org/BastienNocera/KernelWishlist#VFS.2C_filesystems
  * By KDE developers:
http://lkml.kernel.org/r/201211011352.42476.Martin%40lichtvoll.de
'Better support for (desktop) file search / indexing applications'
  * And others:

http://lkml.kernel.org/r/AANLkTi=owK=WZW4oNtpm5WpAZhqCQUdTR2K5gzJ_MqZ+%40mail.gmail.com
'Fanotify mv/rename?'
http://lkml.kernel.org/r/1238158043.23703.20.camel%40fatty
'Issues with using fanotify for a filesystem indexer'

Add a new fanotify event, FAN_MODIFY_DIR, that is emitted whenever the
contents of a directory change (a directory entry is added, removed or
renamed). This covers all the currently missing events: rename, unlink,
mknod, mkdir, rmdir, etc.

Included with the event is a file descriptor to the modified directory
but no further info. The userspace application is expected to rescan the
whole directory and update its model accordingly. This needs to be done
carefully to prevent race conditions. A cross-directory rename generates
two FAN_MODIFY_DIR events.

This minimalistic approach has several advantages:
  * Does not require changes to the fanotify userspace API or the
fsnotify in-kernel framework, apart from adding a new event.
Especially doesn't complicate it by adding string fields.
  * Has simple and clear semantics, even with multiple renames occurring
in parallel etc. In case of any inconsistencies, one can simply wait
for a while and rescan again.
  * FAN_MODIFY_DIR events are easily merged in case of multiple
operations on a directory (e.g. deleting all files).

Signed-off-by: Filip Štědronský <r.l...@regnarg.cz>

---

An alternative might be to create wrapper functions like
vfs_path_(rename|unlink|...). They could also take care of calling
security_path_(rename|unlink|...), which is currently also up to
the indvidual callers (possibly with a flag because it might not
be always desired).

An alternative was proposed by Amir Goldstein in several long series of
patches that add superblock-scoped (as opposed to vfsmount-scoped)
fanotify watches and specific dentry manipulation events with filenames:


http://lkml.kernel.org/r/1481984434-13293-1-git-send-email-amir73il%40gmail.com

http://lkml.kernel.org/r/1482247207-4424-1-git-send-email-amir73il%40gmail.com

http://lkml.kernel.org/r/1476126784-12520-1-git-send-email-amir73il%40gmail.com

http://lkml.kernel.org/r/1489411223-12081-1-git-send-email-amir73il%40gmail.com

There is large but not complete overlap between that proposal and
mine (which is was originally created over a year ago, before Amir's
work, but never posted).

I think the superblock watch idea is very interesting because it might
in the future allow reporing fsnotify events from remote and virtual
filesystems. So I'm posting this more as a potential source of more
ideas for discussion, or a fallback proposal in case Amir's patches
don't make it.
---
 fs/notify/fanotify/fanotify.c|  1 +
 include/linux/fsnotify.h | 17 +
 include/linux/fsnotify_backend.h |  1 +
 include/uapi/linux/fanotify.h|  5 -
 4 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index bbc175d4213d..5178b06c338c 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -186,6 +186,7 @@ static int fanotify_handle_event(struct fsnotify_group 
*group,
 
BUILD_BUG_ON(FAN_ACCESS != FS_ACCESS);
BUILD_BUG_ON(FAN_MODIFY != FS_MODIFY);
+   BUILD_BUG_ON(FAN_MODIFY_DIR != FS_MODIFY_DIR);
BUILD_BUG_ON(FAN_CLOSE_NOWRITE != FS_CLOSE_NOWRITE);
BUILD_BUG_ON(FAN_CLOSE_WRITE != FS_CLOSE_WRITE);
BUILD_BUG_ON(FAN_OPEN != FS_OPEN);
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index b43d3f5bd9ea..00fb87c975d6 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -208,6 +208,23 @@ static inline void fsnotify_modify(struct file *file)
 }
 
 /*
+ * fsnotify_modifydir - directory contents were changed
+ * (as a result of rename, creat, unlink, etc.)
+ */
+static inline void fsnotify_modify_dir(struct path

[RFC 1/2] fanotify: new event FAN_MODIFY_DIR

2017-03-13 Thread Filip Štědronský
Fanotify currently does not report directory modification events
(rename, unlink, etc.). But these events are essential for about half of
concievable fanotify use cases, especially:

  - file system indexers / desktop search tools
  - file synchronization tools (like Dropbox, Nextcloud, etc.),
online backup tools

and pretty much any app that needs to maintain and up-to-date internal
representation of current contents of the file system.

All applications of the above classes that I'm aware of currently use
recursive inotify watches, which do not scale (my home dir has ~200k
directories, creating all the watches takes ~2min and eats several tens
of MB of kernel memory).

There have been many calls for such a feature, pretty much since the
creation of fanotify in 2009:
  * By GNOME developers:
https://wiki.gnome.org/BastienNocera/KernelWishlist#VFS.2C_filesystems
  * By KDE developers:
http://lkml.kernel.org/r/201211011352.42476.Martin%40lichtvoll.de
'Better support for (desktop) file search / indexing applications'
  * And others:

http://lkml.kernel.org/r/AANLkTi=owK=WZW4oNtpm5WpAZhqCQUdTR2K5gzJ_MqZ+%40mail.gmail.com
'Fanotify mv/rename?'
http://lkml.kernel.org/r/1238158043.23703.20.camel%40fatty
'Issues with using fanotify for a filesystem indexer'

Add a new fanotify event, FAN_MODIFY_DIR, that is emitted whenever the
contents of a directory change (a directory entry is added, removed or
renamed). This covers all the currently missing events: rename, unlink,
mknod, mkdir, rmdir, etc.

Included with the event is a file descriptor to the modified directory
but no further info. The userspace application is expected to rescan the
whole directory and update its model accordingly. This needs to be done
carefully to prevent race conditions. A cross-directory rename generates
two FAN_MODIFY_DIR events.

This minimalistic approach has several advantages:
  * Does not require changes to the fanotify userspace API or the
fsnotify in-kernel framework, apart from adding a new event.
Especially doesn't complicate it by adding string fields.
  * Has simple and clear semantics, even with multiple renames occurring
in parallel etc. In case of any inconsistencies, one can simply wait
for a while and rescan again.
  * FAN_MODIFY_DIR events are easily merged in case of multiple
operations on a directory (e.g. deleting all files).

Signed-off-by: Filip Štědronský 

---

An alternative might be to create wrapper functions like
vfs_path_(rename|unlink|...). They could also take care of calling
security_path_(rename|unlink|...), which is currently also up to
the indvidual callers (possibly with a flag because it might not
be always desired).

An alternative was proposed by Amir Goldstein in several long series of
patches that add superblock-scoped (as opposed to vfsmount-scoped)
fanotify watches and specific dentry manipulation events with filenames:


http://lkml.kernel.org/r/1481984434-13293-1-git-send-email-amir73il%40gmail.com

http://lkml.kernel.org/r/1482247207-4424-1-git-send-email-amir73il%40gmail.com

http://lkml.kernel.org/r/1476126784-12520-1-git-send-email-amir73il%40gmail.com

http://lkml.kernel.org/r/1489411223-12081-1-git-send-email-amir73il%40gmail.com

There is large but not complete overlap between that proposal and
mine (which is was originally created over a year ago, before Amir's
work, but never posted).

I think the superblock watch idea is very interesting because it might
in the future allow reporing fsnotify events from remote and virtual
filesystems. So I'm posting this more as a potential source of more
ideas for discussion, or a fallback proposal in case Amir's patches
don't make it.
---
 fs/notify/fanotify/fanotify.c|  1 +
 include/linux/fsnotify.h | 17 +
 include/linux/fsnotify_backend.h |  1 +
 include/uapi/linux/fanotify.h|  5 -
 4 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index bbc175d4213d..5178b06c338c 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -186,6 +186,7 @@ static int fanotify_handle_event(struct fsnotify_group 
*group,
 
BUILD_BUG_ON(FAN_ACCESS != FS_ACCESS);
BUILD_BUG_ON(FAN_MODIFY != FS_MODIFY);
+   BUILD_BUG_ON(FAN_MODIFY_DIR != FS_MODIFY_DIR);
BUILD_BUG_ON(FAN_CLOSE_NOWRITE != FS_CLOSE_NOWRITE);
BUILD_BUG_ON(FAN_CLOSE_WRITE != FS_CLOSE_WRITE);
BUILD_BUG_ON(FAN_OPEN != FS_OPEN);
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index b43d3f5bd9ea..00fb87c975d6 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -208,6 +208,23 @@ static inline void fsnotify_modify(struct file *file)
 }
 
 /*
+ * fsnotify_modifydir - directory contents were changed
+ * (as a result of rename, creat, unlink, etc.)
+ */
+static inline void fsnotify_modify_dir(struct path *path)
+{
+   struct inode