Re: [PATCH 0/3] new feature: monitoring page cache events

2016-08-01 Thread Dave Hansen
On 07/30/2016 10:31 AM, George Amvrosiadis wrote:
> Dave, I can produce a patch that adds the extra two tracepoints and exports
> all four tracepoint symbols. This would be a short patch that would just
> extend existing tracing functionality. What do you think?

Adding those tracepoints is probably useful.  It's probably something we
need to have anyway as long as they don't cause too much code bloat or a
noticeable performance impact when they're off.

As for exporting symbols, that's not done until something is merged.


Re: [PATCH 0/3] new feature: monitoring page cache events

2016-07-30 Thread George Amvrosiadis
On Fri, Jul 29, 2016 at 08:33:34AM -0700, Dave Hansen wrote:
> What's to stop you from using tracing to gather and transport data out
> of the kernel and then aggregate and present it to apps in an "elegant"
> way of your choosing?
> 
> I don't think it's really even worth having an in-depth discussion of
> how to modify duet.  I can't imagine that this would get merged as-is,
> or even anything resembling the current design.  If you want to see
> duet-like functionality in the kernel, I think it needs to be integrated
> better and enhance or take advantage of existing mechanisms.
> 
> You've identified a real problem and a real solution, and it is in an
> area where Linux is weak (monitoring the page cache).  If you are really
> interested in seeing a solution that folks can use, I think you need to
> find some way to leverage existing kernel functionality (ftrace,
> fanotify, netlink, etc...), or come up with a much more compelling story
> about why you can't use them.

I took a few measurements of the ftrace overhead, and when limited to the page
cache functions we're interested in, it is very reasonable. Duet does depend
on exporting some data with each event, though, and tracepoints seem to be
the most efficient way to do that. Two issues remain, however:

(a) There are no tracepoints for page dirtying and flushing. Those would have
to be added at the same place as the Duet hooks I submitted (unwrapping the
page-flags.h macros) to catch those cases where pages are locked and the dirty
bit is set manually.
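For (a), a hedged sketch of what a page-dirtying tracepoint could look like, modeled on the existing mm_filemap_* events in include/trace/events/filemap.h; the event name and exact fields here are illustrative, not merged code:

```c
/* Hypothetical tracepoint for page dirtying -- a sketch only.
 * Follows the shape of mm_filemap_add_to_page_cache. */
TRACE_EVENT(mm_filemap_dirty_page,

	TP_PROTO(struct page *page),

	TP_ARGS(page),

	TP_STRUCT__entry(
		__field(unsigned long, pfn)
		__field(unsigned long, i_ino)
		__field(unsigned long, index)
		__field(dev_t, s_dev)
	),

	TP_fast_assign(
		__entry->pfn = page_to_pfn(page);
		__entry->i_ino = page->mapping->host->i_ino;
		__entry->index = page->index;
		__entry->s_dev = page->mapping->host->i_sb->s_dev;
	),

	TP_printk("dev %d:%d ino %lx pfn=%lu ofs=%lu",
		MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
		__entry->i_ino, __entry->pfn,
		__entry->index << PAGE_SHIFT)
);
```

A twin event for flushing would carry the same fields, and both would be emitted from the unwrapped page-flags.h paths mentioned above.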

(b) The page cache tracepoints are currently not exported symbols. If I can
export those four tracepoints for page addition, removal, dirtying, and
flushing, then the rest of the work (exporting the information to userspace)
can be carried out within a module. In the future, once we reach a point of
maturity where we are confident about the stability of the exporting interface
and performance, we could engage in another conversation about potentially
mainlining some of that code.
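For (b), the export plus the module side might look roughly as follows; EXPORT_TRACEPOINT_SYMBOL and register_trace_*/unregister_trace_* are the existing tracepoint APIs, while the probe and helper names are hypothetical:

```c
/* In mm/filemap.c (sketch): make the tracepoints usable from modules. */
EXPORT_TRACEPOINT_SYMBOL(mm_filemap_add_to_page_cache);
EXPORT_TRACEPOINT_SYMBOL(mm_filemap_delete_from_page_cache);

/* In the out-of-tree Duet module: attach a probe to an exported
 * tracepoint. The probe body is a placeholder. */
static void probe_page_added(void *ignore, struct page *page)
{
	/* record an ADDED event for page->mapping->host, page->index */
}

static int __init duet_module_init(void)
{
	return register_trace_mm_filemap_add_to_page_cache(probe_page_added,
							   NULL);
}

static void __exit duet_module_exit(void)
{
	unregister_trace_mm_filemap_add_to_page_cache(probe_page_added, NULL);
	tracepoint_synchronize_unregister();
}
```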

Dave, I can produce a patch that adds the extra two tracepoints and exports
all four tracepoint symbols. This would be a short patch that would just
extend existing tracing functionality. What do you think?



Re: [PATCH 0/3] new feature: monitoring page cache events

2016-07-29 Thread Dave Hansen
On 07/28/2016 08:47 PM, George Amvrosiadis wrote:
> On Thu, Jul 28, 2016 at 02:02:45PM -0700, Dave Hansen wrote:
>> On 07/25/2016 08:47 PM, George Amvrosiadis wrote:
>>>  21 files changed, 2424 insertions(+), 1 deletion(-)
>>
>> I like the idea, but yikes, that's a lot of code.
>>
>> Have you considered using or augmenting the kernel's existing tracing
>> mechanisms?  Have you considered using something like netlink for
>> transporting the data out of the kernel?
> 
> We contemplated a couple of other solutions. One was extending existing debugging
> mechanisms. E.g., there are already tracepoints at __add_to_page_cache_locked()
> and __delete_from_page_cache(). A consistent reaction I got for doing that,
> however, was that of exposing an unruly interface to user applications, which
> is certainly not an elegant solution.

What's to stop you from using tracing to gather and transport data out
of the kernel and then aggregate and present it to apps in an "elegant"
way of your choosing?

>> The PageDirty() hooks look simple but turn out to be horribly deep.
>> Where we used to have a plain old bit set, we now have new locks,
>> potentially long periods of irq disabling, and loops over all the tasks
>> doing duet, even path lookup!
> 
> It's true that the hooks are deep, but they are fully exercised only once per
> inode, per task. The reasoning behind the 'struct bmap_rbnode'->seen bitmap is
> to remember whether an inode was seen before by a given task. During that first
> access is when we do the path lookup to decide whether this inode is relevant
> to the task and mark 'struct bmap_rbnode'->relv accordingly. If it is not, we
> ignore future events from it. Tasks can also use the 'struct bmap_rbnode'->done
> bitmap to indicate that they are done with a specific inode, which also stops
> those events from passing that task loop.

OK, but it still disables interrupts and takes a spinlock for each
bitmap it checks.  That spinlock becomes essentially a global lock in a
hot path, which can't be good.

>> Given a big system, I would imagine these locks slowing down
>> SetPageDirty() and things like write() pretty severely.  Have you done
>> an assessment of the performance impact of this change?   I can't
>> imagine this being used in any kind of performance or
>> scalability-sensitive environment.
> 
> I have used filebench to saturate an HDD and an SSD, registered a task to be
> notified about every file in the filesystem, and measured no difference in I/O
> throughput. To measure the CPU utilization of Duet, I tried an extreme case
> where I booted using only one core and again saturated an HDD using filebench.
> There was a 1-1.5% increase in CPU utilization. There is a description of this
> result in the paper. I have also tuned filebench to hit the cache often in my
> experiments (more than 60-70% of accesses going to less than 10% of the data),
> but the results were similar. For the Hadoop and Spark experiments we used a
> 24-node cluster and these overhead numbers didn't seem to affect performance.

I'd say testing with _more_ cores is important, not less.  How about
trying to watch a single file, then have one process per core writing to
a 1MB file in a loop.  What does that do?  Or, heck, just compile a
kernel on a modern 2-socket system.

In any case, I still can't see the current duet _model_ ever working
out, much less the implementation posted here.

>> The current tracing code has a model where the trace producers put data
>> in *one* place, then all the multiple consumers pull it out of that
>> place.  Duet seems to have the model that the producer puts the data in
>> multiple places and consumers consume it from their own private copies.
>> That seems a bit backwards and puts cost directly into hot code paths.
>> Even a single task watching a single file on the system makes everyone
>> go in and pay some of this cost for every SetPageDirty().
> 
> Duet operates in a similar way. There is one large global hash table to avoid
> collisions, so that on average a single lookup is sufficient to place a page in
> it. Due to its global nature, if a page is of interest to multiple tasks, only
> one entry is used to hold the events for that page across all tasks. And to
> avoid walking that hash table for relevant events on a read(), each task
> maintains a separate bitmap of the hash table's buckets that tells it which
> buckets to look into. (In the past I've also tried a work queue approach on the
> hot code path, but the overhead was almost double as a result of allocating the
> work queue items.)

I don't think Duet operates in a similar way.  Asserting that it does
makes me wary that you've understood and actually considered how tracing
works.

Duet takes global locks in hot paths.  The tracing code doesn't do that.
It uses percpu buffers that don't require shared locks when adding
records.  It makes the reader of the data do all the hard work.

>> Tasks seem to have a fixed 

Re: [PATCH 0/3] new feature: monitoring page cache events

2016-07-28 Thread George Amvrosiadis
On Thu, Jul 28, 2016 at 02:02:45PM -0700, Dave Hansen wrote:
> On 07/25/2016 08:47 PM, George Amvrosiadis wrote:
> >  21 files changed, 2424 insertions(+), 1 deletion(-)
> 
> I like the idea, but yikes, that's a lot of code.
> 
> Have you considered using or augmenting the kernel's existing tracing
> mechanisms?  Have you considered using something like netlink for
> transporting the data out of the kernel?
>

We contemplated a couple of other solutions. One was extending existing debugging
mechanisms. E.g., there are already tracepoints at __add_to_page_cache_locked()
and __delete_from_page_cache(). A consistent reaction I got for doing that,
however, was that of exposing an unruly interface to user applications, which
is certainly not an elegant solution. After that experience, and after reading up
on issues with the auditing subsystem (https://lwn.net/Articles/600568/), I decided
against going further down that path.

> The PageDirty() hooks look simple but turn out to be horribly deep.
> Where we used to have a plain old bit set, we now have new locks,
> potentially long periods of irq disabling, and loops over all the tasks
> doing duet, even path lookup!
> 

It's true that the hooks are deep, but they are fully exercised only once per
inode, per task. The reasoning behind the 'struct bmap_rbnode'->seen bitmap is
to remember whether an inode was seen before by a given task. During that first
access is when we do the path lookup to decide whether this inode is relevant
to the task and mark 'struct bmap_rbnode'->relv accordingly. If it is not, we
ignore future events from it. Tasks can also use the 'struct bmap_rbnode'->done
bitmap to indicate that they are done with a specific inode, which also stops
those events from passing that task loop.

> Given a big system, I would imagine these locks slowing down
> SetPageDirty() and things like write() pretty severely.  Have you done
> an assessment of the performance impact of this change?   I can't
> imagine this being used in any kind of performance or
> scalability-sensitive environment.
> 

I have used filebench to saturate an HDD and an SSD, registered a task to be
notified about every file in the filesystem, and measured no difference in I/O
throughput. To measure the CPU utilization of Duet, I tried an extreme case
where I booted using only one core and again saturated an HDD using filebench.
There was a 1-1.5% increase in CPU utilization. There is a description of this
result in the paper. I have also tuned filebench to hit the cache often in my
experiments (more than 60-70% of accesses going to less than 10% of the data),
but the results were similar. For the Hadoop and Spark experiments we used a
24-node cluster and these overhead numbers didn't seem to affect performance.

> The current tracing code has a model where the trace producers put data
> in *one* place, then all the multiple consumers pull it out of that
> place.  Duet seems to have the model that the producer puts the data in
> multiple places and consumers consume it from their own private copies.
> That seems a bit backwards and puts cost directly into hot code paths.
> Even a single task watching a single file on the system makes everyone
> go in and pay some of this cost for every SetPageDirty().
> 

Duet operates in a similar way. There is one large global hash table to avoid
collisions, so that on average a single lookup is sufficient to place a page in
it. Due to its global nature, if a page is of interest to multiple tasks, only
one entry is used to hold the events for that page across all tasks. And to
avoid walking that hash table for relevant events on a read(), each task
maintains a separate bitmap of the hash table's buckets that tells it which
buckets to look into. (In the past I've also tried a work queue approach on the
hot code path, but the overhead was almost double as a result of allocating the
work queue items.)

Having said all the above, Dave, I've seen your work at the 2013 Linux Plumbers
Conference on scalability issues, so if you think I'm missing something in my
replies, please call me out on that. I'm definitely open to improving this code.

> Let's say we had a big system with virtually everything sitting in the
> page cache.  Does duet have a way to find things currently _in_ the
> cache, or only when things move in/out of it?
> 

At task registration time we grab the superblock for the filesystem of the
registered path, and then scan_page_cache() traverses the list of inodes
currently in memory. We enqueue ADDED and DIRTY events for relevant inodes
as needed.
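Roughly, such a scan can be pictured as a walk of the superblock's in-memory inode list; duet_enqueue_events() is a hypothetical helper, and the locking follows the usual s_inodes walk in kernels of this era:

```c
/* Sketch only: enumerate cached inodes at registration time. */
static void scan_page_cache(struct super_block *sb)
{
	struct inode *inode;

	spin_lock(&sb->s_inode_list_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		if (inode->i_mapping->nrpages)
			duet_enqueue_events(inode); /* ADDED/DIRTY as needed */
	}
	spin_unlock(&sb->s_inode_list_lock);
}
```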

> Tasks seem to have a fixed 'struct path' ->regpath at duet_task_init()
> time.  The code goes page->mapping->inode->i_dentry and then tries to
> compare that with the originally recorded path.  Does this even work in
> the face of things like bind mounts, mounts that change after
> duet_task_init(), or mounting a fs with a different superblock
> underneath a watched path?  It seems awfully fragile.

This 

Re: [PATCH 0/3] new feature: monitoring page cache events

2016-07-28 Thread Dave Hansen
On 07/25/2016 08:47 PM, George Amvrosiadis wrote:
>  21 files changed, 2424 insertions(+), 1 deletion(-)

I like the idea, but yikes, that's a lot of code.

Have you considered using or augmenting the kernel's existing tracing
mechanisms?  Have you considered using something like netlink for
transporting the data out of the kernel?

The PageDirty() hooks look simple but turn out to be horribly deep.
Where we used to have a plain old bit set, we now have new locks,
potentially long periods of irq disabling, and loops over all the tasks
doing duet, even path lookup!

Given a big system, I would imagine these locks slowing down
SetPageDirty() and things like write() pretty severely.  Have you done
an assessment of the performance impact of this change?   I can't
imagine this being used in any kind of performance or
scalability-sensitive environment.

The current tracing code has a model where the trace producers put data
in *one* place, then all the multiple consumers pull it out of that
place.  Duet seems to have the model that the producer puts the data in
multiple places and consumers consume it from their own private copies.
That seems a bit backwards and puts cost directly into hot code paths.
Even a single task watching a single file on the system makes everyone
go in and pay some of this cost for every SetPageDirty().

Let's say we had a big system with virtually everything sitting in the
page cache.  Does duet have a way to find things currently _in_ the
cache, or only when things move in/out of it?

Tasks seem to have a fixed 'struct path' ->regpath at duet_task_init()
time.  The code goes page->mapping->inode->i_dentry and then tries to
compare that with the originally recorded path.  Does this even work in
the face of things like bind mounts, mounts that change after
duet_task_init(), or mounting a fs with a different superblock
underneath a watched path?  It seems awfully fragile.


[PATCH 0/3] new feature: monitoring page cache events

2016-07-25 Thread George Amvrosiadis
I'm attaching a patch set implementing a mechanism we call Duet, which allows
applications to monitor events at the page cache level: page additions,
removals, dirtying, and flushing. Using such events, applications can identify
and prioritize processing of cached data, thereby reducing their I/O footprint.

One class of users for these events is maintenance tasks that scan large amounts
of data (e.g., backup, defrag, scrubbing). Knowing what is currently cached allows
them to piggyback on each other and on other applications running in the system. I've
managed to run up to 3 such applications together (backup, scrubbing, defrag)
and have them finish their work with 1/3rd of the I/O by using Duet. In this
case, the task that traversed the data the fastest (scrubber) allowed the rest
of the tasks to piggyback on the data brought into the cache. I.e., a file that
was read to be backed up was also picked up by the scrubber and defrag process.

I've found adapting applications to be straightforward. Although I don't
include examples in this patch set, I've adapted btrfs scrubbing, btrfs send
(backup), btrfs defrag, rsync, and f2fs garbage collection in a few hundred
lines of code each (basically just had to add an event handler and wire it up
to the task's processing loop). You can read more about this in our full paper:
http://dl.acm.org/citation.cfm?id=2815424. I'd be happy to generate subsequent
patch sets for individual tasks if there's interest in this one. We've also
used Duet to speed up Hadoop and Spark by up to 54%, depending on the overlap
in the data processed, by taking into account the cache residency of HDFS
blocks across the cluster when scheduling tasks:
https://www.usenix.org/conference/hotstorage16/workshop-program/presentation/deslauriers


Syscall interface (and how it works): Duet uses hooks into the page cache (see
the "mm: support for duet hooks" patch). These hooks inform Duet of page events,
which are stored in a hash table. Only events that are of interest to running
tasks are stored, and only one copy of each event is stored for all interested
tasks. To register for events, the following syscalls are used (see the
"mm/duet: syscall wiring" patch for prototypes):

- sys_duet_init(char *taskname, u32 regmask, char *path): returns an fd that
  reports events that occur under PATH (e.g. '/home') and match the event types
  in REGMASK (e.g. DUET_PAGE_ADDED | DUET_PAGE_REMOVED). TASKNAME is an optional,
  human-readable name for the task.

- sys_duet_bmap(u16 flags, struct duet_uuid_arg *uuid): Duet allows applications
  to track processed items on an internal bitmap (which improves performance by
  being used to filter unnecessary events). The specified UUID is what read()
  returns on the fd created with sys_duet_init(), and uniquely identifies a
  file. FLAGS allow the bitmap to be set, reset, or have its state checked.

- sys_duet_get_path(struct duet_uuid_arg *uuid, char *buf, int bufsize):
  Applications reason about pathnames, not UUIDs. This syscall traverses the
  dentry cache and returns in BUF the path corresponding to UUID.

- sys_duet_status(u16 flags, struct duet_status_args *arg): Currently, the Duet
  framework is turned on/off manually. At that point the admin can specify the
  maximum number of applications that may register concurrently, which lets us
  size the internal hash table nodes appropriately (and bound the performance
  and memory overhead). The syscall is also used for debugging purposes. I think
  this functionality should probably be exposed through ioctl()s to a device,
  and I'm open to suggestions on how to improve the current implementation.

The framework itself (a bit less than 2300 LoC) is currently placed under
mm/duet and the code is included in the "mm/duet: framework code" patch.


Application interface: Applications interface with Duet through a user library,
which is available at https://github.com/gamvrosi/duet-tools. In the same repo,
I have included a dummy_task application which provides an example of how Duet
can be used.


Changelog: The patches are based on Linus' v4.7 tag, and touch on the following
parts of the kernel:

- mm/filemap.c and include/linux/page-flags.h: hooks in the page cache to track
  page events on page addition, removal, dirtying, and flushing.

- arch/x86/*, include/linux/syscalls.h, kernel/sys_ni.c: wiring the 4 syscalls

- mm/duet/*: framework code



George Amvrosiadis (3):
  mm: support for duet hooks
  mm/duet: syscall wiring
  mm/duet: framework code

 arch/x86/entry/syscalls/syscall_32.tbl |   4 +
 arch/x86/entry/syscalls/syscall_64.tbl |   4 +
 include/linux/duet.h   |  43 +++
 include/linux/page-flags.h |  53 +++
 include/linux/syscalls.h   |   8 +
 include/uapi/asm-generic/unistd.h  |  12 +-
 init/Kconfig   |   2 +
 kernel/sys_ni.c|   6 +
 mm/Makefile|   1 +
 mm/duet/Kconfig   
